Confusability (Re: New version, draft-faltstrom-idnabis-tables-02.txt, available)

Tue Jun 19 01:30:07 CEST 2007

John Klensin said:

> * At least partially on the advice of UTC members, IDNA2003
> excluded invisible characters, such as zero-width ones, and
> other characters that were generally ignored, because they would
> be an opportunity for confusion... confusion to the point that I
> refer to above as "phisher's paradise".   But, because of fairly
> fundamental decisions about presentation made in the compilation
> of Unicode, one cannot sensibly construct a wide range of
> mnemonics based on a number of languages --notably Indic and
> Arabic-based scripts -- so it became important to deal with ZWB
> and ZWNB in some way.

recte: ZWJ and ZWNJ ("joiner" and "non-joiner")

And please note that as the UTC has examined this issue
in more detail, the number of required contexts has been
pared down to a mere handful currently -- all of which
can be described in terms of constrained regular expressions.

In particular, ZWNJ seems only required in Persian (not
other languages using the Arabic script) in some specific
contexts, and then for the Malayalam and Khmer scripts,
also in very specific contexts.

ZWJ seems only required in the Sinhala script.

That is not to say that ZWNJ and ZWJ aren't much more widely
used in the Arabic script and in many Indian scripts for
presentational purposes -- but the few instances above
are the only ones we currently know about where an important
semantic distinction requires a ZWNJ or a ZWJ for the text to be
"spelled" correctly, from the point of view of an end user. The UTC is
engaged in a dialogue with the Government of India now
to determine if there are other specialized contexts of
this sort involving ZWNJ or ZWJ for any of the scripts of India.

See:

http://www.unicode.org/review/pr-96.html

for details.
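
To make the "constrained regular expressions" point concrete, here is a
minimal sketch in Python. The rule shown is an assumption made up for
illustration, not one of the rules under review: it permits ZWJ only
directly after U+0DCA SINHALA SIGN AL-LAKUNA. The names ALLOWED_ZWJ and
zwj_in_context are hypothetical. Rules for ZWNJ would presumably take a
similar shape with different contexts.

```python
import re

ZWJ = "\u200D"  # ZERO WIDTH JOINER

# Hypothetical context rule, for illustration only: a ZWJ is acceptable
# solely when it immediately follows U+0DCA SINHALA SIGN AL-LAKUNA.
ALLOWED_ZWJ = re.compile("(?<=\u0DCA)\u200D")


def zwj_in_context(label: str) -> bool:
    """Return True if every ZWJ in the label sits in the allowed context."""
    allowed = {m.start() for m in ALLOWED_ZWJ.finditer(label)}
    return all(i in allowed for i, ch in enumerate(label) if ch == ZWJ)
```

A label with no ZWJ at all passes trivially; a ZWJ between two ASCII
letters is rejected, since nothing in its context matches the rule.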

> That, in turn, requires a contextual
> rule, which the design of IDNA2003 prohibited.
> 
> * We've got a similar problem with IPA.  The first version of
> the tables document excluded the IPA block entirely.  As Harald
> mentioned, that resulted in two strong criticisms.  One was that
> many of the characters had been adopted into African languages
> and (presumably because there were no extant national or
> international standards that were specific to those languages)
> the IPA characters had to be used if reasonable mnemonics were
> to be constructed based on words of those languages.  The other
> was that we should refrain from writing rules based on character
> blocks, rather than on property lists.

And, not too surprisingly, I think the characters of the IPA block
belong in the table for both of those reasons. From the point of
view of the Unicode Standard, IPA is simply a specialized usage of
a subset of Latin letters. Note that IPA also includes
U+0061 LATIN SMALL LETTER A, U+0062 LATIN SMALL LETTER B,
and so on, for all of lowercase ASCII.

The only reason there is an "IPA block" in Unicode at all
is as a convenience for organizing various extensions to
the Latin script. The IPA block neither contains all of
the IPA characters, nor are the characters in the IPA
block constrained to use only in IPA.
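
A rough way to see this is to look at the character names, sketched
here in Python. The name-prefix test is an assumption standing in for
the Unicode Script property, which the Python standard library does not
expose directly, and is_latin_by_name is a hypothetical helper.

```python
import unicodedata


def is_latin_by_name(ch: str) -> bool:
    """Crude stand-in for a Script=Latin test: check the name prefix."""
    return unicodedata.name(ch, "").startswith("LATIN ")


# ASCII "a", two characters from the IPA block, and Greek alpha for contrast.
for ch in ("a", "\u0261", "\u026A", "\u03B1"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} latin={is_latin_by_name(ch)}")
```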

> So now we have IPA back
> in, which causes problems with IPA characters that are basically
> font variations on basic Latin ones.

The only IPA character that I think arguably falls into
that category is U+0261 LATIN SMALL LETTER SCRIPT G.
(It is actually the open tail form of the "g", rather
than a true script form. Cf. U+210A SCRIPT SMALL G, which
really is the true script form of the letter, and which
no one is arguing should be included, because it really
is a font variant.)

I suppose that there would also be concern about the
small capital letters used in IPA (and other
Latin-based phonetic orthographies), but in writing
systems that use these letters, these are not considered
font variants of the lowercase letters, nor are they
uppercase letters; they are letters in their own right.
And they do not fold to lowercase letters under
casefolding.
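
The casefolding claim can be checked with a short Python sketch. It
assumes that Python's str.casefold() is an adequate proxy for Unicode
full case folding; the SMALL_CAPITALS table is a hand-picked sample,
not a complete list.

```python
# Sample small capitals paired with the ASCII letter they resemble.
SMALL_CAPITALS = {
    "\u0262": "g",  # LATIN LETTER SMALL CAPITAL G
    "\u026A": "i",  # LATIN LETTER SMALL CAPITAL I
    "\u0274": "n",  # LATIN LETTER SMALL CAPITAL N
    "\u0280": "r",  # LATIN LETTER SMALL CAPITAL R
    "\u029F": "l",  # LATIN LETTER SMALL CAPITAL L
}


def folds_to_ascii_lookalike(ch: str) -> bool:
    """Return True if casefolding maps ch onto its ASCII lookalike."""
    return ch.casefold() == SMALL_CAPITALS[ch]


for ch in SMALL_CAPITALS:
    print(f"U+{ord(ch):04X} folds to lookalike: {folds_to_ascii_lookalike(ch)}")
```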

>  It is clear to me (at
> least) that we can't have any font variations in the
> IDN-permitted set and, indeed, that such variations must be
> forever excluded if we are not to have major problems.

In general, I definitely agree with that sentiment. And,
in fact, we (of the UTC) are recommending the wholesale
exclusion of all the letterlike symbols in Unicode that
have font-type compatibility decompositions (see
U+2100..U+214F for many of those). But the overriding
concern is that such characters are unstable under
NFKC(cp) -- a situation which does not apply to the basic
Latin letters used in IPA.
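
As a sketch of that distinction in Python: the helper names below are
hypothetical, and the check covers NFKC only, without the casefolding
step that a fuller stability test would presumably include.

```python
import unicodedata


def stable_under_nfkc(ch: str) -> bool:
    """Return True if NFKC normalization leaves the character unchanged."""
    return unicodedata.normalize("NFKC", ch) == ch


def font_variants(start: int = 0x2100, end: int = 0x214F) -> list[str]:
    """List characters in the range whose decomposition is tagged <font>."""
    return [
        chr(cp)
        for cp in range(start, end + 1)
        if unicodedata.decomposition(chr(cp)).startswith("<font>")
    ]


# U+210A SCRIPT SMALL G is a font variant; U+0261 and U+026A are not.
for ch in ("\u210A", "\u0261", "\u026A"):
    print(f"U+{ord(ch):04X} stable under NFKC: {stable_under_nfkc(ch)}")
```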

>  But that
> means we need to either exclude the IPA block and then permit
> some specific characters, or that we need to include the IPA
> block and then prohibit many specific characters, or that we are
> not going to be able to use existing Unicode properties or
> something derived from them to define the IDN sets.

I'm inclined to agree with Gervase about this. The few
lookalike characters amidst the IPA block of Latin characters
are a mere subset of all the confusability issues among
Latin, Greek, and Cyrillic characters -- and not even a
very prominent or important one at that. I think they
are more profitably addressed by layers concerned with
confusability and restrictions to supported subsets for
particular languages in particular registries, and so on,
rather than spending time trying to track down or
invent some Unicode property-based means of distinguishing,
for example, U+026A LATIN LETTER SMALL CAPITAL I for
the table used by the IDNA protocol per se.

--Ken

> 
> That is not a happy situation in terms of cleanliness of design
> or definitions, but I don't see how to avoid it.