IDNNever.txt

Kenneth Whistler kenw at sybase.com
Tue Mar 6 02:43:16 CET 2007


Michel said:

> Except for 3007 and 30FB, afaik, all the other characters mentioned
> below are either mapped out or normalized out. It does not make much
> sense to have allowed characters in a list which are removed by idna
> nameprep.  I don't think anybody is arguing about allowing compatibility
> characters. Entries from the th and pl registries are just mistakes.

And in particular:

> >th-thai.html        U+002e

Everybody recognizes that that one is a mistake and should be removed.

> >th-thai.html        U+0e33

That is the precomposed SARA AM. It is ruled out by the 
NFKC(cp) != cp criterion.

It doesn't prevent people from registering names using a SARA AM for
Thai -- they would just have to spell it <U+0E4D, U+0E32> in
Unicode, instead of <U+0E33>. The nonequivalence would, however,
be something that user agents should be aware of.
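
For anyone who wants to check the criterion themselves, here is a minimal
sketch in Python using the standard unicodedata module. The helper name
stable_under_nfkc is just something made up for illustration, and the
result reflects whatever Unicode version ships with the interpreter.

```python
import unicodedata

def stable_under_nfkc(cp: int) -> bool:
    """Return True if NFKC(cp) == cp, i.e. the code point survives NFKC unchanged."""
    ch = chr(cp)
    return unicodedata.normalize("NFKC", ch) == ch

# U+0E33 THAI CHARACTER SARA AM has a compatibility decomposition,
# so it fails the NFKC(cp) == cp test.
sara_am = 0x0E33
print(stable_under_nfkc(sara_am))

# Show what a user would have to register instead.
replacement = unicodedata.normalize("NFKC", chr(sara_am))
print([f"U+{ord(c):04X}" for c in replacement])
```

The second print lists the sequence <U+0E4D, U+0E32> mentioned above.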

> >pl-greek.html       U+0390
> >pl-greek.html       U+03b0

For these two, while no one doubts that these are used in
modern Greek orthography, the problem here is casefolding
stability:

0390; F; 03B9 0308 0301; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS

03B0; F; 03C5 0308 0301; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS

In other words, when you casefold these, they expand to the base
character plus diacritic sequences, instead of staying unchanged.
(The reason for that has to do with the nonexistence of uppercase
versions of these.)

In this instance, because the sequences <U+03B9, U+0308, U+0301>
and <U+03C5, U+0308, U+0301> would *re*compose under NFKC, you
have countervailing stabilities: The precomposed forms are
stable under NFKC normalization but not under casefolding,
while the decomposed forms are stable under casefolding but
not under NFKC normalization. In that situation it is best to
simply omit the characters from the permitted repertoire and
reckon with the fact that this is one aspect of modern Greek
orthography that is not carried into Internet labels (just
as apostrophes aren't for English or French).
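
The countervailing stabilities can be seen with a short Python sketch.
It assumes that str.casefold() applies the full case folding (the "F"
lines quoted above) and that unicodedata.normalize() gives NFKC for the
Unicode version bundled with the interpreter; the function name
stability is only for illustration.

```python
import unicodedata

def stability(s: str) -> dict:
    """Report whether s is unchanged by NFKC and by full case folding."""
    return {
        "nfkc_stable": unicodedata.normalize("NFKC", s) == s,
        "casefold_stable": s.casefold() == s,
    }

precomposed = "\u0390"              # IOTA WITH DIALYTIKA AND TONOS
decomposed = "\u03b9\u0308\u0301"   # iota + combining diaeresis + combining acute

# Precomposed form: stable under NFKC, not under casefolding.
print(stability(precomposed))

# Decomposed form: stable under casefolding, not under NFKC.
print(stability(decomposed))

# Casefolding expands the precomposed form; NFKC recomposes it again.
print(precomposed.casefold() == decomposed)
print(unicodedata.normalize("NFKC", decomposed) == precomposed)
```

Neither form is a fixed point of both operations, which is the reason
for leaving U+0390 and U+03B0 out. The same check applies to U+03B0
with <U+03C5, U+0308, U+0301>.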
 
> I had mentioned most of the mistakes below a while ago to the IANA/ICANN
> staff, but apparently it is the responsibility of the original
> submitters to do something about it.
> 
> Concerning 3007 and 30FB, they are already both in use in Japanese IDN
> names according to JPRS, so although they both have issues from a
> confusability standpoint, it could be problematic to remove them. So they
> are for sure not good candidates for IDNNever.txt content.

Nor are they included in IDNNever.txt now.

The question, rather, is whether they should be added, exceptionally,
to IDNPermitted.txt.

The reason why U+3007 IDEOGRAPHIC ZERO was omitted is that it
has General_Category=Nl, whereas the digits that were admitted
all have General_Category=Nd. But I agree that both because
of preestablished usage in Japan and because of the parallel
with other ideographs, U+3007 should be in the permitted list.
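
The General_Category difference is easy to confirm. This is only a
sketch with Python's unicodedata module, and the helper name
is_decimal_digit is made up for the example; it is not the actual rule
used to build the permitted list.

```python
import unicodedata

def is_decimal_digit(cp: int) -> bool:
    """True if the code point has General_Category=Nd."""
    return unicodedata.category(chr(cp)) == "Nd"

# ASCII and Thai digits are Nd, so a filter keyed on Nd admits them.
print(is_decimal_digit(0x0035), is_decimal_digit(0x0E55))

# U+3007 IDEOGRAPHIC ZERO is a letter number (Nl), so the same filter drops it.
print(unicodedata.category("\u3007"), is_decimal_digit(0x3007))

# An ordinary ideograph such as U+4E00 is Lo, for comparison.
print(unicodedata.category("\u4e00"))
```

So U+3007 falls between the two groups: it is not Nd like the other
digits and not Lo like the other ideographs, which is why it would
have to be listed exceptionally.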

U+30FB KATAKANA MIDDLE DOT is more questionable. But if we
are going to allow U+00B7 MIDDLE DOT for Catalan (and other
languages), and if U+30FB is already used in Japan in
IDNs, then I don't see omitting just U+30FB. So it should
probably be another exceptional entry in IDNPermitted.txt.
As Michel points out, you get another obvious confusable,
but it would be straightforward enough to test for, if necessary.
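
As a rough idea of what such a test could look like, here is a sketch
in Python. The set MIDDLE_DOT_LOOKALIKES and the function
find_middle_dots are hypothetical names, and the set holds only the two
characters discussed here; it is an assumption for illustration, not
taken from any official confusables data.

```python
# Hypothetical lookalike set: just the two middle dots discussed above.
MIDDLE_DOT_LOOKALIKES = {0x00B7, 0x30FB}

def find_middle_dots(label: str) -> list:
    """Return (index, 'U+XXXX') for each middle-dot lookalike in the label."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(label)
        if ord(ch) in MIDDLE_DOT_LOOKALIKES
    ]

# A Catalan-style label with U+00B7 and a Japanese-style label with U+30FB.
print(find_middle_dots("col\u00b7lecci\u00f3"))
print(find_middle_dots("\u30a2\u30fb\u30a4"))

# A label with neither character yields an empty list.
print(find_middle_dots("example"))
```

A registry or user agent could then decide what to do with a flagged
label, for example refuse a label whose only difference from an
existing one is which middle dot it uses.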

Adding those two is easy enough to do, if people agree.

--Ken


