UTC Agenda Item: IDNA proposal
patrik at frobbit.se
Wed Nov 22 13:52:45 CET 2006
I have recreated the tables using a new algorithm (based mostly on
input from Kenneth).
(1) Use the scripts.txt file for the script definitions, do not use
the blocks definitions
(2) Remove codepoints where cp != NFKC(cp)
(3) Remove codepoints where cp != lowercase(cp)
(4) Remove codepoints where class(cp) != "Ll"
(5) Include codepoints that are part of US-ASCII (0-9, A-Z and a-z)
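
As a rough illustration of steps (2) to (5), here is a minimal Python sketch. It is not the script used to generate the tables. It assumes Python's built-in unicodedata module, whose Unicode version will differ from the data available in 2006, so the output will not match the tables exactly. Step (1), grouping by script from scripts.txt, is left out. The names ASCII_ALLOWED, is_included, build_table and the allowed_classes parameter are hypothetical.

```python
import unicodedata

# Step (5): US-ASCII digits and letters are always included.
ASCII_ALLOWED = set(map(ord,
    "0123456789"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"))


def is_included(cp, allowed_classes=("Ll",)):
    """Return True if codepoint cp survives steps (2) to (5).

    allowed_classes is a hypothetical knob: ("Ll",) follows step (4)
    as written, ("Ll", "Lo") is the alternative that also accepts Lo.
    """
    if cp in ASCII_ALLOWED:
        return True                                   # step (5)
    ch = chr(cp)
    if unicodedata.normalize("NFKC", ch) != ch:
        return False                                  # step (2)
    if ch.lower() != ch:
        return False                                  # step (3)
    return unicodedata.category(ch) in allowed_classes  # step (4)


def build_table(first=0x0000, last=0xFFFF, allowed_classes=("Ll",)):
    """List the included codepoints in the range first..last."""
    return [cp for cp in range(first, last + 1)
            if is_included(cp, allowed_classes)]


if __name__ == "__main__":
    table = build_table()
    print(len(table), "codepoints included for U+0000 - U+FFFF")
```

Calling build_table(allowed_classes=("Ll", "Lo")) gives the variant where step 4 accepts both Ll and Lo. Either way, a codepoint of class Lm such as U+02BB is excluded.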
The result of doing this for U+0000 - U+FFFF can be found as
If I instead, in step 4, accept things of both class Ll and Lo, then the
result can be found as
Please let me know what you think.
I have this comment regarding one entry from class Lm:
>> | Exclude | U+02BB | U+02BB | Lm | MODIFIER LETTER TURNED COMMA |
>> | Exclude | U+02BC | U+02BC | Lm | MODIFIER LETTER APOSTROPHE |
> As ASCII isn't directly encodable using Punycode, one of these
> needs to be allowed for Pacific languages, which use the
> apostrophe, e.g. Hawaiʻi. It is often ignored, but in languages like
> Tongan it can make a difference.
I have not taken this into account when creating these tables.