UTC Agenda Item: IDNA proposal

Patrik Fältström patrik at frobbit.se
Wed Nov 22 13:52:45 CET 2006

I have recreated the tables using a new algorithm (based mostly on
input from Kenneth).

(1) Use the Scripts.txt file for the script definitions; do not use
the block definitions

(2) Remove codepoints where cp != NFKC(cp)

(3) Remove codepoints where cp != lowercase(cp)

(4) Remove codepoints where class(cp) != "Ll"

(5) Include codepoints that are part of US-ASCII (0-9, A-Z and a-z)
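
Steps (2) through (5) above could be sketched roughly as follows. This is
only an illustration, not the code used to produce the tables: the names
ASCII_ALNUM, keep_codepoint and build_table are made up for this example,
step (1) is left out because Python's standard library does not expose
the Scripts.txt data, and the result depends on the Unicode version that
ships with the Python interpreter rather than the version the tables were
built from.

```python
import unicodedata

# Step 5: US-ASCII digits and letters are always included (assumed to
# override the removal rules in steps 2-4, e.g. for A-Z).
ASCII_ALNUM = (
    set(range(0x30, 0x3A)) | set(range(0x41, 0x5B)) | set(range(0x61, 0x7B))
)


def keep_codepoint(cp, allowed_classes=("Ll",)):
    """Hypothetical helper: True if code point cp survives steps 2-5."""
    if cp in ASCII_ALNUM:                        # step 5
        return True
    ch = chr(cp)
    if unicodedata.normalize("NFKC", ch) != ch:  # step 2: cp != NFKC(cp)
        return False
    if ch.lower() != ch:                         # step 3: cp != lowercase(cp)
        return False
    # step 4: keep only the allowed general categories
    return unicodedata.category(ch) in allowed_classes


def build_table(allowed_classes=("Ll",), first=0x0000, last=0xFFFF):
    """Hypothetical helper: list of surviving code points in [first, last]."""
    return [
        cp for cp in range(first, last + 1)
        if keep_codepoint(cp, allowed_classes)
    ]


if __name__ == "__main__":
    strict = build_table(("Ll",))
    relaxed = build_table(("Ll", "Lo"))
    print(len(strict), len(relaxed))
```

Passing ("Ll", "Lo") as allowed_classes corresponds to the variant of
step (4) that accepts both classes.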

The result of doing this for U+0000 - U+FFFF can be found as


If I instead accept both class Ll and class Lo in step 4, then the
result can be found as


Please let me know what you think.

I have this comment regarding one entry from class Lm:

>>  | Exclude  | U+02BB | U+02BB | Lm    | MODIFIER LETTER TURNED  
>> COMMA |
>>  | Exclude  | U+02BC | U+02BC | Lm    | MODIFIER LETTER  
> As ASCII isn't directly encodable using Punycode, one of these is
> going to be needed to be allowed for Pacific languages, which use the
> apostrophe. eg, Hawaiʻi. It is often ignored, but in languages like
> Tongan it can make a difference.

I have not taken this into account when creating these tables.
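
As a small illustration of why these two code points fall out: the check
below (a sketch only, with a made-up helper name describe) reports the
general category of each and whether it is unchanged by NFKC and by
lowercasing. With Python's bundled Unicode data both come out as class
Lm, so step (4) removes them whether it accepts only Ll or both Ll and
Lo.

```python
import unicodedata


def describe(cp):
    """Hypothetical helper: (category, stable under NFKC, stable under lower)."""
    ch = chr(cp)
    return (
        unicodedata.category(ch),
        unicodedata.normalize("NFKC", ch) == ch,
        ch.lower() == ch,
    )


if __name__ == "__main__":
    for cp in (0x02BB, 0x02BC):
        print("U+%04X" % cp, describe(cp))
```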

     Regards, Patrik
