mappings-01 and the general procedure

Erik van der Poel erikv at google.com
Sun Jul 12 18:37:33 CEST 2009


In the mappings-01 draft, the "general procedure" is:

   1.  All characters are mapped using Unicode Normalization Form C
       (NFC).

   2.  Upper case characters are mapped to their lower case equivalents
       by using the algorithm for mapping Unicode characters.

   3.  Full-width and half-width characters (those defined with
       Decomposition Types <wide> and <narrow>) are mapped to their
       decomposition mappings as shown in the Unicode character
       database.

Although mappings-01 clearly states that "an appliction [sic] might want
to implement" mappings that are more compatible with IDNA2003 instead,
I wonder whether implementors will figure out that the order of the
above steps is somewhat different from that of IDNA2003, and that some
strings would be mapped differently.

For example, let's take the following input string:

U+FF45 FULLWIDTH LATIN SMALL LETTER E
U+0301 COMBINING ACUTE ACCENT

The mappings-01 procedure would map this string to the following:

U+0065 LATIN SMALL LETTER E
U+0301 COMBINING ACUTE ACCENT

On the other hand, IDNA2003 would map it to:

U+00E9 LATIN SMALL LETTER E WITH ACUTE

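This is because mappings-01 has NFC as the first step rather than the last.
At that point U+FF45 has no canonical composition with U+0301, so NFC
leaves the string unchanged, and nothing recomposes the pair after step 3
maps U+FF45 to U+0065. IDNA2003 instead applies NFKC after its other
mappings, which both replaces U+FF45 with U+0065 and composes the result
with U+0301.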
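
The difference can be reproduced with a rough Python sketch. The function
names (map_width, mappings01, idna2003_like) are made up for illustration,
and idna2003_like is only an approximation of Nameprep: it uses Python's
str.casefold() followed by NFKC and leaves out the rest of IDNA2003 (the
"map to nothing" table, prohibited-character checks, and so on). Python's
unicodedata module also uses a newer Unicode version than either document
assumes, which should not matter for this particular string.

```python
import unicodedata


def map_width(ch):
    # Step 3: replace a character by its decomposition mapping, but only
    # when its Decomposition Type is <wide> or <narrow>.
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<wide>") or decomp.startswith("<narrow>"):
        return "".join(chr(int(cp, 16)) for cp in decomp.split()[1:])
    return ch


def mappings01(s):
    # The three steps of the "general procedure", in the order listed.
    s = unicodedata.normalize("NFC", s)        # step 1
    s = s.lower()                              # step 2
    return "".join(map_width(ch) for ch in s)  # step 3


def idna2003_like(s):
    # Assumption: case folding followed by NFKC as a stand-in for the
    # Nameprep mapping and normalization steps.
    return unicodedata.normalize("NFKC", s.casefold())


def codepoints(s):
    return " ".join(f"U+{ord(c):04X}" for c in s)


s = "\uFF45\u0301"  # FULLWIDTH LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
print(codepoints(mappings01(s)))     # U+0065 U+0301
print(codepoints(idna2003_like(s)))  # U+00E9
```

Running NFC once more on the output of mappings01 yields U+00E9 for this
input, which is consistent with the ordering of the steps being the cause.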

Erik

