mappings-01 and the general procedure
Erik van der Poel
erikv at google.com
Sun Jul 12 18:37:33 CEST 2009
In the mappings-01 draft, the "general procedure" is:
1. All characters are mapped using Unicode Normalization Form C
(NFC).
2. Upper case characters are mapped to their lower case equivalents
by using the algorithm for mapping Unicode characters.
3. Full-width and half-width characters (those defined with
Decomposition Types <wide> and <narrow>) are mapped to their
decomposition mappings as shown in the Unicode character
database.
Although mappings-01 clearly states that "an appliction[sp] might want
to implement" mappings that are more compatible with IDNA2003 instead,
I wonder whether implementors will figure out that the order of the
above steps is somewhat different from that of IDNA2003, and that some
strings would be mapped differently.
For example, let's take the following input string:
U+FF45 FULLWIDTH LATIN SMALL LETTER E
U+0301 COMBINING ACUTE ACCENT
The mappings-01 procedure would map this string to the following:
U+0065 LATIN SMALL LETTER E
U+0301 COMBINING ACUTE ACCENT
On the other hand, IDNA2003 would map it to:
U+00E9 LATIN SMALL LETTER E WITH ACUTE
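This is because mappings-01 has normalization as the first step rather
than the last. NFC cannot compose U+FF45 with U+0301, and by the time
step 3 has turned the fullwidth letter into U+0065, normalization has
already run. IDNA2003 (Nameprep) does its mapping first and then applies
NFKC, which both folds the fullwidth letter and composes it with the
accent.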
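
The difference can be reproduced with a rough Python sketch. The function names here are my own, str.lower() stands in for the case mapping of step 2, and idna2003_like() only approximates Nameprep as lowercasing followed by NFKC (it ignores the real mapping tables, prohibited characters and bidi checks), so treat this as an illustration rather than an implementation of either spec:

```python
import unicodedata


def map_wide_narrow(s):
    """Step 3: replace characters whose decomposition type is <wide> or <narrow>."""
    out = []
    for ch in s:
        decomp = unicodedata.decomposition(ch)  # e.g. "<wide> 0065" for U+FF45
        if decomp.startswith(("<wide>", "<narrow>")):
            code_points = decomp.split()[1:]
            out.append("".join(chr(int(cp, 16)) for cp in code_points))
        else:
            out.append(ch)
    return "".join(out)


def mappings01(s):
    """The three steps of the general procedure, in the order the draft lists them."""
    s = unicodedata.normalize("NFC", s)  # step 1
    s = s.lower()                        # step 2 (assumption: str.lower())
    return map_wide_narrow(s)            # step 3


def idna2003_like(s):
    """Approximation of IDNA2003: case mapping first, NFKC last."""
    return unicodedata.normalize("NFKC", s.lower())


def code_points(s):
    return " ".join("U+%04X" % ord(ch) for ch in s)


if __name__ == "__main__":
    sample = "\uff45\u0301"  # FULLWIDTH LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
    print("mappings-01:", code_points(mappings01(sample)))
    print("IDNA2003   :", code_points(idna2003_like(sample)))
```

With the input string above, the first function leaves the accent uncomposed and the second returns the single precomposed character.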
Erik