Unicode 7.0.0, (combining) Hamza Above, and normalization
Andrew Sullivan
ajs at anvilwalrusden.com
Wed Aug 6 20:01:30 CEST 2014
I'm not prepared to comment more generally right now, but I want to
ensure we make one clear distinction:
On Wed, Aug 06, 2014 at 05:24:36PM +0000, Shawn Steele wrote:
> That's how I read Mark's statement. The problem statement here
> focuses on 1 character when rnicrosoft.com (in ASCII) can achieve
> similar effect. Or microsoft.unsafe.com or trustme.com/microsoft -
> or just ignoring what the link says, or my bank sending me stuff
> that takes me to a abcprocessing.com site, which trains me that
> legitimate stuff can go to a 3rd party site.
This is _not_ the current problem. The above problems are real, but
they're different.
The current problem we're talking about is one in which "the very same
character" can be produced by a combining sequence and as a precomposed
character, but where the normalization rules for the combining
sequence and the precomposed character don't produce the same result.
It is as if you produced o-diaeresis using U+006F and U+0308, and also
produced it using U+00F6, but when you ran the results through NFC you
didn't get a match. Also, this is not cross-script: it's in the very
same script.
The difference, as I understand Mark's argument, is that in the
present case
1. U+08A1 ARABIC LETTER BEH WITH HAMZA ABOVE
2. U+0628 ARABIC LETTER BEH + U+0654 ARABIC HAMZA ABOVE
(1) and (2) are _not_ "the very same character"; but
A. U+00F6 LATIN SMALL LETTER O WITH DIAERESIS
B. U+006F LATIN SMALL LETTER O + U+0308 COMBINING DIAERESIS
(A) and (B) _are_ "the very same character". So NFC(1) != NFC(2) but
NFC(A) == NFC(B).
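The two comparisons can be checked mechanically. What follows is only a minimal sketch, assuming Python 3 and its standard unicodedata module; the helper name nfc_equal and the variable names are hypothetical, chosen for illustration. It normalizes each pair to NFC and compares the results.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Return True if a and b are identical after NFC normalization.

    Hypothetical helper for illustration; not part of any library.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# (A) vs (B): precomposed o-diaeresis vs. o + combining diaeresis.
latin_match = nfc_equal("\u00F6", "\u006F\u0308")

# (1) vs (2): BEH WITH HAMZA ABOVE vs. BEH + HAMZA ABOVE.
# U+08A1 has no canonical decomposition, so NFC leaves both forms as they are.
arabic_match = nfc_equal("\u08A1", "\u0628\u0654")

print(latin_match, arabic_match)
```

Under those assumptions the Latin pair compares equal after NFC and the Arabic pair does not, which is the asymmetry described above.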
I understand this argument. I'm a little uncomfortable with the
implications for IDNA, however.
Best regards,
A
--
Andrew Sullivan
ajs at anvilwalrusden.com