Unicode 7.0.0, (combining) Hamza Above, and normalization

Wed Aug 6 20:01:30 CEST 2014

I'm not prepared to comment more generally right now, but I want to
ensure we make one clear distinction:

On Wed, Aug 06, 2014 at 05:24:36PM +0000, Shawn Steele wrote:

> That's how I read Mark's statement.  The problem statement here
> focuses on 1 character when rnicrosoft.com (in ASCII) can achieve
> similar effect.  Or microsoft.unsafe.com or trustme.com/microsoft -
> or just ignoring what the link says, or my bank sending me stuff
> that takes me to a abcprocessing.com site, which trains me that
> legitimate stuff can go to a 3rd party site. 

This is _not_ the current problem.  The above problems are real, but
they're different.

The current problem we're talking about is one in which "the very same
character" can be produced by a combining sequence and as a precomposed
character, but where the normalization rules for the combining
sequence and the precomposed character don't produce the same result.
It is as if you produced o-diaeresis using U+006F and U+0308, and also
produced it using U+00F6, but when you ran the results through NFC you
didn't get a match.  Also, this is not cross-script: it's in the very
same script.  

The difference, as I understand Mark's argument, is that in the
present case

   1. U+08A1 ARABIC LETTER BEH WITH HAMZA ABOVE
   2. U+0628 ARABIC LETTER BEH + U+0654 ARABIC HAMZA ABOVE

(1) and (2) are _not_ "the very same character"; but

   A. U+00F6 LATIN SMALL LETTER O WITH DIAERESIS
   B. U+006F LATIN SMALL LETTER O + U+0308 COMBINING DIAERESIS

(A) and (B) _are_ "the very same character".  So NFC(1) != NFC(2) but
NFC(A) == NFC(B).
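
The two comparisons above can be checked mechanically.  What follows is
only a minimal sketch, assuming Python 3 and its standard unicodedata
module; the helper name nfc and the variable names are illustrative
and not part of any specification.  It shows only what NFC does to
these code points, not whether that behaviour is the right one.

```python
import unicodedata


def nfc(s: str) -> str:
    """Return s in Normalization Form C (hypothetical helper name)."""
    return unicodedata.normalize("NFC", s)


# (1) and (2): the Arabic pair discussed above.
precomposed_beh = "\u08A1"       # ARABIC LETTER BEH WITH HAMZA ABOVE
sequence_beh = "\u0628\u0654"    # ARABIC LETTER BEH + ARABIC HAMZA ABOVE

# (A) and (B): the Latin pair used for contrast.
precomposed_o = "\u00F6"         # LATIN SMALL LETTER O WITH DIAERESIS
sequence_o = "\u006F\u0308"      # LATIN SMALL LETTER O + COMBINING DIAERESIS

# U+00F6 has a canonical decomposition, so NFC composes (B) into (A).
latin_match = nfc(precomposed_o) == nfc(sequence_o)

# U+08A1 has no canonical decomposition, so NFC leaves (2) as two code
# points and the comparison with (1) fails.
arabic_match = nfc(precomposed_beh) == nfc(sequence_beh)

print("NFC(A) == NFC(B):", latin_match)
print("NFC(1) == NFC(2):", arabic_match)
```

Nothing in NFC connects (1) and (2), so any matching between them
would have to come from a rule outside normalization.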

I understand this argument.  I'm a little uncomfortable with the
implications for IDNA, however.

Best regards,

A

-- 
Andrew Sullivan
ajs at anvilwalrusden.com