Unicode 7.0.0, (combining) Hamza Above, and normalization

Paul Hoffman paul.hoffman at cybersecurity.org
Thu Aug 7 18:58:29 CEST 2014


A big +1 to what Andrew said. Vint may have been reacting viscerally, but the actual technical problem with this new character is (I believe) that the normalization rules produce the wrong result. If I'm wrong, I'd be happy to be corrected.

--Paul Hoffman

On Aug 6, 2014, at 11:01 AM, Andrew Sullivan <ajs at anvilwalrusden.com> wrote:

> I'm not prepared to comment more generally right now, but I want to
> ensure we make one clear distinction:
> 
> On Wed, Aug 06, 2014 at 05:24:36PM +0000, Shawn Steele wrote:
> 
>> That's how I read Mark's statement.  The problem statement here
>> focuses on 1 character when rnicrosoft.com (in ASCII) can achieve
>> similar effect.  Or microsoft.unsafe.com or trustme.com/microsoft -
>> or just ignoring what the link says, or my bank sending me stuff
>> that takes me to an abcprocessing.com site, which trains me that
>> legitimate stuff can go to a 3rd party site. 
> 
> This is _not_ the current problem.  The above problems are real, but
> they're different.
> 
> The current problem we're talking about is one in which "the very same
> character" can be produced by a combining sequence and as a precomposed
> character, but where the normalization rules for the combining
> sequence and the precomposed character don't produce the same result.
> It is as if you produced o-diaeresis using U+006F and U+0308, and also
> produced it using U+00F6, but when you ran the results through NFC you
> didn't get a match.  Also, this is not cross-script: it's in the very
> same script.  
> 
> The difference in this case, as I understand Mark's argument, is that
> in the present case
> 
>   1. U+08A1 ARABIC LETTER BEH WITH HAMZA ABOVE
>   2. U+0628 ARABIC LETTER BEH + U+0654 ARABIC HAMZA ABOVE
> 
> (1) and (2) are _not_ "the very same character"; but
> 
>   A. U+00F6 LATIN SMALL LETTER O WITH DIAERESIS
>   B. U+006F LATIN SMALL LETTER O + U+0308 COMBINING DIAERESIS
> 
> (A) and (B) _are_ "the very same character".  So NFC(1) != NFC(2) but
> NFC(A) == NFC(B).
> 
> I understand this argument.  I'm a little uncomfortable with the
> implications for IDNA, however.
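
The NFC comparison Andrew describes can be checked directly. What follows is only a minimal sketch in Python using the standard unicodedata module; the helper name nfc_equal is made up for illustration, and the Arabic result assumes the Unicode data bundled with the interpreter gives U+08A1 no canonical decomposition (which is how Unicode 7.0.0 defines it).

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Return True if the two strings are identical after NFC normalization.

    Hypothetical helper for illustration; not part of any IDNA library.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Latin case from the message: (A) precomposed vs. (B) combining sequence.
O_PRECOMPOSED = "\u00F6"       # LATIN SMALL LETTER O WITH DIAERESIS
O_SEQUENCE = "\u006F\u0308"    # LATIN SMALL LETTER O + COMBINING DIAERESIS

# Arabic case from the message: (1) single code point vs. (2) sequence.
BEH_HAMZA_SINGLE = "\u08A1"        # ARABIC LETTER BEH WITH HAMZA ABOVE
BEH_HAMZA_SEQUENCE = "\u0628\u0654"  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE

latin_match = nfc_equal(O_PRECOMPOSED, O_SEQUENCE)
arabic_match = nfc_equal(BEH_HAMZA_SINGLE, BEH_HAMZA_SEQUENCE)

# An empty string here means the character has no decomposition mapping,
# so NFC has nothing to compose the BEH + HAMZA ABOVE sequence into.
beh_hamza_decomposition = unicodedata.decomposition(BEH_HAMZA_SINGLE)

if __name__ == "__main__":
    print("unicodedata version:", unicodedata.unidata_version)
    print("NFC(A) == NFC(B):", latin_match)
    print("NFC(1) == NFC(2):", arabic_match)
    print("decomposition of U+08A1:", repr(beh_hamza_decomposition))
```

With the Unicode data as published, the Latin pair should compare equal and the Arabic pair should not, because U+08A1 carries no decomposition mapping for normalization to work with.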


