Json and U+08A1 and related cases (was: Re: [Json] Barry Leiba's Discuss on draft-ietf-json-i-json-05: (with DISCUSS and COMMENT))

Shawn Steele Shawn.Steele at microsoft.com
Wed Jan 21 21:07:25 CET 2015

>> I thought that NFC was closed to new precompositions though new 
>> precompositions might be added to Unicode.  That is, the NFC form of 
>> U+08A1 must be the same as the NFD form of U+08A1, which is to say: 
>> U+0628 U+0654.
>> Is my memory wrong about that?

> That is the understanding that several of us -- I dare to say most or all of the IDNAbis WG participants -- had.

My understanding of the rules is that no new character can be added if it would cause the normalization of existing code points to change.  (A character could be added with both precomposed and decomposed forms, but only if all of the code points involved were new, which seems unlikely, since newly added scripts tend not to have both forms.)

So the resulting behavior is that any newly added character that happens to look like an existing character sequence is not the same as that sequence.  In other words, by definition, U+08A1 is not the same thing as U+0628 U+0654, whether they look alike or not.
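
To make that concrete, here is a minimal sketch using Python's standard unicodedata module.  The helper names (forms, canonically_equivalent) are made up for illustration, and the results depend on the Unicode data bundled with the interpreter, which I'm assuming is Unicode 7.0 or later.  It only checks canonical equivalence; it says nothing about how the characters render.

```python
import unicodedata

BEH_HAMZA_PRECOMPOSED = "\u08A1"     # ARABIC LETTER BEH WITH HAMZA ABOVE
BEH_HAMZA_SEQUENCE = "\u0628\u0654"  # BEH followed by combining HAMZA ABOVE


def forms(s):
    """Return the NFC and NFD forms of s as lists of U+XXXX labels."""
    return {
        form: ["U+%04X" % ord(ch) for ch in unicodedata.normalize(form, s)]
        for form in ("NFC", "NFD")
    }


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# U+08A1 has no canonical decomposition, so both forms leave it alone.
print(forms(BEH_HAMZA_PRECOMPOSED))
# The two-code-point sequence does not compose to U+08A1 either.
print(forms(BEH_HAMZA_SEQUENCE))
print(canonically_equivalent(BEH_HAMZA_PRECOMPOSED, BEH_HAMZA_SEQUENCE))

# Contrast: U+0623 (ALEF WITH HAMZA ABOVE) does decompose to U+0627 U+0654.
print(canonically_equivalent("\u0623", "\u0627\u0654"))
```

The ALEF case is only there as a contrast: it is a precomposed character that normalization does treat as the same thing as its base-plus-hamza sequence, which is exactly what is not true of U+08A1.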

This is similar to spelling Hawai'i with an apostrophe or with an ʻokina (Hawaiʻi).  A human thinks they look the same, but they aren't the same.
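
The Hawaiʻi comparison can be sketched the same way, again assuming Python's standard unicodedata module.  The function name is hypothetical, and I'm taking the two spellings to be U+0027 APOSTROPHE versus U+02BB MODIFIER LETTER TURNED COMMA (the ʻokina).

```python
import unicodedata

WITH_APOSTROPHE = "Hawai'i"   # U+0027 APOSTROPHE
WITH_OKINA = "Hawai\u02BBi"   # U+02BB MODIFIER LETTER TURNED COMMA


def same_under_any_normalization(a, b):
    """True if a and b become identical under any of the four Unicode forms."""
    return any(
        unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    )


# Neither character decomposes, so no normalization form unifies the strings.
print(same_under_any_normalization(WITH_APOSTROPHE, WITH_OKINA))
```

Even the compatibility forms (NFKC/NFKD) keep the two spellings distinct, so anything that wants to treat them as "the same" has to do so above the normalization layer.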

I don't know the history of why this code point was added, and I cannot argue about whether or not it "should" have just used the existing code points.  However, I'm happy that smart people who know a lot more about scripts than I do felt that this was the right way to handle this character.  And even if they got it "wrong" (for whatever value of "wrong"), there are a lot of quirks in Unicode that are far more interesting than this character, so I'm not going to second-guess the UTC.

The Unicode process is not the same as the IETF process, but it is possible to participate.  If people really feel this strongly about how Unicode encodes characters, then they should participate in that process.  It's not helpful to second-guess what the UTC did.  It's even less helpful for IDN to try to work around a problem by using code points in a manner that doesn't conform to The Unicode Standard.  (See Zawgyi for an extreme example.)

The UTC/Unicode/UTR15 decided that U+08A1 doesn't decompose to U+0628 U+0654.  IDN shouldn't try to change that behavior.  If folks really feel strongly that Unicode did the wrong thing, then they should work with the UTC to try to convince them, realizing that at this point it would take an amazing case to break the stability rules, deprecate the new code point, and/or publish some sort of addendum or erratum.

