IAB Statement on Identifiers and Unicode 7.0.0
Shawn.Steele at microsoft.com
Thu Jan 29 05:15:21 CET 2015
Since I was apparently too confused and focused on examples, I will attempt to state some of my concerns in hopefully a more intelligent manner. I'd like to say less tedious, but I don't think that's realistic.
> 2.1 IDNA Assumptions about Unicode normalization.
Items 1-3 aren't things defined in UTR #15 (languages, labels). Those are presumably IDNA context for interpreting the rest of the section? (Yes, a nit.)
> "the most important test is that, if two glyphs are the same within a given script, they must represent the same character no matter how they are formed."
If by "glyph" one means "things that look the same to a human eye" or "things rendered using exactly the same pixels", then IDNA2003 & IDNA2008 fail this test. If by "glyph" one instead means symbols used to represent a concept or part of a concept, then IDNA2003 & IDNA2008 do a better job of passing it.
> IDNA also assumed... stability rules would be applied and work as specified... (to end of section)
They do work as specified. This statement doesn't exactly say that they don't, but it seems to suggest that the stability rules are either wrong or were violated.
> 2.2 New code point....
As discussed ad nauseam, the name and shape lead to the unfortunate conclusion that these are the same.
> As can be deduced from the name, it is visually identical to the glyph that can be formed from a combining sequence consisting of the code point for ARABIC LETTER BEH (U+0628) and the code point for Combining Hamza Above (U+0654)
Unicode doesn't encode visual things (code pages did that, sort of, with poor success); it encodes linguistic entities. Someone besides me can probably supply a more appropriate term. From a Unicode perspective this observation isn't particularly exciting.
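To make the point concrete, here is a sketch using Python's unicodedata module (assuming a Python build whose Unicode tables include 7.0's U+08A1): because ARABIC LETTER BEH WITH HAMZA ABOVE was encoded without a canonical decomposition, no normalization form maps either spelling onto the other.

```python
import unicodedata

precomposed = "\u08A1"        # ARABIC LETTER BEH WITH HAMZA ABOVE (Unicode 7.0)
sequence = "\u0628\u0654"     # ARABIC LETTER BEH + ARABIC HAMZA ABOVE

# U+08A1 has no canonical (or compatibility) decomposition, so the
# two visually identical spellings remain distinct under every form.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, precomposed) != \
           unicodedata.normalize(form, sequence)
```

So the two strings look alike but compare unequal even after normalization, which is exactly the property the IAB statement is worried about.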
> Had the issues outlined in this document been better understood at the time, it probably would have been wise for RFC 5892 to disallow either the precomposed character or the combining sequence of each pair in those cases in which Unicode normalization rules do not cause the right thing to happen...
Despite the preamble saying that it does "not imply that anyone is 'wrong'", this statement asserts that normalization is wrong.
Disallowing the combining sequence (or the precomposed character) amounts to saying that character is illegal. In other words, if there is an A and an A', then one should be disallowed: when I type it, I should use either A or A', but not both. However, language X may use spelling A, and that's what's available on its keyboard, while language Y may use A', and that's what's available on its keyboard. Disallowing one or the other may disenfranchise certain users. Granted, in this example that may seem less obvious or even unimportant, but it sets a risky precedent.
> 2.3 Other examples of the same behavior.
This section is a simplified description of normalization. However, by Unicode's definition of the code points, it is not "the same behavior" as 2.2: this is the behavior for sequences that Unicode does consider the same, as opposed to the sequence in 2.2, which Unicode does not consider the same.
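For contrast, a sketch of a pair Unicode does consider the same: U+0623 ARABIC LETTER ALEF WITH HAMZA ABOVE carries a canonical decomposition, so normalization round-trips between the two spellings.

```python
import unicodedata

alef_hamza = "\u0623"         # ARABIC LETTER ALEF WITH HAMZA ABOVE
sequence = "\u0627\u0654"     # ARABIC LETTER ALEF + ARABIC HAMZA ABOVE

# These are canonically equivalent: NFC folds the sequence to the
# precomposed character, and NFD decomposes it back.
assert unicodedata.normalize("NFC", sequence) == alef_hamza
assert unicodedata.normalize("NFD", alef_hamza) == sequence
```

This is the behavior section 2.3 describes, and it is precisely what does not happen for U+08A1 in section 2.2.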
> 2.4 Hamza...
Anything I said here would further muddle the problem :)
> 3 Proposed stuff
I prefer option 3.3 (do nothing other than warn), though I'd probably make a 3.3' saying "do nothing other than including this in the list of confusables that we already warn about" - IMO there's nothing special about this sequence.
3.1 is, IMO, a dangerous precedent. I really dislike disallowed code points because they mean someone can't do something they wanted to do with them. It's also architecturally annoying and confusing to explain to folks.
3.2 Disallowing the combining sequence. This is even worse, since it would disallow something already allowed.
3.4 Normalization Form IETF.
Though I don't think this "solves" the immediate problem, I think this is worth additional consideration. But I don't know if it's feasible.
However, I would go much farther and normalize things in ways that would damage the reversibility of the transformation. I would attempt to normalize all concepts that could be confused down to some canonical form. That would mean that any alternate spelling would be acceptable, but that the canonical form may not be "correctly spelled" for any of them. Barring errors, such a transformation would provide much more deterministic identifiers; however, it would reduce the namespace of legal identifiers considerably. Confusability wouldn't have to be based solely on appearance, but could also be based on other confusability considerations. It would also be a huge breaking change to go that far, as many names would map to similar labels.
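A minimal sketch of what such a non-reversible "fold everything confusable" pass might look like. The fold table here is hypothetical and tiny; a real one would be far larger, along the lines of the confusables data in UTS #39.

```python
import unicodedata

# Hypothetical fold table: each confusable alternative maps to one
# arbitrarily chosen canonical representative. One-way by design.
CONFUSABLE_FOLD = {
    "\u08A1": "\u0628\u0654",   # fold precomposed BEH WITH HAMZA ABOVE
}

def fold_label(label: str) -> str:
    """Normalize, then collapse confusable alternatives to one form."""
    nfc = unicodedata.normalize("NFC", label)
    return "".join(CONFUSABLE_FOLD.get(ch, ch) for ch in nfc)

# Both spellings land on the same folded label; the transformation is
# not reversible, and the result may not be "correctly spelled" for
# any language, matching the trade-off described above.
assert fold_label("\u08A1") == fold_label("\u0628\u0654")
```

The design choice is that determinism is bought by shrinking the namespace: many distinct input names collapse to one label.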
I have no clue if the above is helpful.