Json and U+08A1 and related cases (was: Re: [Json] Barry Leiba's Discuss on draft-ietf-json-i-json-05: (with DISCUSS and COMMENT))
Shawn.Steele at microsoft.com
Thu Jan 22 09:31:10 CET 2015
> > This is similar to spelling Hawai'i with or without an ʻokina (Hawaiʻi). A human thinks they look the same, but they aren't the same.
> I do not think it is similar to that case. As I understand it, there is no _in principle_ way one could distinguish these character/character sequences in practice, even with attention to the appearance of characters.
I'm confused: one is used for one thing and the other is used for another thing, so they're distinguishable, and you can tell the difference by context. They may look the same, but so do numerous other homographs.
I'd also argue that eszett and ss are an even "worse" case (from my understanding). Everyone knows that, though some words have preferred forms, ss is often an alternate spelling for eszett. They're the same script, and either could appear in a word. In the U+08A1 case, my understanding is that it should not appear interchangeably with the homograph, so only malicious users would use the wrong one in the wrong place, whereas in German there are legitimate reasons for confusing the two forms. If we can solve German (bundle or block), then we can solve this case.
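To make the comparison concrete, here is a minimal Python sketch using only the standard `unicodedata` module. It assumes the homograph in question is the sequence U+0628 U+0654 (BEH followed by combining HAMZA ABOVE); the helper name `same_after` is just for illustration. It checks whether normalization unifies each pair.

```python
import unicodedata

BEH_WITH_HAMZA = "\u08A1"         # ARABIC LETTER BEH WITH HAMZA ABOVE
BEH_PLUS_HAMZA = "\u0628\u0654"   # ARABIC LETTER BEH + ARABIC HAMZA ABOVE (assumed homograph)


def same_after(form, a, b):
    """Return True if a and b are identical after normalizing both to `form`."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Does any normalization form treat the two Arabic spellings as one?
arabic_unified = {
    form: same_after(form, BEH_WITH_HAMZA, BEH_PLUS_HAMZA)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

# The German pair: distinct under normalization, equal under case folding.
german_nfc = same_after("NFC", "straße", "strasse")
german_casefold = "straße".casefold() == "strasse".casefold()

print(arabic_unified)
print(german_nfc, german_casefold)
```

Neither pair is unified by normalization alone. The German pair only compares equal once case folding is applied, which is an extra step an application has to choose to perform.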
> We're just trying to understand.
It seems to go beyond understanding. Oversimplifying, but Unicode folks have said "they're different, they're used in different contexts. Yes they look the same, but they're different and it's not appropriate to mix them." That seems fairly easy to understand, yet there is still disagreement about whether or not they should be "the same".
> The reports keep getting worse, however.
Then participate :) There are very smart people in the UTC. Some decisions look unfortunate in hindsight, but Unicode learns and does better. I'm reasonably confident that they're doing the best possible job of accommodating the requests and requirements with the appropriate amount of consideration. Sometimes I have different opinions, but this isn't an easy space, and the more regular UTC participants know far more than I do about how scripts work.
One of the other comments started to move beyond "it shouldn't be this way, why is it this way?" and started asking how to handle it, given that it is obviously a homograph and could be confusing to users. I think this should be handled like any other homograph, which seems to leave three options: prohibit (inappropriate in this case because it would prevent words from being spelled as designed), bundle, or block one if the other is registered first.
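As a rough illustration of the block and bundle options, here is a small Python sketch of a label registry. Everything in it is hypothetical: the `VARIANTS` table, the `variant_key` helper, and the `Registry` class are invented for this example and do not reflect any real registry's policy or API. It assumes labels are already lowercased and normalized, reads "block" as "once one form is registered, no other variant can be", and reads "bundle" as "other variants may go only to the same registrant".

```python
# Hypothetical variant table: each character maps to the form it is confusable with.
VARIANTS = {
    "\u08A1": "\u0628\u0654",  # BEH WITH HAMZA ABOVE -> BEH + HAMZA ABOVE
    "ß": "ss",                 # eszett -> ss
}


def variant_key(label):
    """Collapse known variants so confusable labels share one key."""
    for char, replacement in VARIANTS.items():
        label = label.replace(char, replacement)
    return label


class Registry:
    def __init__(self, policy):
        if policy not in ("block", "bundle"):
            raise ValueError("policy must be 'block' or 'bundle'")
        self.policy = policy
        self.entries = {}  # variant key -> (owner, set of registered labels)

    def register(self, label, owner):
        """Return True if the label was registered, False if refused."""
        key = variant_key(label)
        if key not in self.entries:
            self.entries[key] = (owner, {label})
            return True
        first_owner, labels = self.entries[key]
        if label in labels:
            return False  # this exact label is already taken
        if self.policy == "bundle" and owner == first_owner:
            labels.add(label)  # variant joins the first registrant's bundle
            return True
        return False  # blocked: a confusable label was registered first


blocking = Registry("block")
print(blocking.register("strasse", "alice"))   # first form succeeds
print(blocking.register("straße", "bob"))      # variant is refused

bundling = Registry("bundle")
print(bundling.register("\u0628\u0654", "alice"))
print(bundling.register("\u08A1", "alice"))    # same registrant may add the variant
print(bundling.register("\u08A1", "bob"))      # anyone else is refused
```

Either way, both spellings remain valid; the policy only decides who, if anyone, may hold the second one.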
More information about the Idna-update