IAB Statement on Identifiers and Unicode 7.0.0

Wed Jan 28 21:02:14 CET 2015

On 1/28/15 10:27 AM, Shawn Steele wrote:
>> The problem that the IAB sees, and that the statement was trying to convey, is that IDNA (and the nearly-done PRECIS WG also) was founded on a misapprehension of what Unicode could do for us.  We believed that the script made a difference, and that other properties of characters could be used to inform decision making.  Therefore, we could use derived properties (or even just use character properties directly) as the basis for decisions.
>>      
> You're looking for a magic potion where none exists.

No, no magic potion desired. You are constructing this as "perfection or 
nothing". That can't possibly be the desire when it comes to things that 
must be both "normal-human usable identifiers" and "machine-readable 
identifiers". There are always tradeoffs. But IDNA's design rested on an 
assumption that (it appears) might be incorrect: that within a 
particular script, there wouldn't be two characters that are homoglyphs 
(for most imaginable typefaces) *and* have all of their properties 
(letter vs. digit vs. ...) identical, yet aren't canonically 
equivalent.  We decided in IDNA that all other confusables 
were things where we could simply say, "implementer beware", but 
intra-script homoglyphs would always have some other Unicode property 
that distinguished them, not just "the language that they are used in" 
or other such non-lookupable-in-a-list-of-properties feature.

> Some properties might be a helpful hint, but there's no easy way to do this.
>    

Of course. Nobody (as far as I know) is asking for easy.

>> But in the case of the characters we have called out directly (but as the recent discussion shows, there are apparently more lurking), there _is_ no property by which we could helpfully make a distinction.  We have to deal with the characters individually.
>> For the IAB, this is a big deal because it strikes at the very basis of what IDNA and PRECIS are trying to do, which is exactly _not_ to have to look at every character to figure out whether there are nasty implications for identifiers.
>>      
> What kinds of implications for identifiers?  At the machine level it's irrelevant, even if all of Unicode was allowed, they all have unique numeric values, and even with the NFC or other normalizations, the rules are applied consistently, so the binary values map consistently to a canonical form.
>
> However if you're trying to write the identifier on paper, then you run into problems.  One of the most severe problems I've run into in my recent day-to-day life is that I named my Lego R2-D2 "L3-G0". Where the last letter is a zero.
>    

Good example. In a theoretical world where Latin characters could be 
disallowed by policy, your DNS registrar might say, "We're cool if you 
use all of the numbers and all of the letters of the Latin script, 
because even though you can confuse zero with capital O, at least you 
can explain to someone 'My domain name uses the digit, not the letter', 
and we can automatically tell which one you are using by noticing that 
your domain name uses the digit, whereas some other registrant is using 
the capital O." And it also turns out 
that you can write this down for someone (who is an English speaker) by 
using a 'typeface' that distinguishes the two by putting a slash through 
the zero or the like. And it turns out that font designers readily 
recognize that distinguishing these is important and mostly do so. But 
registrars don't have to worry that there is effectively no way to 
distinguish the two except for, "You're using this in French instead of 
Spanish", which doesn't appear as a Unicode property at all.

You might say that this confusability between zero and capital O is just 
as bad as other cases. In many ways, that's true. But we made a design 
decision that so long as there was some reasonable way in the Unicode 
properties to distinguish them, for identifier purposes, it simply 
wasn't as bad for our purposes.
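
To make the zero-versus-O point concrete, here is a minimal sketch using 
Python's standard unicodedata module. The helper name 
distinguishable_by_property is mine and purely illustrative, and it 
checks only General_Category; it is not what IDNA actually specifies, 
just the kind of property lookup I mean.

```python
import unicodedata


def distinguishable_by_property(a: str, b: str) -> bool:
    """Illustrative helper (hypothetical name): True if two single
    code points differ in their Unicode General_Category."""
    return unicodedata.category(a) != unicodedata.category(b)


# Digit zero is a decimal number (Nd); capital O is an uppercase letter (Lu).
print(unicodedata.category("0"))               # Nd
print(unicodedata.category("O"))               # Lu
print(distinguishable_by_property("0", "O"))   # True
```

So a registrar can tell the two apart by a list lookup, even though a 
person looking at the label might not.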

> I don't think that an identifier can be expected to be unique and reliable as a unique token if some group of people could be confused by them.  I'd even go so far as to assert that the number of people that can be confused by existing stuff is quite large (pretty much everyone I've tried to provide a link to L3-G0's blog).
>    

Some of the identifiers we use for domain names will always be confusing. 
We're trying to make them mnemonic and human-usable, and that means there 
will be some confusion. But again, at least you can say, "with the digit 
zero, not with an O", and you can write it down with a slash through it. 
And when someone talks to your registrar, they can say, "Yeah, they're 
confusable, but one's a number and one's a letter".

The case we're talking about presently is, "They're both letters, both 
in the same script, and any sane font is likely to have them as 
indistinguishable visually, and the only way to distinguish them is the 
context in which you use them." It sure would be nice to have a 
normalization that said, "since they're only distinguishable by context, 
they are, for purposes of this normalization, equivalent". And I don't 
want IETF folks to figure that out. I want the IETF folks to explain to 
the Unicode folks what kind of things we want equivalent and for the 
Unicode folks to decide whether any particular set of code points meets 
those criteria.
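
A rough sketch of the gap, again with Python's unicodedata. The pair I 
use is U+08A1 versus U+0628 followed by U+0654, which I take to be the 
case the IAB statement called out. The CONTEXT_ONLY_FOLDS table and the 
fold_for_identifiers function are hypothetical: nothing like them is 
published by Unicode or the IETF. They only illustrate what such a 
normalization would do if the Unicode folks chose to define one.

```python
import unicodedata

# HYPOTHETICAL table, for illustration only. Deciding what belongs
# here is exactly the job I'd want the Unicode folks to do.
CONTEXT_ONLY_FOLDS = {"\u08a1": "\u0628\u0654"}


def canonically_equivalent(a: str, b: str) -> bool:
    """True if the two strings have the same NFC form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def fold_for_identifiers(s: str) -> str:
    """Hypothetical extra normalization layered on top of NFC."""
    s = unicodedata.normalize("NFC", s)
    for src, dst in CONTEXT_ONLY_FOLDS.items():
        s = s.replace(src, dst)
    return unicodedata.normalize("NFC", s)


# Precomposed e-acute and e + combining acute: NFC already handles these.
print(canonically_equivalent("\u00e9", "e\u0301"))
# The pair in question: NFC keeps them distinct.
print(canonically_equivalent("\u08a1", "\u0628\u0654"))
# With the hypothetical fold, both spellings compare equal.
print(fold_for_identifiers("\u08a1") == fold_for_identifiers("\u0628\u0654"))
```

The first and third comparisons come out equal; the second does not, 
which is the whole problem.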

pr

-- 
Pete Resnick <http://www.qualcomm.com/~presnick/>
Qualcomm Technologies, Inc. - +1 (858)651-4478