IAB Statement on Identifiers and Unicode 7.0.0
presnick at qti.qualcomm.com
Wed Jan 28 21:02:14 CET 2015
On 1/28/15 10:27 AM, Shawn Steele wrote:
>> The problem that the IAB sees, and that the statement was trying to convey, is that IDNA (and the nearly-done PRECIS WG also) was founded on a misapprehension of what Unicode could do for us. We believed that the script made a difference, and that other properties of characters could be used to inform decision making. Therefore, we could use derived properties (or even just use character properties directly) as the basis for decisions.
> You're looking for a magic potion where none exists.
No, no magic potion desired. You are constructing this as "perfection or
nothing". That can't possibly be the desire when it comes to things that
must be both "normal-human usable identifiers" and "machine-readable
identifiers". There are always tradeoffs. But there was an assumption on
the part of IDNA's design that (it appears) might be incorrect: that
within a particular script, there wouldn't be two characters that were
homoglyphs (for most imaginable typefaces) *and* had all of their
properties (letter vs. digit vs. ...) identical, yet were not
canonically equivalent. We decided in IDNA that all other confusables
were things where we could simply say, "implementer beware", but that
intra-script homoglyphs would always have some other Unicode property
that distinguished them, not just "the language that they are used in"
or some other feature that can't be looked up in a list of properties.
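
To make "canonically equivalent" concrete, here is a minimal Python
sketch using the standard unicodedata module. The helper name
canonically_equivalent is made up for illustration, and the result
depends on the Unicode version bundled with the Python build. Two
strings count as canonically equivalent here when their NFC forms match.

```python
import unicodedata

def canonically_equivalent(a, b):
    """Return True if both strings normalize to the same NFC form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# Same rendering, and normalization treats them as the same string.
print(canonically_equivalent(precomposed, decomposed))  # True

# Look-alikes that normalization does NOT fold together.
print(canonically_equivalent("0", "O"))  # False
```

Normalization handles the first pair for us. The second pair is left
alone, so something else (a property, or a policy) has to tell the two
characters apart.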
> Some properties might be a helpful hint, but there's no easy way to do this.
Of course. Nobody (as far as I know) is asking for easy.
>> But in the case of the characters we have called out directly (but as the recent discussion shows, there are apparently more lurking), there _is_ no property by which we could helpfully make a distinction. We have to deal with the characters individually.
>> For the IAB, this is a big deal because it strikes at the very basis of what IDNA and PRECIS are trying to do, which is exactly _not_ to have to look at every character to figure out whether there are nasty implications for identifiers.
> What kinds of implications for identifiers? At the machine level it's irrelevant, even if all of Unicode was allowed, they all have unique numeric values, and even with the NFC or other normalizations, the rules are applied consistently, so the binary values map consistently to a canonical form.
> However if you're trying to write the identifier on paper, then you run into problems. One of the most severe problems I've run into in my recent day-to-day life is that I named my Lego R2-D2 "L3-G0". Where the last letter is a zero.
Good example. In a theoretical world where Latin characters could be
disallowed by policy, your DNS registrar might say, "We're cool if you
use all of the numbers and all of the letters of the Latin script,
because even though you can confuse zero with capital O, at least you
can explain to someone 'My domain name uses the digit, not the letter',
and we can make the distinction about which one you are using
automatically by noticing that your domain name uses the digit, whereas
some other registrant is using the capital O." And it also turns out
that you can write this down for someone (who is an English speaker) by
using a 'typeface' that distinguishes the two by putting a slash through
the zero or the like. And it turns out that font designers readily
recognize that distinguishing these is important and mostly do so. But
registrars don't have to worry that there is effectively no way to
distinguish the two except for, "You're using this in French instead of
Spanish", which doesn't appear as a Unicode property at all.
You might say that this confusability between zero and capital O is just
as bad as other cases. In many ways, that's true. But we made a design
decision that so long as there was some reasonable way in the Unicode
properties to distinguish them, for identifier purposes, it simply
wasn't as bad for our purposes.
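
As a rough illustration of "some reasonable way in the Unicode
properties to distinguish them", the sketch below compares the
General_Category of each character in two look-alike labels. The
function name differing_positions is hypothetical, the labels are
assumed to have the same length, and General_Category is only one of
the properties a real policy might consult.

```python
import unicodedata

def differing_positions(a, b):
    """List (index, category_a, category_b) where General_Category differs.

    Assumes the two labels have the same length; zip() stops at the
    shorter one otherwise.
    """
    result = []
    for i, (ca, cb) in enumerate(zip(a, b)):
        cat_a = unicodedata.category(ca)
        cat_b = unicodedata.category(cb)
        if cat_a != cat_b:
            result.append((i, cat_a, cat_b))
    return result

# Digit zero (Nd) versus capital letter O (Lu) in the last position.
print(differing_positions("L3-G0", "L3-GO"))  # [(4, 'Nd', 'Lu')]
```

An empty result for two distinct, visually identical labels is the
troublesome case: nothing in this property tells them apart.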
> I don't think that an identifier can be expected to be unique and reliable as a unique token if some group of people could be confused by them. I'd even go so far as to assert that the number of people that can be confused by existing stuff is quite large (pretty much everyone I've tried to provide a link to L3-G0's blog).
Some of the identifiers we use for domain names will always be confusing.
We're trying to make them mnemonic and human-usable, and that means there
will be some confusion. But again, at least you can say, "with the digit
zero, not with an O", and you can write it down with a slash through it.
And when someone talks to your registrar, they can say, "Yeah, they're
confusable, but one's a number and one's a letter".
The case we're talking about presently is, "They're both letters, both
in the same script, and any sane font is likely to have them as
indistinguishable visually, and the only way to distinguish them is the
context in which you use them." It sure would be nice to have a
normalization that said, "since they're only distinguishable by context,
they are, for purposes of this normalization, equivalent". And I don't
want IETF folks to figure that out. I want the IETF folks to explain to
the Unicode folks what kind of things we want equivalent and for the
Unicode folks to decide whether any particular set of code points meets
Qualcomm Technologies, Inc. - +1 (858)651-4478