IAB Statement on Identifiers and Unicode 7.0.0

Shawn Steele Shawn.Steele at microsoft.com
Wed Jan 28 19:27:34 CET 2015


> The problem that the IAB sees, and that the statement was trying to convey, is that IDNA (and the nearly-done PRECIS WG also) was founded on a misapprehension of what Unicode could do for us.  We believed that the script made a difference, and that other properties of characters could be used to inform decision making.  Therefore, we could use derived properties (or even just use character properties directly) as the basis for decisions.

You're looking for a magic potion where none exists.  Some properties might be a helpful hint, but there's no easy way to do this.

> But in the case of the characters we have called out directly (but as the recent discussion shows, there are apparently more lurking), there _is_ no property by which we could helpfully make a distinction.  We have to deal with the characters individually.

> For the IAB, this is a big deal because it strikes at the very basis of what IDNA and PRECIS are trying to do, which is exactly _not_ to have to look at every character to figure out whether there are nasty implications for identifiers.

What kinds of implications for identifiers?  At the machine level it's irrelevant: even if all of Unicode were allowed, every character has a unique numeric value, and even with NFC or other normalizations the rules are applied consistently, so the binary values map consistently to a canonical form.
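
As a rough illustration of the machine-level point, here is a minimal Python sketch using the standard library's unicodedata module.  The helper name nfc and the sample strings are only assumptions for the example.  Two different code point sequences for "é" normalize to one canonical form, while Latin "a" and Cyrillic "а" stay distinct to the machine even though they can look identical on paper.

```python
import unicodedata


def nfc(text: str) -> str:
    """Return the NFC (canonical composed) form of text."""
    return unicodedata.normalize("NFC", text)


precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# Different binary input, same canonical form after normalization.
same_after_nfc = nfc(precomposed) == nfc(decomposed)

latin_a = "a"            # U+0061
cyrillic_a = "\u0430"    # U+0430, often drawn identically to Latin a

# Normalization does not merge look-alikes from different scripts.
distinct_after_nfc = nfc(latin_a) != nfc(cyrillic_a)

print(same_after_nfc, distinct_after_nfc)
```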

However, if you're trying to write the identifier on paper, then you run into problems.  One of the most severe problems I've run into in my recent day-to-day life is that I named my Lego R2-D2 "L3-G0", where the last character is a zero, so the blog is at http://L3-G0.blogspot.com.  However, being lazy or whatever, I tend to pronounce that as "El Three Gee Oh", so people end up going to the wrong place.  Even if I say "zero", unless I call out that they need to be careful typing it in, folks write a circle, which becomes either an O or a 0 when they type it in.  It's a little easier on the computer, as most fonts have subtle distinctions between 0 and O; however, I've taken to adding an explicit / through the 0 when I remember: http://youtu.be/PmXUBGq_uiU?t=1m47s (which I suppose with IDN could lead to a different kind of ambiguity, which I hadn't really thought of 'til now, oh well).

> So, this is true, but not exactly relevant, because these examples all make it at least _possible_ to detect the distinction.  In sufficiently clear fonts (like I'm using now), you can tell the difference between "corn" and "com" and "1" and "l" (or for that matter, "l" and "I".  Why Apple continues to use a font that obliterates that distinction when displaying passwords to you I'll never understand).  We have in fact given advice to people about exactly these sorts of issues with identifiers.

I don't think that's helpful.  You can't depend on subtle font distinctions, e.g. "l" and "I".  What if the user is vision impaired?  What if they use a big blocky font to make it easier to read?  What if I use one in a place where you'd expect the other?  (That's a briIIiant idea!  I can't tell that's wrong and I just typed it; that worked better than I expected.)  What if they aren't a native Latin script user, so the distinctions are "harder" to see?

I don't think that an identifier can be expected to be reliable as a unique token if some group of people could be confused by it.  I'd even go so far as to assert that the number of people who can be confused by existing stuff is quite large (pretty much everyone I've tried to give a link to L3-G0's blog).

Another example that the WG agreed on: I-ı-İ-i round trips to i-ı-İ-i the way IDN is designed.  Please explain to me how that isn't confusing.  If I write the domain in block caps (I) it goes one place; if I write it in lower case (ı) it goes to a completely different place.  (We should've made these all map to "i", but we didn't.)  This is far more confusing to real people than the code points under discussion (and I'm not sure how Unicode character properties could help).
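
For a rough feel of why the four i's are a problem, here is a sketch using Python's built-in, locale-independent case conversions.  This is not the IDNA mapping itself, only the default Unicode case behavior, and the variable names are made up for the example.

```python
dotless_i = "\u0131"      # ı LATIN SMALL LETTER DOTLESS I
dotted_cap_i = "\u0130"   # İ LATIN CAPITAL LETTER I WITH DOT ABOVE

# Written in block caps, dotless ı and plain i become the same letter...
caps_collide = dotless_i.upper() == "i".upper() == "I"

# ...but lowercasing that capital only ever gives back plain i.
round_trip_lost = "I".lower() != dotless_i

# Default case folding leaves ı alone, so it never matches plain i.
fold_distinct = dotless_i.casefold() != "i".casefold()

# İ lowercases to i plus COMBINING DOT ABOVE, yet another distinct string.
dotted_lower = dotted_cap_i.lower()

print(caps_collide, round_trip_lost, fold_distinct, len(dotted_lower))
```

Someone who sees the name in capitals has no way to know which lowercase form was registered.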

I think that unique identifiers that can't be confused are a pretty good idea.  I'm pretty sure that we can't do that with IDN.  (Or even with legacy DNS, if L3-G0 is an example.)  Maybe if we had a canonical form that mapped confusing things to the same thing, that'd be a start, but it'd be as bad as Punycode when you round-tripped some cases, and it would be a layer that we don't have right now.
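
A toy sketch of what such a canonical form might look like, in Python.  The table TOY_CONFUSABLES and the function skeleton are hypothetical and hand-picked for the examples in this message; they are not IDNA, PRECIS, or any Unicode data file.

```python
import unicodedata

# Hypothetical, hand-picked table; real confusable data would be far larger.
TOY_CONFUSABLES = {
    "0": "o", "O": "o",            # zero vs. letter O
    "1": "l", "I": "l",            # one and capital I vs. lowercase l
    "\u0131": "i", "\u0130": "i",  # dotless ı and dotted İ vs. plain i
}


def skeleton(label: str) -> str:
    """Map look-alike characters to one representative, then case fold."""
    composed = unicodedata.normalize("NFC", label)
    merged = "".join(TOY_CONFUSABLES.get(ch, ch) for ch in composed)
    return merged.casefold()


# The zero and letter-O spellings collapse to the same token.
print(skeleton("L3-G0") == skeleton("L3-GO"))
# So do the capital-I and lowercase-l spellings.
print(skeleton("briIIiant") == skeleton("brilliant"))
```

The mapping is lossy: "10" and "lo" end up with the same skeleton, so the original label can't be recovered from it, which is the round-trip problem mentioned above.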

-Shawn


More information about the Idna-update mailing list