IAB Statement on Identifiers and Unicode 7.0.0

Wed Jan 28 15:34:06 CET 2015

Hi,

Two messages at once, here.

On Wed, Jan 28, 2015 at 09:15:39AM +0100, Mark Davis wrote:
> 
> That is, however, a false analogy.

Well, all analogies are false (and true) in some respects.  I think
it's fair to argue that there are different severities here.  However,

> *Many characters and sequences cause confusability problems, and in
> comparison U+08A1 is not that important.*

this is not merely an issue of simple confusability.  I expect you have
not had a chance to read my note to Shawn from last night, but there's
a mighty important difference in this case, because the distinction in
question is a linguistic one, and that's not information carried with
the characters.

> And, like so many of these discussions, there is *no* data behind any of
> this Sturm und Drang around U+08A1 and related characters. If the IETF were
> serious about these issues, it would gather the data to see where the
> biggest problems are *in reality*. It would then focus on the biggest
> ticket items, to see if it can come up with solutions to those.

I think perhaps this misses why the issue is important to the IAB,
presumably because the statement didn't make this clear enough.  The
worry here is not this or that character and confusability.  That's a
problem, for sure, but not an architectural one.

The problem that the IAB sees, and that the statement was trying to
convey, is that IDNA (and the nearly-done PRECIS WG also) was founded
on a misapprehension of what Unicode could do for us.  We believed
that the script made a difference, and that other properties of
characters could be used to inform decision making.  Therefore, we
could use derived properties (or even just use character properties
directly) as the basis for decisions.

But in the case of the characters we have called out directly (though,
as the recent discussion shows, there are apparently more lurking), there
_is_ no property by which we could helpfully make a distinction.  We
have to deal with the characters individually.
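
A minimal sketch of what "no property helps" can look like in practice,
assuming Python's standard unicodedata module.  The pairing of U+08A1
with the sequence U+0628 U+0654 is an assumption made for illustration
(the message itself only names U+08A1), and the ALEF pair is included
only as a contrast.  The helper name same_under is hypothetical.

```python
import unicodedata

# Assumed counterpart pair: the precomposed letter and the
# base-plus-combining-mark sequence that renders like it.
BEH_HAMZA = "\u08a1"             # ARABIC LETTER BEH WITH HAMZA ABOVE
BEH_PLUS_HAMZA = "\u0628\u0654"  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE

# Contrast pair: here the precomposed letter has a canonical decomposition.
ALEF_HAMZA = "\u0623"             # ARABIC LETTER ALEF WITH HAMZA ABOVE
ALEF_PLUS_HAMZA = "\u0627\u0654"  # ARABIC LETTER ALEF + ARABIC HAMZA ABOVE

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def same_under(form, a, b):
    """Return True if a and b are identical after normalizing to `form`."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# One boolean per normalization form for each pair.
beh_results = {f: same_under(f, BEH_HAMZA, BEH_PLUS_HAMZA) for f in FORMS}
alef_results = {f: same_under(f, ALEF_HAMZA, ALEF_PLUS_HAMZA) for f in FORMS}

print("beh pair: ", beh_results)
print("alef pair:", alef_results)
```

If the assumptions hold, the ALEF pair compares equal under every form
while the BEH pair does not under any, so a protocol that leans on
normalization alone has nothing to key on for the second pair.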

For the IAB, this is a big deal because it strikes at the very basis
of what IDNA and PRECIS are trying to do, which is exactly _not_ to
have to look at every character to figure out whether there are nasty
implications for identifiers.

The position you seem to be adopting is that there's just no way
around this.  If that's true, then the problem is even worse than I
imagined and suggests that we're going to need much more radical
answers for identifiers -- things like carried-along language
identification, or I don't know what -- that would appear to strike at
the basic utility of using Unicode at all for this.  I'm sure neither
of us thinks that's wise.

On Wed, Jan 28, 2015 at 09:43:57AM +0100, Mark Davis wrote:
> That level of "unambiguous" was impossible, even before Unicode.
> 
> Take 8859-5, with both o and Russian o, or ASCII with "google.corn" vs
> "goog1e.com". [Both the 1 and lowercase L are an issue, but also in many
> fonts—in common use—users will read the (r + n) in the former as an m.]

So, this is true, but not exactly relevant, because these examples all
make it at least _possible_ to detect the distinction.  In
sufficiently clear fonts (like I'm using now), you can tell the
difference between "corn" and "com" and "1" and "l" (or for that
matter, "l" and "I".  Why Apple continues to use a font that
obliterates that distinction when displaying passwords to you I'll
never understand).  We have in fact given advice to people about
exactly these sorts of issues with identifiers.

As for o vs. a Cyrillic o, of course, they have different script
properties.  This is _exactly_ why the script property has turned out
to be so important for IDNA, and why the IAB statement is quite
explicit that the important case here is that the characters are all
in the same script.  (It's too bad that it doesn't discuss all the
other ways the properties also fail to help, but I suspect that would
have made the statement too long to read anyway.)
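
A rough sketch of the script-based check described above.  Python's
standard library does not expose the Unicode Script property, so this
uses the first word of the character name as a stand-in; that proxy is
an assumption and is not how IDNA derives anything.  The function names
script_guess and scripts_in are hypothetical.

```python
import unicodedata


def script_guess(ch):
    """Crude stand-in for the Script property: first word of the name."""
    return unicodedata.name(ch).split()[0]


def scripts_in(label):
    """Return the set of guessed scripts used by the letters in `label`."""
    return {script_guess(ch) for ch in label if ch.isalpha()}


all_latin = "google"
mixed = "g\u043e\u043egle"  # both "o"s replaced by CYRILLIC SMALL LETTER O

print(scripts_in(all_latin))
print(scripts_in(mixed))

# A label drawing on more than one script can be flagged mechanically.
is_mixed = len(scripts_in(mixed)) > 1
print(is_mixed)
```

A check of this kind catches the Latin o versus Cyrillic o case, but it
has nothing to say when every character in the label belongs to one
script, which is the case the statement singles out.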

I hope this helps to clarify the particular worry we're talking about.
I am not trying to suggest that other confusable issues are
unimportant.  And I'm not trying to suggest that UTC has done anything
wrong here.  The problem is entirely that we've now come to understand
one dimension of exactly what we're getting in our identifier systems,
and that's at odds with the assumptions that the identifier systems
are relying on.  So we must do _something_ about this, because the
issue is at the fundament of the protocol we've developed.

Thanks for the continued helpful discussion,

A

-- 
Andrew Sullivan
ajs at anvilwalrusden.com