IAB Statement on Identifiers and Unicode 7.0.0

Andrew Sullivan ajs at anvilwalrusden.com
Wed Jan 28 03:22:31 CET 2015


On Wed, Jan 28, 2015 at 01:13:28AM +0000, Shawn Steele wrote:
> It focuses on edge cases of confusable characters.  These are a very small part of the potential for confusion in IDNA.  
> 

I'm sorry, but I do not agree that it focusses on that.  In
particular, it says this:

"What is peculiar about these cases, as distinct from other confusable
cases, is that the decomposed and precomposed forms are in the same
script and cannot be distinguished visually by users, even in large
fonts designed for clarity. It is only by knowing the language that it
is possible to detect whether a use of the character is the correct
one."

Other kinds of confusable characters are also important, but that is
_not_ what this particular case is about, and just saying, "They're
all the same," doesn't make that true.

In addition, there is something not in the statement but that occurred
to me today because of a conversation.  Some of the examples that have
been used have different properties.  For instance, all the cases that
the IAB statement is talking about are always in the same script.  In
addition, it strikes me, they have a bunch of other properties in
common (for instance, they're all letters).  The basic problem is that
there isn't an algorithmic way to distinguish between them at all.
Indeed, that is how several of these ended up PVALID in IDNA2008: I
think that if this issue had been clear to all of us when working on
that specification, we'd have worked a little harder to determine
whether we needed an extra exception class.
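
To make that concrete, here is a rough sketch with Python's
unicodedata, using the Unicode 7.0.0 character that prompted the
statement (U+08A1, the BEH-with-hamza case I come back to below).  It
needs a Python whose Unicode database includes 7.0.0 (3.5 or later),
and it is only a sketch, but it shows that no normalization form ever
folds the precomposed character and the two-codepoint sequence
together, which is why no generic algorithm will catch the collision:

    import unicodedata

    precomposed = "\u08a1"     # ARABIC LETTER BEH WITH HAMZA ABOVE (new in 7.0.0)
    sequence = "\u0628\u0654"  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE

    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        same = (unicodedata.normalize(form, precomposed)
                == unicodedata.normalize(form, sequence))
        print(form, same)

    # All four print False: U+08A1 has no canonical (or compatibility)
    # decomposition, so normalization never unifies the two spellings,
    # even though they render identically.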

The entire justification that we've seen for these different encodings
is linguistic.  And that is no doubt correct for the purposes to which
Unicode needs to be put in general.  The problem is that identifiers
aren't _in_ a language, and even if they were, most of the time you
can't know what the right language is, because no such metadata travels
with the identifier.  This is the nub of the problem.

> Certainly having identifiers that are consistent is good

Perhaps you have a different understanding of the meaning of
"identifier" than I do.  I do not think that consistent identifiers
are merely a nice-to-have.  I think it is an essential element of any
identifier system that it be as consistent and predictable as
possible.

The present example is a case where the critical determining factor --
the linguistic metadata -- is necessarily missing.
That's quite different from ï/i or ß/ss because a clued-in user can
handle those things (even if Joe Random Language Speaker can't).  It's
also different from cases like TAMIL LETTER KA vs. TAMIL DIGIT ONE:
the former is general category Other_Letter and the latter is
Decimal_Number.  So in that case, it's at least possible to write some
rules about what things you can use by category.  There's no way to
tell whether you have a two-codepoint composition that renders BEH
with a HAMZA ABOVE or whether you have a single codepoint BEH WITH
HAMZA ABOVE (cf. the other _hamza_ cases), _and_ there's no way in
principle to write any software that could possibly detect that you
might have an issue here, at least without carrying around a big
exception table.
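
To illustrate the difference, here is another small Python sketch
(again, just a sketch): the General_Category property is enough to
separate the Tamil letter from the Tamil digit, but it says nothing
useful about the two hamza spellings, because both are ordinary
letters:

    import unicodedata

    samples = [
        ("TAMIL LETTER KA", "\u0b95"),
        ("TAMIL DIGIT ONE", "\u0be7"),
        ("ARABIC LETTER BEH", "\u0628"),
        ("ARABIC LETTER BEH WITH HAMZA ABOVE", "\u08a1"),
    ]
    for name, ch in samples:
        print(name, unicodedata.category(ch))

    # TAMIL LETTER KA Lo
    # TAMIL DIGIT ONE Nd    <- a category rule can exclude this one
    # ARABIC LETTER BEH Lo
    # ARABIC LETTER BEH WITH HAMZA ABOVE Lo    <- but both hamza
    #   spellings are plain letters, so no property-based rule helps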

Somewhat earlier, Asmus argued that the UTC had discovered that
exception lists are the only thing that works for some of these
cases.  If so, then good, but it suggests to me that we might need a
new list of exceptions for identifiers.  It appears to me that this
may be a different list of exceptions from any of the existing ones,
but I confess that I have not managed to peruse every possible
candidate exception list yet.

Again, remember, this is not just domain names we're talking about (at
least in the IAB statement), so saying "let the registries solve this"
won't automatically work.

Best regards,

A

-- 
Andrew Sullivan
ajs at anvilwalrusden.com

