IAB Statement on Identifiers and Unicode 7.0.0

Thu Jan 29 02:02:18 CET 2015

I’m confused.  Mark seems to be very clearly saying that “there are more egregious ambiguities in IDNA, but the discussion is spending a lot of energy about the 1% case rather than the bigger problems.”

-Shawn

From: Idna-update [mailto:idna-update-bounces at alvestrand.no] On Behalf Of Vint Cerf
Sent: Wednesday, January 28, 2015 12:20 AM
To: Mark Davis ☕️
Cc: IDNA update work; Andrew Sullivan
Subject: Re: IAB Statement on Identifiers and Unicode 7.0.0

Mark,

I am reading your message as saying "ambiguity is ok if there are few instances of it" while some of us would like the handling of identifiers encoded with Unicode to be unambiguous.

vint

On Wed, Jan 28, 2015 at 3:15 AM, Mark Davis ☕️ <mark at macchiato.com> wrote:
You said earlier:

> That's a false alternative, and I don't think it's in any way reasonable.
> We don't argue, "Car accidents cause lots of death, so death from influenza isn't important."

That is, however, a false analogy. A more accurate one would be:

Car accidents cause lots of death, so in comparison, the sniffles are not that important.

That is:

Many characters and sequences cause confusability problems, and in comparison U+08A1 is not that important.

And, like so many of these discussions, there is no data behind any of this Sturm und Drang around U+08A1 and related characters. If the IETF were serious about these issues, it would gather the data to see where the biggest problems are in reality. It would then focus on the biggest ticket items, to see if it can come up with solutions to those.

With your analogy, it would figure out how many deaths there are due to car accidents, slips in the bathtub, and so on (http://www.who.int/mediacentre/factsheets/fs310/en/), and focus its resources on those big ticket causes of death where it can make a difference, not focus on the sniffles.

Mark<https://google.com/+MarkDavis>

— Il meglio è l’inimico del bene —

On Wed, Jan 28, 2015 at 3:22 AM, Andrew Sullivan <ajs at anvilwalrusden.com> wrote:
On Wed, Jan 28, 2015 at 01:13:28AM +0000, Shawn Steele wrote:
> It focuses on edge cases of confusable characters.  These are a very small part of the potential for confusion in IDNA.
>

I'm sorry, but I do not agree that it focusses on that.  In
particular, it says this:

"What is peculiar about these cases, as distinct from other confusable
cases, is that the decomposed and precomposed forms are in the same
script and cannot be distinguished visually by users, even in large
fonts designed for clarity. It is only by knowing the language that it
is possible to detect whether a use of the character is the correct
one."

Other kinds of confusable characters are also important, but that is
_not_ what this particular case is about, and just saying, "They're
all the same," doesn't make that true.

In addition, there is something not in the statement but that occurred
to me today because of a conversation.  Some of the examples that have
been used have different properties.  For instance, all the cases that
the IAB statement is talking about are always in the same script.  In
addition, it strikes me, they have a bunch of other properties in
common (for instance, they're all letters).  The basic problem is that
there isn't an algorithmic way to distinguish between them at all;
indeed, that's how several of these came to be PVALID in IDNA2008.
I think that if this issue had been clear to all of us when working
on that specification, we'd have worked a little harder to determine
whether we had an extra exception class.

The entire justification that we've seen for these different encodings
is linguistic.  And that is no doubt correct for the purposes to which
Unicode needs to be put in general.  The problem is that identifiers
aren't _in_ a language, and even if they were, most of the time you
can't know what the right language is because there's no such metadata
with the identifier.  This is the nub of the problem.

> Certainly having identifiers that are consistent is good

Perhaps you have a different understanding of the meaning of
"identifier" than I do.  I do not think that having consistent identifiers
is some sort of nice-to-have, pretty good idea.  I think it is an
essential element of any identifier system that it be as consistent
and predictable as is possible.

The present example is a case where the critical determining factor --
the linguistic metadata -- is the thing that is necessarily missing.
That's quite different from ï/i or ß/ss because a clued-in user can
handle those things (even if Joe Random Language Speaker can't).  It's
also different from cases like TAMIL LETTER KA vs. TAMIL DIGIT ONE:
the former is general category Other_Letter and the latter is
Decimal_Number.  So in that case, it's at least possible to write some
rules about what things you can use by category.  There's no way to
tell whether you have a two-codepoint composition that renders BEH
with a HAMZA ABOVE or whether you have a single codepoint BEH WITH
HAMZA ABOVE (cf. the other _hamza_ cases), _and_ there's no way in
principle to write any software that could possibly detect that you
might have an issue here, at least without carrying around a big
exception table.
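The contrast drawn above can be checked with a small sketch. It uses only Python's standard `unicodedata` module; the constant and function names are made up for illustration, and the ALEF pair is added here only as an example of one of "the other hamza cases" that does have a canonical decomposition. It assumes a Python build whose Unicode database is at version 7.0 or later, so that U+08A1 is an assigned character.

```python
import unicodedata

# Code points named in the message above.
TAMIL_LETTER_KA = "\u0B95"
TAMIL_DIGIT_ONE = "\u0BE7"
BEH_WITH_HAMZA_SINGLE = "\u08A1"          # single code point, new in Unicode 7.0
BEH_PLUS_HAMZA_SEQUENCE = "\u0628\u0654"  # BEH followed by combining HAMZA ABOVE

# Illustrative contrast (not from the message): ALEF WITH HAMZA ABOVE
# has a canonical decomposition, so normalization unifies the two forms.
ALEF_WITH_HAMZA_SINGLE = "\u0623"
ALEF_PLUS_HAMZA_SEQUENCE = "\u0627\u0654"


def general_categories(text):
    """Return the General_Category of each code point in text."""
    return [unicodedata.category(ch) for ch in text]


def equal_under_nfc(a, b):
    """True if the two strings are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    # The Tamil pair differs by category, so a category-based rule can separate them.
    print(general_categories(TAMIL_LETTER_KA), general_categories(TAMIL_DIGIT_ONE))
    # The ALEF pair collapses under normalization; the BEH pair does not.
    print(equal_under_nfc(ALEF_WITH_HAMZA_SINGLE, ALEF_PLUS_HAMZA_SEQUENCE))
    print(equal_under_nfc(BEH_WITH_HAMZA_SINGLE, BEH_PLUS_HAMZA_SEQUENCE))
```

The point of the sketch is only that normalization, the generic tool one would reach for, leaves the two BEH forms as distinct strings, which is why something beyond it would be needed for identifiers.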

Somewhat earlier, Asmus argued that the UTC had discovered that
exception lists are the only thing that would work for some of these
cases.  If so, then good, but it suggests to me that we might need a
new list of exceptions for identifiers.  It appears to me that this
may be a different list of exceptions than any of the existing ones,
but I confess that I have not managed to peruse every single possible
candidate exception list yet.
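To make the idea of an identifier exception list concrete, here is a rough sketch of what carrying such a table might look like. Everything in it is hypothetical: the table name, its single entry, and the `comparison_key` function are assumptions for illustration, not part of IDNA2008, any UTC data file, or the IAB statement. It simply folds the one precomposed character discussed in this thread onto its two-code-point look-alike before normalizing.

```python
import unicodedata

# Hypothetical exception table for identifiers: code point -> replacement.
# The single entry is the case discussed in this thread; a real list would
# have to be compiled and maintained by someone, which is the open question.
IDENTIFIER_EXCEPTIONS = {
    0x08A1: "\u0628\u0654",  # BEH WITH HAMZA ABOVE -> BEH + HAMZA ABOVE
}


def comparison_key(label):
    """Apply the exception table, then NFC, to get a key for comparing labels."""
    folded = label.translate(IDENTIFIER_EXCEPTIONS)
    return unicodedata.normalize("NFC", folded)


if __name__ == "__main__":
    single = "\u08A1"
    sequence = "\u0628\u0654"
    # With the exception table applied, both spellings yield the same key.
    print(comparison_key(single) == comparison_key(sequence))
```

Whether folding, disallowing one form, or some other treatment is the right policy is exactly what such a list would have to decide; the sketch only shows that the mechanism itself is small once the list exists.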

Again, remember, this is not just domain names we're talking about (at
least in the IAB statement), so saying "let the registries solve this"
won't automatically work.

Best regards,

A

--
Andrew Sullivan
ajs at anvilwalrusden.com
_______________________________________________
Idna-update mailing list
Idna-update at alvestrand.no
http://www.alvestrand.no/mailman/listinfo/idna-update