Standardizing on IDNA 2003 in the URL Standard

John C Klensin klensin at
Wed Aug 21 22:48:56 CEST 2013

--On Wednesday, August 21, 2013 17:05 +0000 Shawn Steele
<Shawn.Steele at> wrote:

> IMO, the eszett & even more so, final sigma, are somewhat
> display issues.  My personal opinion is we need a display
> standard (yes, that's not easy

Indeed.  But it might be worth some effort.

> A non-final sigma isn't (my understanding) a valid form of the
> word, so you shouldn't ever have both registered.  It could
> certainly be argued that 2003 shouldn't have done this
> mapping.  If these are truly mutually exclusive, then the
> biggest problem with 2003 isn't a confusing canonical form,
> but rather that it doesn't look right in the 2003 canonical
> form.  However there's no guarantee in DNS that I can have a
> perfect canonical form for my label.  Microsoft for example,
> is a proper noun, however any browser nowadays is going to
> display, not  (Yes, that's
> probably not "as bad" as the final sigma example).

Right.  But I think that you are at risk of confusing two
issues.  One is that, if the needs of the DNS were the only
thing that drove Unicode decisions, we all had perfect hindsight
and foresight, and it was easy to make retroactive or flag day
corrections, probably all position-dependent (isolated, initial,
medial, final in the general case) character variations would be
assigned only a code point for the base character with the
positional stuff viewed strictly as a display issue (possibly
with an overriding qualifier codepoint).  That would have meant
no separate code point for a final sigma in Greek; no separate
code points for final Kaf, Mem, Nun, Pe, or Tsadi in Hebrew; and
so on, i.e., the way the basic Arabic block was handled before
the presentation forms were added.  If things had been done that
way, some of these things would have been entirely a display
issue, with the only difficult question for IDNA one of whether
to allow the presentation qualifier so as to permit preserving
word distinctions in concatenated strings -- in a one-case
script, selective use of final or initial character forms would
provide the equivalent of using "DigitalResearch" or "SonyStyle"
as a distinctive domain name.

But it wasn't done that way.  I can identify a number of reasons
why it wasn't and indeed why, on balance, it might have been a
bad idea.  I assume Mark or some other Unicode expert would have
a longer list of such reasons than I do.   So we cope.  To a
first order approximation, the IDNA2003 method of coping was to
try to map all of the alternate presentation forms together...
except when it didn't.   And, to an equally good approximation,
IDNA2008 deals with it by disallowing the alternate presentation
forms... except when it doesn't.  The working group was
convinced that the second choice was less evil (or at least less
of a problem) than the first one, but I don't think anyone would
really argue that either choice is ideal, especially when it
cannot be applied consistently without a lot of additional
special-case, code point by code point, rules.
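To make the contrast concrete, here is an illustrative sketch (not part of the original discussion): Python's built-in "idna" codec implements the IDNA2003 mapping via nameprep, so it folds both of the cases above together -- "ß" becomes "ss", and final sigma becomes medial sigma. (The third-party `idna` package, by contrast, implements IDNA2008 and leaves "ß" intact, producing a different label.)

```python
# Python's built-in "idna" codec implements IDNA2003 (nameprep + punycode),
# so it demonstrates the mapping behavior described above.

labels = ["faß", "fass", "σ", "ς"]  # eszett vs. "ss"; medial vs. final sigma
for label in labels:
    print(label, "->", label.encode("idna"))

# Under IDNA2003, "faß" and "fass" both map to b"fass", and "σ" (medial
# sigma) and "ς" (final sigma) both map to b"xn--4xa" -- the distinctions
# are erased before the DNS ever sees them.
```

Under IDNA2008 the same strings would be treated as distinct (or, in some profiles, the mapped forms would be rejected outright), which is exactly the "map together" versus "disallow" trade-off described above.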

Hard problem but, if we come back to the question from Anne that
started this thread, I don't think there is any good basis to
argue that the IDNA2003 approach is fundamentally better.  It is
just the approach that we took first, before we understood the
problems with it.

> Eszett is less clear, because using eszett or ss influences
> the pronunciation (at least in Germany, in Switzerland that
> can be different).  I imagine it's rather worse if you're
> Turkish and prefer different i's.  For German, nobody is ever
> going to expect fuß and to go to different
> places.

I suspect that there are other possible examples that don't have
that property.  But that is something on which Marcos should
comment.  Clearly it is within the power of the registry to
arrange for "same place" if that is what they want to do.  And,
if they do that for all such names, this whole discussion is
moot in practice.   

> For words that happen to be similar, there's no expectation
> that a DNS name is available.  AAA Plumbing and all the other
> AAA whatever's out there aren't going to be surprised that
> is already taken.

Surprised?  Probably not.  Willing to fight over who is the
"real" AAA?  Yes, and we have seen that sort of thing repeatedly.

>  So why's German more special than
> Turkish or English?

Because "ß" is really a different letter than the "ss"
sequence.  And dotless i is really a different letter than the
dotted one, just as "o" and "0" or "l" and "1" are.  If a
registry decides that the potential for spoofing and other
problems outweighs the advantages of keeping them separate and
potentially allocating them separately and either delegates them
to the same entity or blocks one string from each pair, I think
that is great.  If they make some other decision, that is great
too.  Where I have a problem is when a browser (or other lookup
application) makes that decision, essentially blocking one of
the strings, and makes it on behalf of the user without any
consideration of local issues or conventions.  

I might even suggest that, because "O" and "0" and "l" and "1"
are more confusable (and hence spoofing-prone) than "ß" and
"ss", if you were being logically consistent, you would map all
domain labels containing "0" into ones containing "o" and all
labels containing "1" into ones containing "l".  That would
completely prevent the "MICR0S0FT" spoof and a lot of others at
the price of making a lot of legitimate labels invalid or
inaccessible -- just like the "ß" case.  And, like "ß",
treating 0 or 1 as display issues would not only not help very
much, it would astonish users of European digits.
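The reductio above can be sketched in a few lines. This is a deliberately naive, hypothetical fold -- no standard mandates it, and the function and mapping table names are invented for illustration -- but it shows the point: the fold kills legitimate numeric labels right along with the spoofs.

```python
# Hypothetical confusable-folding, as the thought experiment proposes:
# map every "0" to "o" and every "1" to "l" before lookup.
CONFUSABLE_FOLD = str.maketrans({"0": "o", "1": "l"})

def fold_label(label: str) -> str:
    """Lowercase a label and fold digit/letter confusables (illustrative only)."""
    return label.lower().translate(CONFUSABLE_FOLD)

print(fold_label("MICR0S0FT"))  # the spoof collapses onto "microsoft"
print(fold_label("route101"))   # a legitimate label is mangled to "routelol"
```

Just as with mapping "ß" to "ss", the spoof is prevented only at the price of making every label that legitimately contains those code points unreachable.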


More information about the Idna-update mailing list