Unicode & IETF

Tue Aug 12 19:14:35 CEST 2014

> Shawn, your note made me try to think through the basic motivations of IDNA2008. To enable IDNs without having to change every resolver in the world, it was concluded to use a mapping from Unicode to a representation employing only ASCII strings through a coding step called punycode. It was important that the mappings be one to one between the Unicode and the punycoded forms. These two forms were designed to be canonically equivalent.

Thanks, that's better said than I'd've done :)  I translate this as "IDN encodes UTF-8 or UTF-16 values in Punycode in a reversible form."  This leaves out the mapping step, but that's fine.
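
To make "reversible" concrete, here is a minimal sketch using the codecs in Python's standard library.  The label "bücher" and the domain "bücher.example" are just illustrations, and note that Python's built-in "idna" codec follows IDNA2003, not IDNA2008, so it only stands in for the encoding step here.

```python
# Round trip between a Unicode label and its Punycode (ASCII) form,
# using only the codecs that ship with Python's standard library.
label = "bücher"

# Raw Punycode, without the "xn--" ACE prefix.
puny = label.encode("punycode")
print(puny)                      # b'bcher-kva'

# Decoding restores the original code points exactly.
restored = puny.decode("punycode")
print(restored == label)         # True

# The built-in "idna" codec (IDNA2003 rules, not IDNA2008) adds the prefix
# and works label by label on a whole domain name.
ace = "bücher.example".encode("idna")
print(ace)                       # b'xn--bcher-kva.example'
print(ace.decode("idna"))        # bücher.example
```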

> That was one of the versions of the "X==Y" you reference in the paragraph above. A second assumption was that it was possible to use only the Unicode properties of the Unicode characters to determine whether a [new] character was or was not allowed for use in IDNs. The reason this was considered valuable was precisely because it decoupled the class of PVALID characters from any particular version of Unicode. IDNA2003 did not have that property. Instead, it used what John K and others called "normative tables."

> The basic need in DNS is for a resolver to be able to find, in an efficient way a domain name in a hierarchical and distributed structure. To do this, DNS has to be able to compare ASCII strings as equal in a reliable way. To do that, it is important to get the Unicode elements of an IDN label into a canonical order so that comparison of either the Unicoded elements (e.g. in UTF8) or the punycoded (ASCII) elements can detect equality by simple string comparison.

This is the mapping step I guess.  The key point is that it needs to "compare ASCII strings as equal in a reliable way".  
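
A small sketch of what "canonical order" buys you, assuming the quoted text means Unicode normalization (my reading, it isn't spelled out above).  The letter "q" with two combining dots is an arbitrary example chosen because it has no precomposed form.

```python
import unicodedata

# Two spellings of "q" with a dot above (U+0307) and a dot below (U+0323).
# They render identically but store the combining marks in opposite order.
a = "q\u0307\u0323"
b = "q\u0323\u0307"
print(a == b)                                   # False

# Normalization applies canonical ordering: marks are sorted by combining
# class (220 for the dot below, 230 for the dot above).
nfc_a = unicodedata.normalize("NFC", a)
nfc_b = unicodedata.normalize("NFC", b)
print(nfc_a == nfc_b)                           # True
print([f"U+{ord(c):04X}" for c in nfc_a])       # ['U+0071', 'U+0323', 'U+0307']
```

After that step a plain string comparison is enough, on either the Unicode form or the Punycode form.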

> When strings that users would regard as "the same" have ambiguous representations in either the Unicoded or the punycoded sequences, the ambiguity can result in failure to find the appropriate domain name in the DNS. Or, worse, one may find the "wrong" one in the case that the ambiguous versions have been independently registered and map to different IP addresses. This is not about "confusables" in the sense that some characters look like others. It is about the fact that the same glyph has multiple encodings that do not collapse to an unambiguous canonical form.

The key thing I see here is "When strings that users would regard as 'the same'"...  I would hypothesize that:

A) What "users regard as the same" is likely very broad.  Some examples that would likely break this rule in existing IDN:
   * Hawai'i.com and Hawaiʻi.com
   * Hawaii.com and either of the above.  (Yes, we know they're different, but would grandma?)
   * Fussball.ch and fußball.ch (or Fussball.de and fußball.de despite all the discussion in IDNA2008)
   * Any alternate spelling because someone's charset, font, keyboard, or other support isn't completely stable.  In addition to the ʻokina, the kahakō had this problem in Hawaiian too, being replaced with a dieresis and a font hack as a workaround.  Maori had a similar issue.  
   * For that matter probably any umlauted word in German and its ASCII ae alternate spelling.
   * Any workaround where people dropped Latin diacritics because DNS was ASCII only.  It's still hard to find sites that switched to the 'correct' IDN form since the munged ASCII form is working for them.
   * I'll skip totally stupid stuff, like http://trustme.com/Microsoft & http://Microsoft.com since that's not actually IDN.
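
To illustrate that the pairs above really are different strings as far as Unicode is concerned, here is a rough check.  The lowercase spellings and the "buecher"/"bücher" pair are my own stand-ins for the examples in the list.

```python
import unicodedata

def nfc(s):
    """Return the NFC-normalized form of s."""
    return unicodedata.normalize("NFC", s)

pairs = [
    ("hawai'i", "hawai\u02bbi"),   # ASCII apostrophe vs U+02BB, the ʻokina
    ("hawaii", "hawai\u02bbi"),
    ("fussball", "fu\u00dfball"),  # "ss" vs U+00DF, sharp s
    ("buecher", "b\u00fccher"),    # "ue" vs U+00FC, u with diaeresis
]

for left, right in pairs:
    same = nfc(left) == nfc(right)
    print(f"{left!r} vs {right!r}: equal after NFC? {same}")
```

Every pair comes out unequal, so normalization alone can't make them "the same".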

> The argument against allowing the new character is found in the paragraph above and is not about glyph confusion. It is about coding ambiguity.

Unfortunately in many places there are ambiguous ways of coding their language.  That's particularly true in places where the language is evolving.  Ironically, I'm not sure it's actually true in the cited example since there may be missing keyboard support as well.  In my experience that's true of most of the new Unicode code points: if there were already an acceptable way of spelling it, then either A) they wouldn't ask for support, or B) the UTC would (politely) try to explain why it wasn't different.

Anyway, I've lost where I'm going with this email, but I think that you & I have fundamentally different views on how much rigor, and at what layer, to apply rules to fix "what users regard as the same".

To handle any 'alternate encodings', this makes sense to me:
A) Normalization is a great aid to reduce user confusion and guide people on different systems (NFC keyboard vs NFD keyboard) to a common domain name.
B) It's appropriate to normalize as part of the lookup step.
C) There's a lot of "users regard as the same" stuff that isn't fixed by normalization.
D) It doesn't really matter (to me) if it's something obviously unsolvable by IDN, like Hawaii and Hawaiʻi, or if it's something that maybe isn't 'perfect' in Unicode for whatever reason.
E) Those things (IMO) should all go into the layer where registrars bundle/block similar registrations.
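
Points A and B can be sketched roughly like this.  The helper name to_ascii_label is made up for illustration, and it skips everything else a real IDNA2008 lookup does (validity checks, case handling and so on); it only shows NFC and NFD input converging on one name.

```python
import unicodedata

def to_ascii_label(label):
    """Hypothetical lookup helper: NFC-normalize, then Punycode-encode."""
    normalized = unicodedata.normalize("NFC", label)
    if normalized.isascii():
        return normalized
    return "xn--" + normalized.encode("punycode").decode("ascii")

typed_nfc = "b\u00fccher"     # precomposed U+00FC
typed_nfd = "bu\u0308cher"    # "u" followed by combining diaeresis U+0308

# Without normalization the two inputs produce different ASCII forms.
print(typed_nfc.encode("punycode") == typed_nfd.encode("punycode"))  # False

# With normalization in the lookup path they converge on one name.
print(to_ascii_label(typed_nfc))                                     # xn--bcher-kva
print(to_ascii_label(typed_nfc) == to_ascii_label(typed_nfd))        # True
```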

And my points above (I may have missed something) are indeed what IDN is doing, particularly in IDNA2008.  So, in my view, we're quibbling about whether it should be handled in A) normalization, or E) bundling/blocking.  I really don't see any difference.  In some cases normalization is more convenient, however I'm willing to accept that this new character is different than the old character as claimed, and that users in the respective language(s) aren't likely to spell it wrong.  And if they do, bummer, they'll have to learn to type it 'correctly', or bundle, or something.  Considering the millions of other possible names with this problem I have a hard time getting more excited than that about it.
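
For what handling it in the bundling/blocking layer might look like, here is a purely hypothetical sketch.  The bundle_key function and the set of characters it ignores are invented for illustration and don't reflect any actual registry policy; real policies are per-registry and per-language.

```python
import unicodedata

# Hypothetical set of apostrophe-like characters a registry might ignore.
APOSTROPHE_LIKE = {"'", "\u2019", "\u02bb"}

def bundle_key(label):
    """Hypothetical key: labels with equal keys are bundled or blocked together."""
    folded = label.casefold()                       # also turns U+00DF into "ss"
    decomposed = unicodedata.normalize("NFD", folded)
    kept = [
        ch for ch in decomposed
        if unicodedata.category(ch) != "Mn"         # drop combining marks
        and ch not in APOSTROPHE_LIKE
    ]
    return "".join(kept)

for name in ["Hawai'i", "Hawai\u02bbi", "Hawaii"]:
    print(name, "->", bundle_key(name))             # all map to "hawaii"

print(bundle_key("fu\u00dfball") == bundle_key("Fussball"))   # True
print(bundle_key("b\u00fccher") == bundle_key("buecher"))     # False
```

Note the last line: a generic key like this still misses the German "ue" spelling, which is the sort of language-specific rule that belongs with the registry rather than in normalization.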

BTW: lots of the supposition has been "what if they type it the other way?"  I would expect that, given a lack of a normalization mapping, systems that typically expose NFD forms in the typing will still have each language's keyboard produce the correct spelling for that language.  The same goes for the NFC systems.  Any keyboard that used the wrong character here would be in error.

-Shawn