Updating RFC 5890-5893 (IDNA 2008) to Full Standard

Thu Nov 8 16:26:51 CET 2012

Folks, this is not a discussion I can have in a thoughtful way
this week so, while there are a few comments below, I won't be
able to respond to this thread any more until late next week.

--On Thursday, 08 November, 2012 10:05 +0100 Anne van Kesteren
<annevk at annevk.nl> wrote:

> On Thu, Nov 8, 2012 at 6:45 AM, "Martin J. Dürst"
> <duerst at it.aoyama.ac.jp> wrote:
>> By chance, I found the reference that I was referring to
>> again: http://annevankesteren.nl/2012/09/idna2008
> 
> I'm not sure what Internet Explorer does, but of the other
> browsers only Opera implements IDNA2008 (and does not do it
> per the recommendations of UTS #46, and is probably
> incompatible with deployed content and needs to change).

See the response I just sent to Martin about UTS #46 and note
that it is _not_ part of IDNA2008.  What may be more relevant than
what the browsers are or are not doing is that several
registries are eliminating labels that are not
IDNA2008-conformant either at renewal time or earlier.  So, if
"content" contains non-conformant strings, there is a rising
danger of non-resolution regardless of what browsers do.

> The main problem with IDNA2008 is that it makes a large corpus
> of domain strings effectively undefined, since it does not
> define pre-processing between e.g. finding a domain string in
> a page and applying the relevant algorithms and does not
> forbid such an algorithm either. In particular
> http://tools.ietf.org/html/rfc5891#section-5.2 is very
> handwavy whereas IDNA2003 is perfectly clear.

I think you are misreading the spec and I don't even know what
you mean by "implementing IDNA2008" in this context.   The
bottom line is that the only labels that are valid for IDNA2008
are those for which there is a reversible one-to-one mapping
between the native character string and the Punycoded one.  All
such labels are valid for lookup under IDNA2003.  The paragraph
you cite isn't handwaving.  It basically says that, if someone
needs to map in a local environment, they should map as needed
(in that context, UTS #46 and RFC 5895 are just recommendations)
but that, in the general case, mapping is not advisable.  And it
is not advisable for a reason that, ironically, was first
discovered (AFAIK, at least) by browser vendors: if the user
supplies a string, it is converted to what is now called an
A-label and then converted back and a different string produced,
users get very confused.  Worse, if we "train" users to
understand that it is ok that the strings that appear from
reverse mapping can be unlike (and perhaps unrelated in any way
that they can understand) their original input, we desensitize
them to clues of phishing and other attacks.
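
As a rough illustration of the round-trip point above (a sketch
only, not taken from any of the specs): Python's built-in "idna"
codec applies IDNA2003-style Nameprep mapping before Punycode
encoding, so it can show an input string coming back changed.
The helper names round_trip and is_stable are made up for this
example, and nothing here is meant to describe what any
particular browser does.

```python
def round_trip(label: str) -> str:
    """Map a label to its ASCII-compatible form, then convert it back.

    Uses Python's built-in "idna" codec, which follows IDNA2003
    (Nameprep mapping, then Punycode with the "xn--" prefix).
    """
    ace = label.encode("idna")   # e.g. b"xn--..." for a non-ASCII label
    return ace.decode("idna")    # back to a native-character string


def is_stable(label: str) -> bool:
    """True when the label survives the round trip unchanged."""
    return round_trip(label) == label


for candidate in ("bücher", "Bücher"):
    print(candidate, "->", round_trip(candidate),
          "stable:", is_stable(candidate))
```

The lower-case string survives the round trip unchanged; the
capitalized one comes back in a different form than the user
typed, which is the kind of mismatch described above.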

But I don't understand how you tell by testing that a browser
has implemented IDNA2008.  Even in the notorious Eszett case,
lower-case Eszett was an undefined code point in IDNA2003.
IDNA2008 requires that a label containing it be mapped to an
A-label and looked up.  Because of its "look up almost anything"
rule, an IDNA2003 implementation must map it to an A-label and
look it up -- exactly the same treatment.  An IDNA2003
implementation that maps it to "ss" is simply non-conforming
since Stringprep doesn't contain a mapping for the character.

Now, if browsers implemented a made-up IDNA2005 (i.e., IDNA2003
with a version of Stringprep that they guessed at without any
standardization process or with a Unicode 5.x version of case
mapping) then there is an incompatibility with IDNA2008.  But
that isn't non-compliance with IDNA2003.  

On the other hand, if the user types Eszett in upper case,
IDNA2003 maps it to "ss".  IDNA2008 doesn't prohibit that... it
just says that a label containing it isn't a U-label and that
what the lookup application does is up to it.  It strongly
implies that a warning message would be appropriate, but doesn't
require even that.

> Overall, making backwards incompatible changes to domain string
> processing seems wildly inappropriate and I wonder why the IAB
> did not intervene. The rationale document cites "surprising"
> and "sensible" but I think we need something stronger to
> invalidate running code. (And if we don't, then the arguments
> I get about URI/IRI seem out of whack.)

I don't have time to rehash that argument, but the WG decided
that IDNA2003 contained some serious design mistakes, some of
which could make legitimate labels and reasonable uses of
scripts and languages inaccessible.   Remembering that, if
writers of HTML had used only those native character strings
that could be produced by mapping back from the Punycode-encoded
forms (something that we strongly recommended in many quarters
from the beginning) or used the Punycode-encoded forms
themselves, there would be no incompatibilities, and also
remembering that the number of users and uses of IDNs a couple
of years ago was expected to be a tiny, almost insignificant,
fraction of the users a decade hence, getting rid of the
unreasonable restrictions and reducing the security risks seemed
to the WG to be a reasonable tradeoff.  Perhaps we should have
changed the prefix rather than creating a transition problem for
a few characters, but that, of course, would have been much
harder on browser vendors as registries transitioned, especially
those browser vendors who are sensitive about performance.

best,
   john