Standardizing on IDNA 2003 in the URL Standard
Anne van Kesteren
annevk at annevk.nl
Fri Jan 17 14:23:44 CET 2014
On Thu, Jan 16, 2014 at 6:24 PM, John C Klensin <klensin at jck.com> wrote:
> The important difference between case (iv) and the others is
> that, as others have pointed out, case (iv) is not one case and
> no one actually knows what it actually means. Yet, as I
> understand it, that is precisely what Anne is proposing to
> specify. In terms of a standard, that comes pretty close to
> "Unicode 3.2 is standardized and we hope that no properties of
> it will change; for characters included in later versions of
> Unicode, do what you like". I can't think of anything kind to
> say about that.
That is not what I'm proposing though. It might be important to
distinguish UI from DNS.
What's important for interoperability in domain names is translation
of a sequence of code points to a sequence of bytes that can be used
within the DNS. If you take IDNA2003, substitute an updated version of
Unicode, and assume the algorithms defined in IDNA2003 still apply, you
have an algorithm that defines just that. (UTS46 in compatibility mode appears
to be basically that, minus a couple of exceptions I should probably
investigate at some point.)
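
To make the code-points-to-bytes step concrete, here is a minimal sketch
using Python's built-in "idna" codec. That codec implements IDNA2003
(RFC 3490) with the original Unicode 3.2 tables, so it illustrates the
mapping behaviour but not the updated-Unicode variant described above.
The function name to_dns_bytes and the example hosts are made up for
illustration.

```python
def to_dns_bytes(host: str) -> bytes:
    """Map a host name (code points) to the bytes used in the DNS.

    Uses Python's built-in "idna" codec, which applies IDNA2003
    ToASCII (Nameprep plus Punycode) to each label. Assumption: the
    Unicode 3.2 tables are good enough for this illustration.
    """
    return host.encode("idna")


if __name__ == "__main__":
    # Non-ASCII labels are mapped (including case folding) and then
    # Punycode-encoded; "ß" is mapped to "ss" under IDNA2003.
    for host in ("bücher.example", "Bücher.example", "straße.example"):
        print(host, "->", to_dns_bytes(host))
```

The point is only that the mapping is deterministic: the same sequence
of code points always yields the same bytes, independent of any UI
policy.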
Then there's another aspect, which is UI: making sure the user is not
spoofed, etc. Browsers already differ in what they are willing to show to
the user in Unicode and what they will show in "ASCII". E.g. Chrome
has a policy where it will only use ToUnicode if the code points can
reasonably be assumed to be within a range that the user's locale
matches. See http://wiki.whatwg.org/wiki/URL#UI for some pointers.
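
A rough sketch of what such a display policy could look like, kept
separate from the DNS mapping. This is not Chrome's actual algorithm.
The function display_host, the allowed_scripts parameter, and the trick
of taking the first word of a character's Unicode name as its "script"
are all assumptions made for illustration.

```python
import unicodedata


def display_host(ascii_host, allowed_scripts):
    """Decide per label whether to show Unicode or the ASCII form.

    Hypothetical policy: a label is shown in Unicode only if every
    non-ASCII character's Unicode name starts with one of the words in
    allowed_scripts (e.g. {"LATIN"}); otherwise the ASCII label is kept.
    """
    shown = []
    for label in ascii_host.split("."):
        try:
            # IDNA2003 ToUnicode via Python's built-in codec.
            decoded = label.encode("ascii").decode("idna")
        except UnicodeError:
            decoded = label  # undecodable: fall back to what we were given
        acceptable = all(
            ch.isascii()
            or unicodedata.name(ch, "").split(" ")[0] in allowed_scripts
            for ch in decoded
        )
        shown.append(decoded if acceptable else label)
    return ".".join(shown)


if __name__ == "__main__":
    host = "xn--bcher-kva.example"
    print(display_host(host, {"LATIN"}))     # Unicode form is shown
    print(display_host(host, {"CYRILLIC"}))  # ASCII form is kept
```

Whatever policy is used here only changes what the user sees; the bytes
sent to the DNS stay the same.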
I guess these two are what you later call "href" and "user input".
Given that these are already decoupled I don't really see why we
should not have mapping in "href" consistent with what we provide now.
And frankly, I don't see it going away.