Standardizing on IDNA 2003 in the URL Standard

Thu Aug 22 13:38:00 CEST 2013

I think it is time to start a serious campaign to move to the IDNA2008
standard, for the simple reason that it removes the dependence on a fixed
and now very old version of Unicode. Opinions about backward compatibility
vary. I am more sanguine than others about accepting incompatibility with
past choices - the non-letter characters may be cute, but their cost is too
high and their utility too low, as I see it.

As the TLD space expands and IDNs become more popular, canonical
representations and decoupling from specific versions of Unicode are
essential for stability, uniformity and interoperability. It will only get
messier with time if we don't get going on this objective.

vint

On Thu, Aug 22, 2013 at 7:02 AM, Gervase Markham <gerv at mozilla.org> wrote:

> On 22/08/13 11:37, Anne van Kesteren wrote:
> >> Shame for them. The writing has been on the wall here for long enough
> >> that they should not be at all surprised when this stops working.
> >
> > I don't think that's at all true. I doubt anyone realizes this. I
> > certainly didn't until I put long hours into investigating the IDNA
> > situation.
>
> It's not been possible to register names like ☺☺☺.com for some time now;
> that's a big clue. The fact that Firefox (and other browsers, AFAIAA)
> refuses to render such names as Unicode is another one. (Are your
> friends really using http://xn--74h.example.com/ ?)
>
> Those two things, plus the difficulty of typing such names, mean that
> their use is going to be pretty limited. (Even the guy who is trying to
> flog http://xn--19g.com/ , and is doing so on the basis of the fact that
> this particular one is actually easy to type on some computers, has not
> in the past few years managed to find a "Macintosh company with a
> vision" to take it off his hands.)
>
> > Furthermore, we generally preserve compatibility on the web so URLs
> > and documents remain working.
> > http://www.w3.org/Provider/Style/URI.html It's one of the more
> > important parts of this platform.
>
> (The domain name system is about more than just the web.)
>
> IIRC, we must have broken a load of URLs when we decided that %-encoding
> in URLs should always be interpreted as UTF-8 (in RFC 3986), whereas
> beforehand it depended on the charset of the page or form producing the
> link. Why did we do that? Because the new way was better for the future,
> and some breakage was acceptable to attain that goal.
>
> So what is the justification for removal of non-letter characters?
> Reduction of attack surface. When characters are divided into scripts,
> we can enforce no-script-mixing rules to keep the number of possible
> spoofs, lookalikes and substitutions tractable for humans to reason
> about in the case of a particular TLD and its allowed characters. If we
> allowed 3,254 extra random glyphs in every TLD, this would not be so.
>
> Gerv
>
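
For readers who want to see which characters hide behind the xn-- labels
quoted above, here is a minimal Python sketch. It decodes an A-label with the
standard library's punycode codec and flags every character whose Unicode
general category is not letter-like. The helper names and the LETTER_LIKE set
are assumptions made for illustration only; the real IDNA2008 derivation in
RFC 5892 has exceptions and contextual rules that this does not model.

```python
import unicodedata

# Assumption for this sketch: only these general categories count as
# "letter-like". RFC 5892 is considerably more detailed than this.
LETTER_LIKE = {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}


def decode_a_label(a_label: str) -> str:
    """Decode an 'xn--' label to Unicode; return other labels unchanged."""
    prefix = "xn--"
    if not a_label.lower().startswith(prefix):
        return a_label
    return a_label[len(prefix):].encode("ascii").decode("punycode")


def non_letter_chars(label: str) -> list:
    """Return the characters that are neither letter-like nor a hyphen."""
    return [
        ch for ch in label
        if ch != "-" and unicodedata.category(ch) not in LETTER_LIKE
    ]


if __name__ == "__main__":
    # The two labels mentioned in the quoted message.
    for a_label in ("xn--74h", "xn--19g"):
        u_label = decode_a_label(a_label)
        flagged = [
            "U+%04X (%s)" % (ord(ch), unicodedata.category(ch))
            for ch in non_letter_chars(u_label)
        ]
        print(a_label, "->", u_label, flagged)
```

A check based on general category alone is only a rough stand-in for the
"non-letter characters" discussed in this thread, but it shows why symbol
labels like these fall outside a letters-and-digits policy.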