Standardizing on IDNA 2003 in the URL Standard

Thu Jan 16 15:27:03 CET 2014

On Thu, Jan 16, 2014 at 1:24 PM, Mark Davis ☕ <mark at macchiato.com> wrote:
> It is not unlikely that an implementation that you think is following
> IDNA2003 (with a non-standard, larger repertoire) is actually following UTS
> 46.

I know for a fact that Gecko has not changed its implementation (but
has updated Unicode since the release of IDNA2003, doh). It "passes"
the Pile of Poo Test™:

<a href="http://💩.com/">test</a>
<script>alert(document.querySelector("a").host)</script>

Alerts: xn--ls8h.com

Chrome alerts the same and reportedly has updated to UTS46 (compatible
mode), so as you point out the differences are probably minor and
would only show up when checking some of the more obscure code points.
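
For anyone who wants to repeat the check outside a page, here is a
rough sketch. It assumes a runtime that exposes a global URL
implementation (a current browser console or Node.js, say). The helper
name parsedHost is made up for illustration, and the result depends
entirely on that implementation's IDNA handling.

```javascript
// Hypothetical helper: parse a URL string and return its serialized host.
// Assumes a global URL constructor is available in the runtime.
function parsedHost(input) {
  return new URL(input).host;
}

// The same input as the <a href> test above.
console.log(parsedHost("http://💩.com/")); // xn--ls8h.com in the browsers tested above
```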

> There is a table in
> http://unicode.org/reports/tr46/#Table_IDNA_Comparisons

That is an interesting table. Ⅎ (line c) does indeed seem to be
disallowed in Chrome, yet 㛼 (line d), which should also be disallowed
per that table, works fine. Both work fine in Firefox. Both Chrome and
Firefox map ！ (line b) to ! and do not fail parsing because of it,
even though the table suggests it should. (Presumably that is due to
it making assumptions about ASCII that browsers do not share.)

Firefox and Safari map ؂ (line i), but Chrome does not.

> One way to look at UTS 46 is as a migration layer to support client
> implementations during the transition of registries from IDNA2003 to
> IDNA2008, plus a mapping layer that can be used with straight IDNA2008.

I'm not sure what this means. Do you think we will ever stop mapping
U+3002 to U+002E? Or A to a?
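
To make those two mappings concrete, a small sketch, again assuming a
runtime with a global URL implementation that applies the usual
mapping step. The helper name mappedHost is hypothetical.

```javascript
// Hypothetical helper: return the host after the URL parser's mapping step.
// Assumes a global URL constructor is available in the runtime.
function mappedHost(input) {
  return new URL(input).host;
}

// U+3002 IDEOGRAPHIC FULL STOP is treated as a label separator (U+002E).
console.log(mappedHost("http://example\u3002com/")); // example.com

// Uppercase ASCII is lowercased.
console.log(mappedHost("http://EXAMPLE.COM/")); // example.com
```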

>> I think I did mention earlier on UTS46 might be okay, depending on the
>> details. I am hoping to hear from Mark on the matter.
>
> I'm not sure what specific questions you have about UTS 46. Can you
> reiterate them?

You keep talking about UTS 46 as if it were a migration layer, which
suggests it might go away. That does not really seem acceptable to me.

UTS 46 enforces DNS length restrictions on domain names (IDNA2003 did
the same), something browsers do not appear to implement. They're
fine with a label longer than a hundred code points. I don't think
this should be outlawed at the parsing layer because the name might be
used outside the DNS.
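
For reference, the restrictions in question are roughly: each label is
1 to 63 octets and the whole name (without the root label and its dot)
is 1 to 253 octets. A sketch of that as a separate check, kept apart
from parsing, follows. The function name dnsLengthProblems is made up,
and it assumes the input has already been converted to ASCII so that
string length equals octet count.

```javascript
// Hypothetical helper: list the DNS length limits an ASCII domain violates.
// Assumes the input is already ASCII (e.g. after Punycode encoding),
// so one character corresponds to one octet.
function dnsLengthProblems(asciiDomain) {
  const problems = [];
  // Ignore a single trailing dot (the root label) for these checks.
  const name = asciiDomain.endsWith(".")
    ? asciiDomain.slice(0, -1)
    : asciiDomain;
  if (name.length > 253) {
    problems.push("name longer than 253 octets");
  }
  for (const label of name.split(".")) {
    if (label.length === 0) {
      problems.push("empty label");
    } else if (label.length > 63) {
      problems.push("label longer than 63 octets");
    }
  }
  return problems;
}

// A 100-character label is fine for browsers but exceeds the DNS limit.
console.log(dnsLengthProblems("a".repeat(100) + ".com"));
console.log(dnsLengthProblems("example.com"));
```

Keeping this outside the parser would let a caller apply it only when
the name is actually going to be looked up in the DNS.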

I wish it contained the actual ASCII restrictions we need in practice
rather than deferring those to the application, but I suppose I can
define those in the URL Standard and use UseSTD3ASCIIRules=false.

Another wish I have is that the algorithms were a bit clearer in terms
of input and output. What argument does ToASCII take? What about
ToUnicode?

E.g. how would you replace "domain to ASCII" and "domain to Unicode"
in http://url.spec.whatwg.org/#concept-host-parser with UTS46 and
ensure the algorithm still has the same kind of expected output? It's
not entirely clear to me how to make use of your work.

-- 
http://annevankesteren.nl/