Standardizing on IDNA 2003 in the URL Standard

Fri Jan 31 00:24:28 CET 2014

2014-01-16 Mark Davis ☕ <mark at macchiato.com>

> I will be brief, because I don't have much time for this topic this week.
> (It should teach me to be quiet...)
>
> Mark <https://google.com/+MarkDavis>
>
> *— Il meglio è l’inimico del bene —*
>
>
> On Thu, Jan 16, 2014 at 3:27 PM, Anne van Kesteren <annevk at annevk.nl> wrote:
>
>> On Thu, Jan 16, 2014 at 1:24 PM, Mark Davis ☕ <mark at macchiato.com> wrote:
>> > It is not unlikely that an implementation that you think is following
>> > IDNA2003 (with a non-standard, larger repertoire) is actually following
>> > UTS 46.
>>
>> I know for a fact that Gecko has not changed its implementation (but
>> has updated Unicode since the release of IDNA2003, doh). It "passes"
>> the Pile of Poo Test™:
>>
>> <a href="http://💩.com/">test</a>
>> <script>alert(document.querySelector("a").host)</script>
>>
>
> The problem is, as Andrew and others have said, IDNA2003 does not specify
> *how* one would update to a new version of Unicode: that is, exactly which
> new characters would be accepted and which not, and how to case-map them.
>
>
>> Alerts: xn--ls8h.com
>>
>> Chrome alerts the same and reportedly has updated to UTS46 (compatible
>> mode), so as you point out the differences are probably minor and
>> require checking of some obscurer code points.
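
For anyone who wants to reproduce the quoted result outside a browser, here is a
minimal sketch. It applies only the raw Punycode step (RFC 3492) to the single
code point U+1F4A9 and adds the "xn--" prefix by hand. It performs no
IDNA2003 or UTS 46 mapping or validation, so it is not a full ToASCII. The
variable names are just illustrative.

```python
# Raw Punycode encoding of one label; no mapping or validation is applied.
label = "\U0001F4A9"  # PILE OF POO

# The "punycode" codec is part of the Python standard library.
ace_label = "xn--" + label.encode("punycode").decode("ascii")

host = ace_label + ".com"
print(host)
```

The printed host should match the value both browsers alert above.
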
>>
>>
>> > There is a table in
>> > http://unicode.org/reports/tr46/#Table_IDNA_Comparisons
>>
>> That is an interesting table. Ⅎ (line c) seems indeed disallowed in
>> Chrome, yet 㛼 (line d) which should also be disallowed per that table
>> works fine. Both work fine in Firefox. Both Chrome and Firefox map ！
>> (line b) to ! and do not cause parsing to fail because of it, even
>> though the table suggests it should. (Presumably due to it making
>> assumptions about ASCII that browsers do not share.)
>>
>
> I'd have to look at those cases.
>

You used U+36FC ( 㛼 ) instead of U+2F868 ( 㛼 ). Try this 㛼.com instead.
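
The two are easy to confuse because normalization silently turns one into the
other, and many mail clients and editors normalize on paste. A minimal sketch,
assuming a Python whose unicodedata module reflects a current Unicode version
(that is, with the Corrigendum #4 mapping in place); writing the code point as
an escape avoids the copy-and-paste problem:

```python
import unicodedata

compat = "\U0002F868"   # CJK COMPATIBILITY IDEOGRAPH-2F868
unified = "\u36FC"      # CJK UNIFIED IDEOGRAPH-36FC

# Canonical normalization replaces the compatibility ideograph with its
# unified counterpart, so the original code point is lost after NFC.
normalized = unicodedata.normalize("NFC", compat)

print(f"U+{ord(compat):04X} -> U+{ord(normalized):04X}")
print(normalized == unified)
```

So a test page has to carry U+2F868 itself (for example as a character
reference or escape) for line d of the table to be exercised at all.
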


>
>>
>> Firefox and Safari map ؂ (line i) and Chrome does not.
>>
>>
>> > One way to look at UTS 46 is as a migration layer to support client
>> > implementations during the transition of registries from IDNA2003 to
>> > IDNA2008, plus a mapping layer that can be used with straight IDNA2008.
>>
>> I'm not sure what this means. Do you think we will ever stop mapping
>> U+3002 to U+002E? Or A to a?
>>
>
> I'm assuming that you mean the ASCII characters (I'm not going to check
> whether you have just used look-alikes). ASCII case mapping is covered at a
> different level.
>
> I don't think clients would stop mapping, and IDNA2008 permits it. That's
> why I said "plus a mapping layer that can be used with straight IDNA2008".
>
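
To make the two mappings under discussion concrete, here is a deliberately tiny
sketch. The table and the function name are hypothetical and cover only the
dot-like label separators listed in RFC 3490 plus ASCII lowercasing. The real
UTS 46 mapping table is far larger, and as noted above ASCII case mapping is
handled at a different level, so this only illustrates the idea and does not
implement either specification.

```python
# Hypothetical mini mapping table: the non-ASCII label separators from RFC 3490.
DOT_LIKE = {
    "\u3002": ".",  # IDEOGRAPHIC FULL STOP
    "\uFF0E": ".",  # FULLWIDTH FULL STOP
    "\uFF61": ".",  # HALFWIDTH IDEOGRAPHIC FULL STOP
}


def map_separators_and_ascii_case(domain: str) -> str:
    """Map dot-like separators to U+002E and lowercase ASCII letters only."""
    mapped = "".join(DOT_LIKE.get(ch, ch) for ch in domain)
    # Non-ASCII characters are left untouched in this sketch.
    return "".join(ch.lower() if ch.isascii() else ch for ch in mapped)


print(map_separators_and_ascii_case("EXAMPLE\u3002COM"))
```
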
>
>
>
>>
>> >> I think I did mention earlier on UTS46 might be okay, depending on the
>> >> details. I am hoping to hear from Mark on the matter.
>> >
>> > I'm not sure what specific questions you have about UTS 46. Can you
>> > reiterate them?
>>
>> You keep talking about UTS 46 as if it were a migration layer, which
>> suggests it might go away. That does not really seem acceptable to me.
>>
>
> UTS 46 will stay around, if only for the mapping layer.
>
> Whether the rest would be used by clients really depends on the progress
> made by registries. As for the deviation-character support, I think
> implementations could stop supporting them if the affected registries
> enforced bundle-or-block. As to the additional symbols, implementations
> could stop supporting them if the registries forbade them.
>
>>
>> It enforces DNS length restrictions on domain names (IDNA2003 did the
>> same), which does not appear to be implemented in browsers. They're
>> fine with a label longer than a hundred code points. I don't think
>> this should be outlawed at the parsing layer because the name might be
>> used outside the DNS.
>>
>
> That was never a topic of discussion in any of the standards discussions.
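
For reference, the length restriction in question can be sketched as a
standalone check that runs after the name has been converted to ASCII. The
limits used here (1 to 63 per label, 1 to 253 for the whole name excluding a
trailing root dot) are the DNS limits as I understand UTS 46 to state them;
the function name is made up for illustration.

```python
MAX_LABEL_LENGTH = 63   # per-label limit
MAX_NAME_LENGTH = 253   # whole name, excluding the root label and its dot


def within_dns_length_limits(ascii_domain: str) -> bool:
    """Return True if an already-ASCII domain name fits the DNS length limits."""
    # Ignore a single trailing dot that denotes the root label.
    name = ascii_domain[:-1] if ascii_domain.endswith(".") else ascii_domain
    if not 1 <= len(name) <= MAX_NAME_LENGTH:
        return False
    return all(1 <= len(label) <= MAX_LABEL_LENGTH for label in name.split("."))


print(within_dns_length_limits("example.com"))
print(within_dns_length_limits("a" * 101 + ".com"))
```

Keeping this as a separate predicate means a URL parser could accept a long
label while a resolver layer rejects it, which is the split asked for in the
quoted paragraph.
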
> 
>
>>
>> I wish it contained the actual ASCII restrictions we need in practice
>> rather than deferring those to the application, but I suppose I can
>> define those in the URL Standard and use UseSTD3ASCIIRules=false.
>>
>> Another wish I have is that the algorithms are a bit clearer in terms
>> of input and output. What argument does ToASCII take? What about
>> ToUnicode?
>>
>> E.g. how would you replace "domain to ASCII" and "domain to Unicode"
>> in http://url.spec.whatwg.org/#concept-host-parser with UTS46 and
>> ensure the algorithm still has the same kind of expected output?
>
>
> http://unicode.org/reports/tr46/#ToASCII
>
> If there are specific areas where you find the spec unclear, I suggest
> that you provide feedback as instructed at the top of the spec. Subsequent
> versions can then clarify those points.
>
>
>> It's not entirely clear to me how to make use of your work.
>>
>
> You may not have meant a singular 'you', but just for clarity: it's not
> "my" work; it is the work of the Unicode consortium, with many individuals
> and companies involved.
> 
>
>>
>>
>> --
>> http://annevankesteren.nl/
>>
>
>