IDN trends

Sun Dec 16 01:15:09 CET 2007

On Dec 15, 2007 9:04 AM, John C Klensin <klensin at jck.com> wrote:
>
>
> --On Saturday, 15 December, 2007 07:55 -0800 Erik van der Poel
> <erikv at google.com> wrote:
>
> >...
> >> I'd prefer the first model, but I think we need to tighten up
> >> the rules about unassigned characters. We'd probably want
> >> implementations that claim conformance to one version of
> >> Unicode to reject FQDNs with characters that are unassigned
> >> in that version of Unicode, so that the implementation does
> >> not leave upper-case as is, or try to perform NFKC on
> >> characters that it does not know about.
> >
> > An example of this is U+03F7, which has a lower-case mapping
> > to U+03F8 in Unicode 4.0. Both of these are unassigned in
> > Unicode 3.2, but Firefox 1.5 and 2 do not reject these
> > characters. Instead they send out two *different* DNS packets,
> > depending on whether the upper-case U+03F7 or the lower-case
> > U+03F8 was present in the original. This is an
> > interoperability problem, since MSIE 7 and Opera 9 both wisely
> > reject such labels. U+03F7 is NEVER in IDNA200X, while U+03F8
> > is ALWAYS.
>
> Thanks for the specific example, which I hadn't had time to dig
> out (I'm on travel again, between planes at the moment).  This
> sort of thing --both wrt case-mapping and wrt NFKC-- is exactly
> why we are in need of a strong ban on unassigned characters. If
> one believes that IDNA implementations can be locked at 3.2 in
> practice then one could claim that Firefox's handling these
> packets at all is a protocol violation since they do not appear
> in the Nameprep/ Stringprep tables.  But it may be another
> illustration of why it is hard or impossible to bind an IDNA
> implementation to a particular version of Unicode.

I think the problem is that IDNA2003 explicitly specifies a flag
called AllowUnassigned, and StringPrep explicitly states that "stored
strings" must not contain unassigned code points while "queries" may
contain them (see section 7 of RFC 3454). Of course, browsers are in
the business of sending queries.
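
To make the stored-string/query distinction concrete, here is a minimal Python sketch. It relies on the standard-library stringprep module, whose in_table_a1() tests membership in RFC 3454 table A.1 (code points unassigned in Unicode 3.2). The function name check_unassigned and its allow_unassigned parameter are invented for this example; they only mimic the AllowUnassigned flag and are not taken from any browser's code.

```python
import stringprep


def check_unassigned(label: str, allow_unassigned: bool) -> str:
    """Return label unchanged, or raise if it holds an unassigned code point.

    Hypothetical helper: allow_unassigned=True models a "query",
    allow_unassigned=False models a "stored string" (RFC 3454, section 7).
    """
    if not allow_unassigned:
        for ch in label:
            # Table A.1 lists the code points unassigned in Unicode 3.2.
            if stringprep.in_table_a1(ch):
                raise ValueError(f"unassigned code point U+{ord(ch):04X}")
    return label
```

Under these assumptions, a label containing U+03F7 passes when treated as a query and is rejected when treated as a stored string.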

IDNA is now being updated, so presumably this issue will be solved for
IDNA, but I wonder whether StringPrep should be updated, and more
importantly, whether people involved with other profiles of StringPrep
should be notified.

> > This has been discussed before, but I wonder whether anyone has
> > changed their mind about this.
>
> I have, if anything, gotten more convinced.  But my opinion is
> not the most important one here.
>
> >> > If, by contrast, we assume that browser vendors (and those
> >> > who produce code for other applications ... once again, if
> >> > we could assume that IDNs will be used only on the web, the
> >...
>
> >> The browsers *already* map on-the-wire forms, i.e. URIs/IRIs
> >> in HTML. In the case of an <a> tag, the user must consciously
> >> click on it, but in the case of an <img> tag, the browser
> >> automatically performs IDNA2003, so there would be
> >> interoperability problems if browsers stopped mapping a la
> >> IDNA2003. Now, one could argue that there are so few
> >> non-ASCII URIs/IRIs on the Web that it wouldn't matter if the
> >> browsers stopped mapping, but I haven't seen any indication
> >> from the browser developers that they will stop mapping.
>
> If they don't, I don't see it as a problem.  But I do see it as
> important that we move toward URLs that are as unambiguous and
> directly comparable as possible.  To take a handy example, while
> one could certainly write IDNA-specific comparison code
> (converting any U-labels to A-labels before comparing), the
> theory behind IDNA suggests that one should not have to
> recognize IDNs and perform that extra operation in order to know
> whether two links should be counted as pointing to the same
> place.

I guess I have a different viewpoint. At Google, we lower-case host
names as a matter of course. For us, IDNA is just another kind of
canonicalization that we must perform.
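
As a rough sketch of what "just another kind of canonicalization" could look like, the Python below uses the built-in "idna" codec, which implements IDNA2003. The helper names canonical_host and same_host are hypothetical and chosen for illustration; this is not a description of any production code.

```python
def canonical_host(host: str) -> str:
    """Hypothetical canonicalizer: IDNA2003 ToASCII per label, then lower-case."""
    # Python's built-in "idna" codec applies Nameprep and Punycode (RFC 3490).
    ascii_host = host.encode("idna").decode("ascii")
    # The codec leaves all-ASCII labels untouched, so fold case explicitly.
    return ascii_host.lower()


def same_host(a: str, b: str) -> bool:
    """Compare two host names after canonicalization."""
    return canonical_host(a) == canonical_host(b)
```

With this approach a U-label and its A-label compare equal without the caller having to recognize IDNs first, at the cost of running the conversion on every host name.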

> >> In which case, I'd rather have a *descriptive* spec that
> >> explains what the browsers, etc are doing, than a
> >> *prescriptive* spec that tries to tell them to stop mapping.
> >> I have always liked the tendency of RFCs to be descriptive
> >> rather than prescriptive, though of course there has to be
> >> some balance between the two, particularly during the initial
> >> stages of a protocol's adoption.
> >
> > Of course, the IDNA200X protocol draft does not tell
> > developers to stop mapping, but it does not give an exact
> > description of the mapping either.
> >
> > I might even be OK with an Experimental RFC that precisely
> > describes how the mappings are derived from any current or
> > future version of Unicode, as long as these details are
> > written down somewhere.
>
> While I had hoped to avoid it, largely because of concerns about
> available time and cycles, you are making what seems to me to be
> a strong case for a document that describes the types of
> mappings that might be appropriate in various circumstances.  My
> gut instinct is that, e.g., case mappings, at least silent case
> mappings, may not be a good idea when the users aren't used to
> looking at scripts that normally handle case and that, for
> systems localized for such users and for users who might expect
> special handling of the odd cases (such as the notorious Turkic
> dotless "i") warnings or rejection might be better than case
> mappings.  (I'd like to hear from Gerv and others on the browser
> side about that subject.)

I think we need to draw a very clear distinction between FQDNs that
are processed with or without user intervention. In a UI, it is
appropriate to perform some mappings and it is important to show the
result of those mappings, to confirm the user's input. If the user is
confused by the result, they can press the Help button (or whatever).
In a browser, the user might type an IRI in the Location bar. In an
HTML editor, the author might type an IRI in the link editor.

But FQDNs go on the wire (i.e. DNS and HTTP in the case of a browser)
without user intervention when they appear in image tags such as <img
src="http://www.example.com/images/foo.gif">. URIs/IRIs in <a> tags
*are* shown to the user when they hover over the link with their
mouse, but the browser typically runs the FQDN through ToASCII and
ToUnicode first (or shows Punycode if the Unicode version is
considered dangerous) and does not typically allow the user to edit
the FQDN before accessing it.
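
The ToASCII/ToUnicode round trip described above might be sketched in Python as follows, again using the built-in "idna" codec (IDNA2003). The function display_host and the looks_dangerous predicate are assumptions made for this example; real browsers apply their own, more elaborate policies to decide when to fall back to Punycode.

```python
from typing import Callable


def display_host(host: str, looks_dangerous: Callable[[str], bool]) -> str:
    """Return the form of host to show the user (hypothetical helper)."""
    # ToASCII: the form that would go on the wire.
    a_form = host.encode("idna").decode("ascii")
    # ToUnicode: the mapped Unicode form the user could be shown.
    u_form = a_form.encode("ascii").decode("idna")
    # looks_dangerous is a caller-supplied, hypothetical policy check.
    return a_form if looks_dangerous(u_form) else u_form
```

Because the displayed string is derived from the on-the-wire form, the user sees the result of the mappings and not the raw text from the page.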

There has to be a single and clear spec for FQDNs that are processed
without user edits. That is why one of my earlier emails outlined
*three* separate specs, namely the protocol, the mappings and the UI
recommendations. The mappings and the UI recommendations could
conceivably live in the same document, but I would argue that we
should clearly distinguish the two so that there is absolutely no
question about the exact mappings that must be performed when the FQDN
is not edited by the user.

Erik