IDN trends

Sat Dec 15 16:55:10 CET 2007

On Dec 14, 2007 7:02 PM, Erik van der Poel <erikv at google.com> wrote:
> On Dec 14, 2007 11:58 AM, John C Klensin <klensin at jck.com> wrote:
> > I don't know where this leaves us.  If our primary criterion is
> > to avoid disturbing any current use or representation of IDNs
> > (in or out of guidelines), then I think we will end up with
> > either a "no changes at all unless they are strictly expansions
> > of the set of valid strings" model or some approximation of the
> > "one set of rules for strings that use nothing but Unicode 3.2
> > characters and another set for strings that use anything more
> > recent" model that I outlined in an earlier note.  As Patrik
> > indirectly pointed out, the first model essentially requires
> > that, once the Unicode Consortium assigns a property value, that
> > property value must be fixed forever, with no possibility of
> > revision, to be stable enough (a look-aside list for previous
> > values doesn't work unless it becomes just a way to create that
> > immutable list).   And the second probably gets us into silly
> > states if we preserve the dot-mappings and hence require that
> > any processing that is conditioned on whether or not characters
> > appear in Unicode 3.2 be based on an entire FQDN, not individual
> > labels (e.g., the same label-string could be interpreted
> > differently depending on the FQDN in which it occurred).
>
> I'd prefer the first model, but I think we need to tighten up the
> rules about unassigned characters. We'd probably want implementations
> that claim conformance to one version of Unicode to reject FQDNs with
> characters that are unassigned in that version of Unicode, so that the
> implementation does not leave upper-case as is, or try to perform NFKC
> on characters that it does not know about.

An example of this is U+03F7, which has a lower-case mapping to U+03F8
in Unicode 4.0. Both of these are unassigned in Unicode 3.2, but
Firefox 1.5 and 2 do not reject these characters. Instead they send
out two *different* DNS packets, depending on whether the upper-case
U+03F7 or the lower-case U+03F8 was present in the original. This is
an interoperability problem, since MSIE 7 and Opera 9 both wisely
reject such labels. In the IDNA200X tables, U+03F7 is classified as
NEVER, while U+03F8 is classified as ALWAYS.
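
To make the "reject unassigned characters" idea concrete, here is a
minimal sketch in Python. It relies on the standard stringprep module,
whose in_table_a1() tests membership in RFC 3454 table A.1 (code points
unassigned in Unicode 3.2). The function names
unassigned_in_unicode_3_2 and reject_unassigned are my own, purely for
illustration; this is not what any of the browsers above actually run.

```python
import stringprep


def unassigned_in_unicode_3_2(label):
    """Return the characters in label that Unicode 3.2 leaves unassigned."""
    # in_table_a1 checks RFC 3454 table A.1 (unassigned in Unicode 3.2).
    return [ch for ch in label if stringprep.in_table_a1(ch)]


def reject_unassigned(label):
    """Hypothetical strict check: refuse labels with unassigned characters."""
    bad = unassigned_in_unicode_3_2(label)
    if bad:
        names = ", ".join("U+%04X" % ord(ch) for ch in bad)
        raise ValueError("unassigned in Unicode 3.2: " + names)
    return label
```

With a check like this in front of the mapping step, a label containing
either U+03F7 or U+03F8 would be refused outright instead of being
case-folded (or not) depending on which Unicode tables the
implementation happens to ship.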

This has been discussed before, but I wonder whether anyone has
changed their mind about this.

> > If, by contrast, we assume that browser vendors (and those who
> > produce code for other applications ... once again, if we could
> > assume that IDNs will be used only on the web, the problems
> > would change considerably) will be reasonably careful about what
> > is right and reasonable for their audiences, then it becomes
> > reasonable to predict little change for the typical end-user.
> > Things that they expect to be mapped will be mapped.  Things
> > that they don't expect to be mapped will be less likely to be
> > mapped, but that will reduce confusion about unfamiliar
> > variations of unfamiliar characters.  And the on-the-wire forms
> > will not be mapped at all, which will reduce several
> > opportunities for confusion and the potential for
> > interoperability failures.
>
> The browsers *already* map on-the-wire forms, i.e. URIs/IRIs in HTML.
> In the case of an <a> tag, the user must consciously click on it, but
> in the case of an <img> tag, the browser automatically performs
> IDNA2003, so there would be interoperability problems if browsers
> stopped mapping a la IDNA2003. Now, one could argue that there are so
> few non-ASCII URIs/IRIs on the Web that it wouldn't matter if the
> browsers stopped mapping, but I haven't seen any indication from the
> browser developers that they will stop mapping.
>
> In which case, I'd rather have a *descriptive* spec that explains what
> the browsers, etc. are doing, than a *prescriptive* spec that tries to
> tell them to stop mapping. I have always liked the tendency of RFCs to
> be descriptive rather than prescriptive, though of course there has to
> be some balance between the two, particularly during the initial
> stages of a protocol's adoption.

Of course, the IDNA200X protocol draft does not tell developers to
stop mapping, but it does not give an exact description of the mapping
either.

I might even be OK with an Experimental RFC that precisely describes
how the mappings are derived from any current or future version of
Unicode, as long as these details are written down somewhere.

Presumably, the IDNA200X protocol document would re-enter the
Standards Track at Proposed Standard?

> > We also need to remember that, if the predictions heard around
> > ICANN, IGF, and similar forums are to be believed, there is
> > almost no use of IDNs today compared to what we will see in a
> > few years.  If those predictions are to be believed, then this
> > is our opportunity to learn from problems that we have seen in
> > IDNA2003 and get things right before IDNs really take off.  The
> > alternative is to carry the mistakes we have made and
> > infelicitous features we have created forward and have to live
> > with them forever, a decision that will certainly lead to a
> > louder chorus of a claim that has already been made, i.e., that
> > IDNs are inherently discriminatory against any language other
> > than English.
>
> I don't really understand the last sentence. Would you please give a
> couple of examples? Maybe the German eszett or the Turkish dotted and
> dotless i?

I just checked, and the upper-case I-with-dot (İ) is preserved by
IDNA2003's ToUnicode(ToASCII(x)), and so is the lower-case
i-without-dot (ı). The Eszett (ß), however, gets mapped to ss, so it
is lost. Is this the kind of thing you are referring to when you say
"discriminatory against any language other than English"?
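
For anyone who wants to repeat the check, a rough sketch using
Python's built-in "idna" codec, which is documented as implementing
RFC 3490 (IDNA2003). Encoding approximates ToASCII and decoding
approximates ToUnicode; the helper name idna2003_round_trip is just
something I made up for this note, and other implementations may
differ in details.

```python
def idna2003_round_trip(label):
    """Approximate ToUnicode(ToASCII(label)) with Python's idna codec."""
    ascii_form = label.encode("idna")   # ToASCII: nameprep, then Punycode
    return ascii_form.decode("idna")    # ToUnicode: decode the ACE form


# Eszett is mapped to "ss" by nameprep, so the original spelling is lost.
eszett_result = idna2003_round_trip("stra\u00dfe")

# Lower-case dotless i (U+0131) has no mapping and survives the round trip.
dotless_i_result = idna2003_round_trip("\u0131")
```

Here eszett_result comes back as "strasse", while dotless_i_result is
still U+0131.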

Erik