IDN trends

Sat Dec 15 04:02:41 CET 2007

On Dec 14, 2007 11:58 AM, John C Klensin <klensin at jck.com> wrote:
>
>
> --On Thursday, 13 December, 2007 19:50 -0800 Erik van der Poel
> <erikv at google.com> wrote:
>
> >...
> > So, 15% is an upper bound for that number in Nov 2006 and 4%
> > is an upper bound for that number in Nov 2007. Either way, 8%
> > could still fall between these numbers, so I'm satisfied with
> > the results.
> >
> > Just because this number is falling does not mean that it will
> > ever reach zero, nor can I predict what the browser developers
> > will do. They may try to force things, by removing the
> > nfkc/case/dot mapping. Or they may not. I don't know.
>
> Erik,
>
> For whatever it is worth, the browser developers have already
> shown a willingness to interpret (or, if necessary, ignore) the
> standard in order to meet their criteria for protecting their
> customers.  The most obvious example is the provision of RFC
> 3490 that _requires_ that punycode-style labels be displayed
> only under exceptional circumstances.  The common practice today
> is to display those native-form strings only when some trigger
> criterion is met (choice of TLD for some browsers, availability
> of the relevant characters as part of an installed language for
> at least one, possibly other criteria for still others).

I wasn't referring to how labels would be displayed. (See below.)

> The discussions with them in early stages after the "paypal"
> incident led to recommendations (some of them, if I recall, from
> people who are now opposing removal of mappings from the
> standard) to display only lower case for IDNs, or more
> generally, only the result of reverse-mapping through ToASCII
> and back, regardless of what was in the reference on the grounds
> that the target strings were less easily confusable than the
> range of possible sources.  So, for browsers that followed those
> recommendations, users are not seeing the un-mapped strings even
> if they appear in URLs.

Again, I wasn't referring to how the labels are displayed.

> I don't know where this leaves us.  If our primary criterion is
> to avoid disturbing any current use or representation of IDNs
> (in or out of guidelines), then I think we will end up with
> either a "no changes at all unless they are strictly expansions
> of the set of valid strings" model or some approximation of the
> "one set of rules for strings that use nothing but Unicode 3.2
> characters and another set for strings that use anything more
> recent" model that I outlined in an earlier note.  As Patrik
> indirectly pointed out, the first model essentially requires
> that, once the Unicode Consortium assigns a property value, that
> property value must be fixed forever, with no possibility of
> revision, to be stable enough (a look-aside list for previous
> values doesn't work unless it becomes just a way to create that
> immutable list).   And the second probably gets us into silly
> states if we preserve the dot-mappings and hence require that
> any processing that is conditioned on whether or not characters
> appear in Unicode 3.2 be based on an entire FQDN, not individual
> labels (e.g., the same label-string could be interpreted
> differently depending on the FQDN in which it occurred).

I'd prefer the first model, but I think we need to tighten up the
rules about unassigned characters. We'd probably want implementations
that claim conformance to one version of Unicode to reject FQDNs with
characters that are unassigned in that version of Unicode, so that the
implementation neither leaves an upper-case character as is nor tries
to perform NFKC on characters that it does not know about.
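
A minimal sketch of such a check, assuming the implementation claims
conformance to Unicode 3.2 (the version IDNA2003 is defined against)
and using the Unicode 3.2 snapshot that Python ships as
unicodedata.ucd_3_2_0. The function names are hypothetical and only
illustrate the idea; this is not taken from any browser or IDNA
library.

```python
import unicodedata

# Assumption: the implementation claims conformance to Unicode 3.2.
# Python exposes that version of the character database as ucd_3_2_0.
UCD = unicodedata.ucd_3_2_0


def unassigned_code_points(fqdn):
    """Return the characters of fqdn that are unassigned (Cn) in Unicode 3.2."""
    return [ch for ch in fqdn if UCD.category(ch) == "Cn"]


def check_fqdn(fqdn):
    """Hypothetical gate: reject the whole FQDN rather than guess at mappings."""
    unknown = unassigned_code_points(fqdn)
    if unknown:
        names = ", ".join("U+%04X" % ord(ch) for ch in unknown)
        raise ValueError("unassigned in Unicode 3.2: " + names)


check_fqdn("b\u00fccher.example")            # all assigned in 3.2, passes
try:
    check_fqdn("stra\u1e9ee.example")        # U+1E9E was added after 3.2
except ValueError as err:
    print(err)
```

Rejecting the whole FQDN, rather than a single label, matches the
wording above; an implementation tied to a newer Unicode version would
swap in that version's database.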

> If, by contrast, we assume that browser vendors (and those who
> produce code for other applications ... once again, if we could
> assume that IDNs will be used only on the web, the problems
> would change considerably) will be reasonably careful about what
> is right and reasonable for their audiences, then it becomes
> reasonable to predict little change for the typical end-user.
> Things that they expect to be mapped will be mapped.  Things
> that they don't expect to be mapped will be less likely to be
> mapped, but that will reduce confusion about unfamiliar
> variations of unfamiliar characters.  And the on-the-wire forms
> will not be mapped at all, which will reduce several
> opportunities for confusion and the potential for
> interoperability failures.

The browsers *already* map on-the-wire forms, i.e. URIs/IRIs in HTML.
In the case of an <a> tag, the user must consciously click on it, but
in the case of an <img> tag, the browser automatically performs
IDNA2003, so there would be interoperability problems if browsers
stopped mapping a la IDNA2003. Now, one could argue that there are so
few non-ASCII URIs/IRIs on the Web that it wouldn't matter if the
browsers stopped mapping, but I haven't seen any indication from the
browser developers that they will stop mapping.

In which case, I'd rather have a *descriptive* spec that explains what
the browsers etc. are doing, than a *prescriptive* spec that tries to
tell them to stop mapping. I have always liked the tendency of RFCs to
be descriptive rather than prescriptive, though of course there has to
be some balance between the two, particularly during the initial
stages of a protocol's adoption.
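
To make the IDNA2003 mapping discussed above concrete, here is a
small sketch using Python's built-in "idna" codec, which implements
RFC 3490 ToASCII with Nameprep. It only illustrates the case and dot
mappings; it is not a claim about how any particular browser is
implemented, and the host names are made-up examples.

```python
def to_wire(host):
    """Map a host name a la IDNA2003 and return its ASCII (ACE) form."""
    return host.encode("idna")


# Three spellings that might appear in an href or src attribute:
# lower case, upper case, and with U+3002 IDEOGRAPHIC FULL STOP as the dot.
variants = [
    "b\u00fccher.example",
    "B\u00dcCHER.example",
    "b\u00fccher\u3002example",
]

for v in variants:
    print(repr(v), "->", to_wire(v))
```

All three spellings come out as the same ACE string, which is why a
page that uses the unmapped forms would stop resolving if the mapping
step were removed.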

> We also need to remember that, if the predictions heard around
> ICANN, IGF, and similar forums are to be believed, there is
> almost no use of IDNs today compared to what we will see in a
> few years.  If those predictions are to be believed, then this
> is our opportunity to learn from problems that we have seen in
> IDNA2003 and get things right before IDNs really take off.  The
> alternative is to carry the mistakes we have made and
> infelicitous features we have created forward and have to live
> with them forever, a decision that will certainly lead to a
> louder chorus of a claim that has already been made, i.e., that
> IDNs are inherently discriminatory against any language other
> than English.

I don't really understand the last sentence. Would you please give a
couple of examples? Maybe the German eszett or the Turkish dotted and
dotless i?

Erik