IDN trends

Fri Dec 14 20:58:37 CET 2007

--On Thursday, 13 December, 2007 19:50 -0800 Erik van der Poel
<erikv at google.com> wrote:

>...
> So, 15% is an upper bound for that number in Nov 2006 and 4%
> is an upper bound for that number in Nov 2007. Either way, 8%
> could still fall between these numbers, so I'm satisfied with
> the results.
> 
> Just because this number is falling does not mean that it will
> ever reach zero, nor can I predict what the browser developers
> will do. They may try to force things, by removing the
> nfkc/case/dot mapping. Or they may not. I don't know.

Erik,

For whatever it is worth, the browser developers have already
shown a willingness to interpret (or, if necessary, ignore) the
standard in order to meet their criteria for protecting their
customers.  The most obvious example is the provision of RFC
3490 that _requires_ that punycode-style labels be displayed
only under exceptional circumstances.  The common practice today
is instead to display the native-character forms only when some
trigger criterion is met (choice of TLD for some browsers,
availability of the relevant characters as part of an installed
language for at least one, possibly other criteria for still
others).

The discussions with them in the early stages after the "paypal"
incident led to recommendations (some of them, if I recall, from
people who are now opposing removal of mappings from the
standard) to display only lower case for IDNs or, more
generally, only the result of reverse-mapping through ToASCII
and back, regardless of what was in the reference, on the
grounds that the target strings were less easily confusable
than the range of possible sources.  So, for browsers that
followed those recommendations, users are not seeing the
un-mapped strings even if they appear in URLs.
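
As a rough illustration of that display rule (a sketch only, not any
browser's actual code), the round trip can be tried with Python's
built-in "idna" codec, which implements the RFC 3490 ToASCII and
ToUnicode operations.  The helper name display_form is hypothetical,
and the sample labels are chosen only for illustration.

```python
def display_form(name: str) -> str:
    """Return the IDNA2003 round-trip form of a domain name.

    Hypothetical helper: encode with ToASCII, then decode with
    ToUnicode, so that only the mapped target string is shown.
    """
    ace = name.encode("idna")   # ToASCII applied to each label
    return ace.decode("idna")   # ToUnicode applied to each label


if __name__ == "__main__":
    # Upper case in a non-ASCII label is folded by the nameprep step.
    print(display_form("BÜCHER"))
    # U+3002 (ideographic full stop) is accepted as a label separator.
    print("bücher\u3002example".encode("idna"))
```

Note that this particular codec passes labels that are already pure
ASCII through unchanged, so the case folding shows up only for labels
that actually need the mapping step.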

I don't know where this leaves us.  If our primary criterion is
to avoid disturbing any current use or representation of IDNs
(in or out of guidelines), then I think we will end up with
either a "no changes at all unless they are strictly expansions
of the set of valid strings" model or some approximation of the
"one set of rules for strings that use nothing but Unicode 3.2
characters and another set for strings that use anything more
recent" model that I outlined in an earlier note.  As Patrik
indirectly pointed out, the first model essentially requires
that, once the Unicode Consortium assigns a property value, that
value be fixed forever, with no possibility of revision, in
order to be stable enough (a look-aside list for previous values
doesn't work unless it becomes just a way to create that
immutable list).  And the second probably gets us into silly
states if we preserve the dot-mappings and hence require that
any processing that is conditioned on whether or not characters
appear in Unicode 3.2 be based on an entire FQDN, not individual
labels (e.g., the same label-string could be interpreted
differently depending on the FQDN in which it occurred).
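
To make the label-versus-FQDN point concrete, here is a small sketch.
It assumes that "appears in Unicode 3.2" means "is an assigned code
point in the Unicode 3.2 database", and uses Python's
unicodedata.ucd_3_2_0 object to check that.  The function names and
the two rule-set labels are invented for illustration, and U+1E9E is
used only as an example of a character assigned after 3.2.

```python
import unicodedata

OLD_RULES = "idna2003-rules"   # hypothetical name for the Unicode 3.2 rule set
NEW_RULES = "new-rules"        # hypothetical name for the newer rule set


def in_unicode_3_2(text: str) -> bool:
    """True if every character was assigned as of Unicode 3.2."""
    # Category "Cn" means "not assigned" in the 3.2.0 database.
    return all(unicodedata.ucd_3_2_0.category(ch) != "Cn" for ch in text)


def rules_for_label(label: str) -> str:
    """Choose a rule set by looking at one label in isolation."""
    return OLD_RULES if in_unicode_3_2(label) else NEW_RULES


def rules_for_fqdn(fqdn: str) -> str:
    """Choose a rule set by looking at the whole name.

    Assumption: with dot-mapping preserved, labels cannot be split
    reliably before choosing a rule set, so the full FQDN is examined.
    """
    return OLD_RULES if in_unicode_3_2(fqdn) else NEW_RULES


if __name__ == "__main__":
    label = "bücher"
    print(rules_for_label(label))
    print(rules_for_fqdn(label + ".example"))
    print(rules_for_fqdn(label + ".\u1e9e.example"))
```

Under these assumptions the label "bücher" falls under the older rules
when taken alone or inside a name made up entirely of Unicode 3.2
characters, but under the newer rules as soon as some other label in
the same FQDN contains a later character.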

If, by contrast, we assume that browser vendors (and those who
produce code for other applications ... once again, if we could
assume that IDNs will be used only on the web, the problems
would change considerably) will be reasonably careful about what
is right and appropriate for their audiences, then it becomes
plausible to predict little change for the typical end-user.
Things that they expect to be mapped will be mapped.  Things
that they don't expect to be mapped will be less likely to be
mapped, but that will reduce confusion about unfamiliar
variations of unfamiliar characters.  And the on-the-wire forms
will not be mapped at all, which will reduce several
opportunities for confusion and the potential for
interoperability failures.

We also need to remember that, if the predictions heard around
ICANN, IGF, and similar forums are to be believed, there is
almost no use of IDNs today compared to what we will see in a
few years.  If so, then this is our opportunity to learn from
the problems that we have seen in IDNA2003 and get things right
before IDNs really take off.  The alternative is to carry the
mistakes we have made and the infelicitous features we have
created forward and have to live with them forever, a decision
that will certainly lead to a louder chorus repeating a claim
that has already been made, i.e., that IDNs are inherently
discriminatory against any language other than English.

     john