IDN trends

John C Klensin klensin at
Sat Dec 15 18:04:25 CET 2007

--On Saturday, 15 December, 2007 07:55 -0800 Erik van der Poel
<erikv at> wrote:

>> I'd prefer the first model, but I think we need to tighten up
>> the rules about unassigned characters. We'd probably want
>> implementations that claim conformance to one version of
>> Unicode to reject FQDNs with characters that are unassigned
>> in that version of Unicode, so that the implementation does
>> not leave upper-case as is, or try to perform NFKC on
>> characters that it does not know about.
> An example of this is U+03F7, which has a lower-case mapping
> to U+03F8 in Unicode 4.0. Both of these are unassigned in
> Unicode 3.2, but Firefox 1.5 and 2 do not reject these
> characters. Instead they send out two *different* DNS packets,
> depending on whether the upper-case U+03F7 or the lower-case
> U+03F8 was present in the original. This is an
> interoperability problem, since MSIE 7 and Opera 9 both wisely
> reject such labels. U+03F7 is NEVER in IDNA200X, while U+03F8
> is ALWAYS.

Thanks for the specific example, which I hadn't had time to dig
out (I'm on travel again, between planes at the moment).  This
sort of thing --both wrt case-mapping and wrt NFKC-- is exactly
why we need a strong ban on unassigned characters.  If one
believes that IDNA implementations can be locked to Unicode 3.2
in practice, then one could claim that Firefox's handling of
these characters at all is a protocol violation, since they do
not appear in the Nameprep/Stringprep tables.  But it may be
another
illustration of why it is hard or impossible to bind an IDNA
implementation to a particular version of Unicode.
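
Erik's example can be reproduced directly in Python, whose
unicodedata module ships a frozen Unicode 3.2 snapshot (the
version Stringprep is defined against) alongside the current
database; the rejection function below is only my sketch of the
"strong ban" under discussion, not proposed protocol text:

```python
import unicodedata

# U+03F7 (GREEK CAPITAL LETTER SHO) is unassigned in Unicode 3.2 but
# assigned from Unicode 4.0 on, with a lower-case mapping to U+03F8.
sho = '\u03f7'

print(unicodedata.ucd_3_2_0.category(sho))  # 'Cn' -- unassigned in 3.2
print(unicodedata.category(sho))            # 'Lu' -- assigned later
print(sho.lower() == '\u03f8')              # True: the case pair exists now

def reject_unassigned_3_2(label):
    """Sketch of the 'strong ban': refuse any label containing a
    code point that Unicode 3.2 leaves unassigned."""
    for ch in label:
        if unicodedata.ucd_3_2_0.category(ch) == 'Cn':
            raise ValueError('unassigned in Unicode 3.2: U+%04X' % ord(ch))
    return label
```

An implementation that applied such a check would refuse both
U+03F7 and U+03F8, instead of (as Firefox 1.5/2 reportedly did)
emitting two different DNS queries for the two spellings.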

> This has been discussed before, but I wonder whether anyone has
> changed their mind about this.

I have, if anything, gotten more convinced.  But my opinion is
not the most important one here.

>> > If, by contrast, we assume that browser vendors (and those
>> > who produce code for other applications ... once again, if
>> > we could assume that IDNs will be used only on the web, the

>> The browsers *already* map on-the-wire forms, i.e. URIs/IRIs
>> in HTML. In the case of an <a> tag, the user must consciously
>> click on it, but in the case of an <img> tag, the browser
>> automatically performs IDNA2003, so there would be
>> interoperability problems if browsers stopped mapping a la
>> IDNA2003. Now, one could argue that there are so few
>> non-ASCII URIs/IRIs on the Web that it wouldn't matter if the
>> browsers stopped mapping, but I haven't seen any indication
>> from the browser developers that they will stop mapping.

If they don't, I don't see it as a problem.  But I do see it as
important that we move toward URLs that are as unambiguous and
directly comparable as possible.  To take a handy example, while
one could certainly write IDNA-specific comparison code
(converting any U-labels to A-labels before comparing), the
theory behind IDNA suggests that one should not have to
recognize IDNs and perform that extra operation in order to know
whether two links should be counted as pointing to the same
target.

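The "extra operation" in question can be sketched with Python's
built-in idna codec, which implements the IDNA2003-era
ToASCII/Nameprep rules (an illustration of the comparison step,
not a recommendation):

```python
# Sketch: compare two hostnames by converting every label to its
# A-label (ACE) form first, via Python's built-in "idna" codec.
def same_host(a, b):
    return a.encode('idna') == b.encode('idna')

# Case-mapped and already-lower-case spellings compare equal:
print(same_host('BÜCHER.example', 'bücher.example'))  # True
# This is exactly the IDN-aware recognition step that, per the
# theory behind IDNA, one should not need just to tell whether
# two links point to the same place.
```
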
>> In which case, I'd rather have a *descriptive* spec that
>> explains what the browsers, etc are doing, than a
>> *prescriptive* spec that tries to tell them to stop mapping.
>> I have always liked the tendency of RFCs to be descriptive
>> rather than prescriptive, though of course there has to be
>> some balance between the two, particularly during the initial
>> stages of a protocol's adoption.
> Of course, the IDNA200X protocol draft does not tell
> developers to stop mapping, but it does not give an exact
> description of the mapping either.
> I might even be OK with an Experimental RFC that precisely
> describes how the mappings are derived from any current or
> future version of Unicode, as long as these details are
> written down somewhere.

While I had hoped to avoid it, largely because of concerns about
available time and cycles, you are making what seems to me to be
a strong case for a document that describes the types of
mappings that might be appropriate in various circumstances.  My
gut instinct is that case mappings, at least silent ones, may not
be a good idea for users who are not accustomed to scripts that
normally handle case.  For systems localized for such users, and
for users who might expect special handling of the odd cases
(such as the notorious Turkic dotless "i"), warnings or rejection
might be better than silent case mapping.  (I'd like to hear from
Gerv and others on the browser side about that subject.)
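
A locale-sensitive policy of the kind gestured at above might
look something like the following sketch.  The function name,
the locale codes, and the warn-versus-map split are all my own
hypothetical choices, not anything specified by IDNA:

```python
# Hypothetical policy sketch: silently case-map only for locales
# whose users routinely read case-distinguishing scripts; for
# Turkic locales, where I/i casing is language-sensitive (I<->i
# vs. I<->dotless i), reject rather than map silently.
def prepare_label(label, locale):
    if locale in ('tr', 'az'):        # Turkic: casing is special
        if label != label.lower():    # any upper case present?
            raise ValueError(
                'refusing silent case-mapping for locale %r' % locale)
        return label
    return label.lower()              # elsewhere: silent case-map
```

Note that this uses Python's locale-independent str.lower(); a
real implementation would need genuinely language-aware casing,
which is precisely the difficulty being discussed.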

Such a document could describe the mapping situation, review the
IDNA2003 rules and the compatibility issues, discuss the
tradeoffs, etc.  It might be Experimental, as you suggest, or
Informational, or even a BCP.   If you and others think that
would be of significant help, I'll start drafting (although I'd
be very pleased if someone else wanted to take the job on).

> Presumably, the IDNA200X protocol document would re-enter the
> Standards Track at Proposed?

Yes, absolutely.

>> > We also need to remember that, if the predictions heard
>> > around ICANN, IGF, and similar forums are to be believed,
>> > there is almost no use of IDNs today compared to what we
>> > expect in the future.  The alternative is to carry
>> > the mistakes we have made and infelicitous features we have
>> > created forward and have to live with them forever, a
>> > decision that will certainly lead to a louder chorus of a
>> > claim that has already been made, i.e., that IDNs are
>> > inherently discriminatory against any language other than
>> > English.
>> I don't really understand the last sentence. Would you please
>> give a couple of examples? Maybe the German eszett or the
>> Turkish dotted and dotless i?

To understand comments like that one, you need to appreciate the
political side of this design effort.  There seem to be a number
of parties who would like to reach the conclusion that IDNs are
discriminatory against their languages, or that they are an
attempt to keep the Internet "in English" (even though we all
know better as far as content is concerned), and who are looking
for things to point at to "prove" their position.  If someone had
the right (or wrong) incentives, such Latin-script oddities as
Eszett and the Turkish "i"s would count.  But I'm more concerned
about problems with Arabic (e.g., should final-form characters
match medial ones, a question to which the best answer is
probably "sometimes") and Indic scripts (e.g., how does one
decide whether to apply a Hindi or a Nepali rendering without
knowing the language, while having far too few characters to
guess?).

> I just checked, and the upper-case I-with-dot (İ) is
> preserved by IDNA2003's ToUnicode(ToASCII(x)), and so is the
> lower-case i-without-dot (ı). The Eszett (ß), however, gets
> mapped to ss, so it is lost. Is this the kind of thing you are
> referring to when you say "discriminatory against any language
> other than English"?

Not specifically, but see above.  Eszett is further complicated
by the fact that standard orthography in some German-speaking
areas encourages mapping it out (quite independent of anything
having to do with IDNA) while standard orthography in others
encourages preserving it and considers the "ss" mapping to be
appropriate only when required by typographical convenience.
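
Erik's observations about ı and ß are easy to reproduce with
Python's idna codec, which implements the IDNA2003-era
ToASCII/ToUnicode operations (illustrative only):

```python
# IDNA2003's ToUnicode(ToASCII(x)) preserves the Turkish dotless i
# but folds the German Eszett to "ss", per the Nameprep tables.
dotless = '\u0131'   # LATIN SMALL LETTER DOTLESS I (ı)
eszett  = '\u00df'   # LATIN SMALL LETTER SHARP S (ß)

print(dotless.encode('idna').decode('idna') == dotless)  # True: survives
print(eszett.encode('idna'))                             # b'ss': lost
```
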

I've got to go catch a plane, but will try to construct a note
that summarizes what I believe to be the absolute minimum list
of changes we need to make to IDNA2003 to provide stable and
predictable operation long-term and mail it when I next touch
down (many hours from now).

