Table-building

John C Klensin klensin at jck.com
Fri Feb 2 00:16:26 CET 2007



--On Thursday, 01 February, 2007 13:49 -0800 Erik van der Poel
<erikv at google.com> wrote:

>> It is clearly stupid of any registry to allow the
>> registration of such characters, given that the property MAY
>> end up false. It is equally stupid of any application
>> developer to deny the attempt to lookup such characters,
>> given that the property MAY end up true.
> 
> Somewhat related to this, MSIE7 currently does not allow the
> lookup of a URL containing U+03F7 or U+03F8. Firefox 1.5, on the
> other hand, will cheerfully lookup xn--mza or xn--nza,
> respectively, even though U+03F7 has a lower-case mapping to
> U+03F8 in Unicode 4.0.
> 
> I think I prefer MSIE7's behavior.

This is _exactly_ the reason why we are arguing that unassigned
code points should not be looked up -- thanks for the excellent
example.  (To save people the checking I just did, these two
code points are unassigned in Unicode 3.2, the relevant version
for the current IDNA standard, but were added in 4.0).   
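
For anyone who wants to repeat that check, here is a minimal
sketch. It assumes a Python 3 interpreter, whose unicodedata module
ships a frozen Unicode 3.2 database (ucd_3_2_0) next to the current
one. The helper name status() is mine, not part of any library.

```python
import unicodedata
from unicodedata import ucd_3_2_0  # frozen Unicode 3.2 database


def status(cp):
    """Report how one code point looks in Unicode 3.2 and in the current tables."""
    ch = chr(cp)
    return {
        # General category "Cn" means "not assigned" in that database.
        "unassigned_in_3_2": ucd_3_2_0.category(ch) == "Cn",
        "assigned_now": unicodedata.category(ch) != "Cn",
        # ACE form of a one-character label, as a client would send it.
        "ace_label": "xn--" + ch.encode("punycode").decode("ascii"),
    }


for cp in (0x03F7, 0x03F8):
    print(f"U+{cp:04X}", status(cp))
```

The ACE labels it prints should match the xn--mza and xn--nza
that Erik quoted.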

Let's walk through this:

* The IDNA2003 requirement is that putative labels containing
unassigned code points are looked up.  So Firefox is behaving
according to the standard and MSIE7 is not in conformance.  

* However, were we to upgrade IDNA2003 to the current version of
Unicode without making any other changes, U+03F7 would become
invalid for actual lookup because we would presumably expect it
to be mapped to U+03F8 in Stringprep's case-mapping function.

Now let's assume that a registry, following IDNA2003bis,
registers a label containing U+03F8.  Assume a user then types
in a domain name that contains U+03F7 to her favorite browser.
We then have:

MSIE7.0: will not look it up, let alone resolve it.  Presumably,
it will tell the user that the label is invalid, _not_ that it is
not found.  That distinction is very important.   Armed with the
knowledge (perhaps after a discussion with the owner of the
name) the user will start clamoring for MSIE7.1.

MSIE7.1, which was upgraded to IDNA2003bis, will map U+03F7 to
U+03F8, which will be looked up successfully.

Firefox 1.5, which is presumably stuck forever on IDNA2003 and
Unicode 3.2, will look up the Punycode-converted version of the
label containing U+03F7.  It will get an authoritative "not
found" since only the label containing U+03F8 is in the DNS.
That false negative is _very_ bad news.

Firefox 2.something, which was upgraded to IDNA2003bis, will
work exactly the way MSIE7.1 works, which I think is the desired
behavior.
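
The three outcomes above can be modelled in a few lines, using the
two code points from Erik's example (U+03F7 and U+03F8). This is
only a sketch under stated assumptions: ZONE is a made-up zone
holding the one registered label, the client functions are
hypothetical stand-ins for the browser roles (not their real code),
and Python's str.lower() stands in for Stringprep's case-mapping
step.

```python
from unicodedata import ucd_3_2_0  # frozen Unicode 3.2 database

SHO_UPPER = "\u03f7"  # unassigned in 3.2, upper case since 4.0
SHO_LOWER = "\u03f8"  # unassigned in 3.2, lower case since 4.0


def to_ace(label):
    """Punycode-encode a label and add the ACE prefix."""
    return "xn--" + label.encode("punycode").decode("ascii")


# Hypothetical zone: the registry stored only the lower-case form.
ZONE = {to_ace(SHO_LOWER)}


def query(label):
    return "found" if to_ace(label) in ZONE else "not found"


def unassigned_in_3_2(ch):
    return ucd_3_2_0.category(ch) == "Cn"


def strict_3_2_client(typed):
    """Refuses labels with unassigned code points (the MSIE7.0 role)."""
    if any(unassigned_in_3_2(ch) for ch in typed):
        return "invalid label"
    return query(typed.lower())


def lenient_3_2_client(typed):
    """Looks up unassigned code points unmapped (the Firefox 1.5 role)."""
    mapped = "".join(ch if unassigned_in_3_2(ch) else ch.lower() for ch in typed)
    return query(mapped)


def upgraded_client(typed):
    """Case-maps with current tables (the MSIE7.1 / Firefox 2.x role)."""
    return query(typed.lower())


for client in (strict_3_2_client, lenient_3_2_client, upgraded_client):
    print(client.__name__, "->", client(SHO_UPPER))
```

Only the lenient client returns a "not found" for a name that
actually exists, which is the false negative described above.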

My conclusion: one dare not look up a character whose status is
"unassigned" in whatever version of Unicode underpins the
libraries one is using.   One can't know if such a character
will turn out to case-map to something else.  One also can't
know that it won't have an NFKC mapping to something else or
properties that require special handling prior to lookup.   Some
of those cases can be dealt with simply by saying "it won't be
registered and therefore there is no problem" (assuming all
registries follow the rules), but some, including case-mapping,
cannot.   And, in practice, even viewing most or all NFKC and
case mappings as external to the protocol (watch for
...idnabis-issues-01 early next week) doesn't change this
situation.
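
As a rough illustration of that conclusion, the sketch below
refuses lookup whenever a label contains a code point that the
library's own tables list as unassigned. It assumes Python's
stringprep module, whose in_table_a1() tests RFC 3454 table A.1
(unassigned in Unicode 3.2). The function name safe_to_look_up() is
hypothetical. The second half uses U+1D2C, a code point of my own
choosing that I believe was added in Unicode 4.0 with a
compatibility mapping, to show the NFKC side of the problem.

```python
import stringprep
import unicodedata
from unicodedata import ucd_3_2_0  # frozen Unicode 3.2 database


def safe_to_look_up(label):
    """True only if no code point is unassigned in the 3.2 tables."""
    return not any(stringprep.in_table_a1(ch) for ch in label)


print(safe_to_look_up("example"))  # all assigned in 3.2
print(safe_to_look_up("\u03f7"))   # unassigned in 3.2

# A library stuck on 3.2 has no mapping data for a code point it
# considers unassigned, so it passes the character through unchanged.
MODIFIER_A = "\u1d2c"
old_nfkc = ucd_3_2_0.normalize("NFKC", MODIFIER_A)
new_nfkc = unicodedata.normalize("NFKC", MODIFIER_A)
print(old_nfkc == MODIFIER_A, new_nfkc)
```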

    john



