Unregistered code points and new prefixes (was: Re: sharp s(Eszett))

Sun Mar 9 06:02:42 CET 2008

Let me clarify my current thinking in this area. Some of you may think
that I contradict myself when I agree with MSIE7's processing of
non-ASCII domain names, but disagree with its processing of ASCII
domain names.

When a piece of software pre-processes a non-ASCII string and converts
it to Punycode, it is taking *responsibility* for those actions. If
the non-ASCII string contains code points that are unassigned in the
version of Unicode that the implementation supports, it must *not*
convert to Punycode (and certainly must not prepend the "xn--" prefix,
nor emit a DNS packet). Why? Because if it did emit a DNS packet, it
would risk being incompatible with other pieces of software (or even a
future version of itself) that emit a *different* DNS packet for the
same string, thereby creating an interoperability problem.
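
To make the rule above concrete, here is a minimal Python sketch of the
conversion side. It is only an illustration, not a full IDNA
implementation: it skips Nameprep mapping and every other validity
check, and the function name to_a_label is made up for this example.
It assumes that "the version of Unicode that the implementation
supports" is whatever version Python's unicodedata module ships with.

```python
import unicodedata

ACE_PREFIX = "xn--"

def to_a_label(label: str) -> str:
    """Convert one label to an A-label, refusing unassigned code points.

    Hypothetical helper for illustration only: no Nameprep mapping and
    no other IDNA checks are performed here.
    """
    if label.isascii():
        # Nothing to convert; ASCII labels are passed through untouched.
        return label
    for ch in label:
        # General category "Cn" means "not assigned" in the Unicode
        # version this Python build knows about.
        if unicodedata.category(ch) == "Cn":
            raise ValueError(
                f"U+{ord(ch):04X} is unassigned in Unicode "
                f"{unicodedata.unidata_version}; refusing to convert"
            )
    # Only now take responsibility for the conversion and the prefix.
    return ACE_PREFIX + label.encode("punycode").decode("ascii")
```

The point of raising before the prefix is added is that nothing ever
reaches the DNS for a string this implementation cannot vouch for.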

On the other hand, when that same piece of software processes an ASCII
string that may or may not be a valid A-label, it is *not* responsible
for the validity of that A-label. Some other piece of software has
converted a non-ASCII string to Punycode, and *that* piece of software
is responsible. The *only* responsibility that a piece of software
faced with an ASCII domain name has, in my opinion, is to be careful
about how it *displays* that domain name (and the rest of the URI/IRI,
if applicable).

If the ASCII domain name contains labels that turn out to have the
"xn--" prefix, and the rest of the label can in fact be decoded using
the Punycode decoding algorithm, then the implementation can check the
resulting Unicode code points, not only for validity in IDNA2003 and
IDNA200X, but also for spoofing issues, availability of glyphs in
system fonts, and so on.
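
A rough Python sketch of that display-side check, under the same
caveats as any sketch: the function name display_form and the WARNING
placeholder are invented for this example, and the only check shown is
for unassigned code points. Real spoofing and glyph-availability checks
would go where the comment indicates. Note that nothing here blocks the
lookup; it only decides what to show.

```python
import unicodedata

ACE_PREFIX = "xn--"
WARNING = "[suspicious IDN label]"  # hypothetical placeholder text

def display_form(label: str) -> str:
    """Decide what to show for one ASCII label (illustrative only)."""
    if not label.lower().startswith(ACE_PREFIX):
        # Ordinary ASCII label: show it as-is.
        return label
    try:
        decoded = label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    except ValueError:
        # UnicodeError is a subclass of ValueError; the label is not
        # decodable Punycode, so do not show the raw ASCII either,
        # since it might itself be readable.
        return WARNING
    for ch in decoded:
        # Unassigned in the Unicode version this Python build supports.
        if unicodedata.category(ch) == "Cn":
            return WARNING
    # Further checks (spoofing, glyph availability) would go here.
    return decoded
```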

If a label turns out to have invalid or dangerous sequences of code
points, some implementations may simply opt to display the original
ASCII string, with "xn--" prefix, Punycode gibberish, warts and all.
However, even that could be "dangerous" because the ASCII string may
itself be readable, e.g. xn--cocacola or xn--intel, two examples that
people often mention. So the implementation may decide to display
something else, such as a warning, or whatever.

Martin said it the way I would have, roughly, "Who are you trying to
protect, and from what?" It does not make much sense to disallow the
lookup of an invalid A-label by an IDNA-aware application, let alone
an IDNA-unaware application.

However, if the consensus of the group is to disallow invalid A-label
lookups, then I will reluctantly go along with it. It does make
migration to future versions of Unicode a bit harder than it needs to
be. Michel said it quite well too, something like "I'm not very
excited about it."

Erik

On Sat, Mar 8, 2008 at 6:11 PM, Martin Duerst <duerst at it.aoyama.ac.jp> wrote:
> I very strongly agree with what Mark is saying below.
>  I wanted to write very much the same thing, but Mark has
>  done an excellent job, most probably better than what I'd
>  have done.
>
>  I can't see any serious reason why the "client sends unknown"
>  rule of IDNA2003 should be changed, at all. The benefits
>  (listed below by Mark) are obvious. Also, as Mark says,
>  this aspect is independent of other changes that we are
>  looking at.
>
>  Regards,   Martin.
>
>
>
>  At 08:39 08/03/08, Mark Davis wrote:
>  >...
>  >>My conclusions are:
>  >>
>  >>(1) Looking up unregistered code points is untenable because it
>  >>makes moving to future versions of Unicode impossible.  That
>  >>conclusion is already reflected in IDNA200X, but IDNA2003
>  >>requires such lookups.
>  >
>  >I disagree. While I'm willing to live with John, Harald, and Patrik's decision to disallow the resolution with unassigned characters -- just so we can get this thing out the door -- we should not be basing any other decisions on thinking that it is "untenable".
>  >
>  >Consider a character X that was unassigned in Unicode 5.1, but assigned in Unicode 6.0, and see what happens. Let's suppose that a U5.1 client sends out "aXc.com" ("a" and "c" are some particular strings, not the literal U+0061 and U+0063). Before the registry upgrades to U6.0, it will fail, as expected -- it wasn't (and couldn't have been) registered.
>  >
>  >So let's look at the case where the registry has upgraded to U6.0. There are a small number of cases, and I don't see that *any* of them cause a problem.
>  >
>  >Cases:
>  >    * X is illegal according to IDNA200X rules under U6.0. The registry can't register it, so it won't work. Not a problem.
>  >    * X is legal and unaffected by normalization. This is true of the vast majority of characters. Then if the registry adds "aXc.com", then the old client will work, as expected. Not a problem -- in fact, a positive benefit.
>  >    * X is legal but affected by normalization -- but not in the context of "a...c". This is true of the vast majority of those few characters remaining from case #2. Then if the registry adds "aXc.com", then the old resolver will work. Not a problem -- in fact, a positive benefit.
>  >    * X is legal, and affected by normalization, in the context of "a...c". For example, suppose that string a ends with a non-spacing mark that reorders with X in NFC. In that case, "aXc.com" would not be legal, and could not be registered. So even in this rare case, not a problem.
>  >John, if you think this situation is untenable, which of the above cases causes a problem, and exactly what would that problem be?
>  >
>  >Mark
>  >
>  >
>  >_______________________________________________
>  >Idna-update mailing list
>  >Idna-update at alvestrand.no
>  >http://www.alvestrand.no/mailman/listinfo/idna-update
>
>
>  #-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
>  #-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst at it.aoyama.ac.jp