Unregistered code points and new prefixes (was: Re: sharp s (Eszett))

John C Klensin klensin at jck.com
Fri Mar 7 21:05:48 CET 2008



--On Friday, 07 March, 2008 10:56 -0800 Erik van der Poel
<erikv at google.com> wrote:

> John,
> 
> Previously, I reported that MSIE7 refused to look up domain
> names with U+03F7 or U+03F8 in them, and I stated my opinion
> that MSIE7 was doing the right thing, because those 2
> characters were unassigned in Unicode 3.2. Implementors cannot
> predict the case-folding relationships between code points
> before they are assigned, so they should refrain from looking
> up domain names with characters that they don't know about,
> otherwise they might send out the wrong A-label (different from
> a future implementation).

This unpredictability of future relationships is the major
reason why IDNA200X prohibits looking up any code point that is
unassigned in whatever version of Unicode is being used by the
application doing the lookup.
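
A minimal sketch of that lookup rule, assuming Python's standard
unicodedata module as a stand-in for the application's Unicode tables.
The function names are hypothetical, and the check covers only the
"unassigned" test, not the rest of the IDNA200X lookup rules.

```python
import unicodedata


def has_unassigned(label, ucd=unicodedata):
    """Return True if any code point in `label` has category Cn
    (unassigned) in the given Unicode Character Database."""
    return any(ucd.category(ch) == "Cn" for ch in label)


def may_look_up(label, ucd=unicodedata):
    """Refuse lookup when the label contains a code point the
    application's own Unicode version does not know about."""
    return not has_unassigned(label, ucd)


# The same label judged against two table versions: the frozen
# Unicode 3.2 database that Python ships for IDNA2003, and the
# interpreter's current database.
unicode_3_2 = unicodedata.ucd_3_2_0
print(may_look_up("\u03f7", unicode_3_2))
print(may_look_up("\u03f7"))
```

An application built on the 3.2 tables refuses U+03F7, while one built
on later tables accepts it and can apply whatever mappings that later
version defines.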

>  Here is an example of a piece of
> HTML that I tried at that time:
> 
> <a href="http://&#x3F7;.com/">
> 
> However, now I have discovered that MSIE7 even refuses to look
> up domain names containing those characters when they are in
> A-label form! 

I believe that ought to be correct behavior if MSIE7 were an
implementation of the IDNA2003 (i.e., Unicode 3.2) tables but
using the more restrictive IDNA200X lookup rules.  A strict
reading of the IDNA2003 rules, under which code points unassigned
in Unicode 3.2 are expected to be looked up anyway, says that
this should be looked up.

Of course, looking it up gets us into a mess, because one can
push any value between U+0000 and U+10FFFF, including U+03F7,
through punycode and get a string even though one would, by
projecting the Stringprep rules forward, expect that to map into
U+03F8.
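
The mess can be sketched as follows, assuming Python's built-in
"punycode" codec and using str.casefold() as a rough stand-in for the
case-folding step that a Stringprep profile built on later Unicode
tables would apply. The helper name a_label is hypothetical.

```python
def a_label(u_label):
    """Encode a Unicode label with Punycode and add the ACE prefix.
    No mapping or validity checking is done here."""
    return "xn--" + u_label.encode("punycode").decode("ascii")


sho_upper = "\u03f7"  # unassigned in Unicode 3.2

# An implementation that passes the unknown code point straight through.
raw = a_label(sho_upper)

# An implementation whose tables know the code point and fold it first.
folded = a_label(sho_upper.casefold())

print(raw, folded, raw == folded)
```

The two implementations emit different A-labels for the same user
input, and only the folded one matches the xn--nza label quoted in
this thread.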

> Needless to say, my opinion of MSIE7 has now
> changed drastically. This time, I tried:
> 
> (1) <a href="http://xn--nza.com/">
> (2) <a href="http://xn--ngb7d.xn--mgbbgcw7khi2840d.xn--mgba3a4f16a.ir/">
> (3) <a href="http://xn--strae-oqa.com/">
> (4) <a href="http://xm--strae-oqa.com/">
> 
> (1) has U+03F8 in it (a lower-case letter), (2) has U+200C
> (ZWNJ) in it and I found it in the lower left corner of
> http://www.nic.ir/List_of_Resellers and (3) has U+00DF
> (Eszett) in it. None of these worked in MSIE7. You can click
> on them, but no DNS packet is emitted.

See below.

> On the other hand, (4) did work. This also has Eszett in it,
> but the prefix has been changed to "xm--".

That, IMO, should be a clear bug.  The standard is, I think,
clear that labels starting with prefixes (two-character pairs
before "--") are prohibited if they are not "xn".  So either
that string should not be looked up at all (which is what I
think IDNA2003 says and IDNA200X definitely says) or, at best,
"xm--strae-oqa" should be treated as a very odd-looking label
with no IDN implications at all (i.e., it can't "have Eszett in
it").
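
A rough sketch of that distinction, assuming Python's built-in
"punycode" codec. The function names and the three category strings
are hypothetical, and the prefix test is a simplification of what the
standard says about labels with "--" in the third and fourth
positions.

```python
def classify_label(label):
    """Classify a DNS label by its two-character prefix before '--'."""
    lower = label.lower()
    if lower.startswith("xn--"):
        return "a-label"
    if len(lower) >= 4 and lower[2:4] == "--":
        # Some other two-character prefix: not an IDN, and arguably
        # not something to look up at all.
        return "reserved-prefix"
    return "ordinary"


def contains_eszett(label):
    """Only a real A-label is decoded; anything else has no IDN
    interpretation and so cannot 'contain' U+00DF."""
    if classify_label(label) != "a-label":
        return False
    decoded = label[4:].encode("ascii").decode("punycode")
    return "\u00df" in decoded


for name in ("xn--strae-oqa", "xm--strae-oqa"):
    print(name, classify_label(name), contains_eszett(name))
```

Under this reading, "xm--strae-oqa" is either rejected outright or
handled as an opaque ASCII label, and is never run through the
Punycode decoder.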
 
> Since MSIE7 is so widely installed, to me, this means that we
> probably have to reopen the discussion of whether we will
> switch to a new prefix.

I draw a somewhat different conclusion, only partially because,
as Vint says, a new prefix would be very painful.  A new prefix
would make every in-DNS IDN out there today incompatible and
would almost certainly require a slow and painful fallback
lookup process for many years if not forever... not very
attractive, even if we could do it.

My conclusions are:

(1) Looking up unassigned code points is untenable because it
makes moving to future versions of Unicode impossible.  That
conclusion is already reflected in IDNA200X, but IDNA2003
requires such lookups.

(2) When we do things that are untenable, the odds of
implementations simply ignoring us and doing what they consider
The Right Thing are very high.  This particular example with
MSIE7 is, I think, the best one so far, but we have others.
The much more restrictive rules of IDNA200X are intended,
long-term, to give us more latitude because it would be
possible, if painful, to change a prohibited (Disallowed) code
point to be Protocol-Valid without thereby creating an ambiguity
in coding of labels "before" and "after".

(3) The IDNA200X proposals make the assumption that there is a
collection of cases that violate at least one of

	(i) the clear intent of IDNA2003
	
	(ii) the clear intent of either the cautionary notes in
	the IESG Statement about IDNs, the general guidance of
	the ICANN Guidelines, or both
	
	(iii) Good sense, including thoughtful application of
	the robustness principle and moving outside IDNA2003
	where its provisions cannot be reasonably implemented in
	practice without causing other problems.

Some of these cases represent legitimate, albeit possibly
misguided, uses.  Others represent defensive registrations, and
still others represent behavior that is either deliberately
excessively cute or malicious.   They are very hard to tell
apart, even by walking the DNS tree, inspecting the records of
various registries, or inspecting a large corpus of documents.

We believe that these cases are sufficiently problematic that
the right course of action is to minimize the degree to which
they can arise in the future and then to work out the transition
or compatibility questions for existing registrations and
applications.   In some cases, that will imply that registries
will either need to adopt variant techniques or prohibit some
registrations that the standard will permit in order to avoid
ambiguities.  In others, we may end up having to take the
position that the registrations are simply not going to work any
more (if it were even the case that they worked reliably now).
That "so sorry... you pushed the boundaries of the rules and you
now lose" scenario is going to be a difficult one, but the
choice is between invalidating a few existing domains and
going the "next prefix" route and invalidating everything.

If we want viable IDNs -- usable as part of high-quality
identifiers and able to support the whole range of applications
that depend on, and infrastructure built on top of, the DNS -- we
need to get this right and to do so now.  And that may require
cleaning out a certain amount of cruft rather than saying "cruft
now and cruft forever" (which is clearly one of the
alternatives).

    john


