looking up domain names with unassigned code points

John C Klensin klensin at jck.com
Sat May 10 20:46:07 CEST 2008

Erik (and others),

I've been silent on this because I'm not sure what the right
answer is.  Just to be sure, we are talking _only_ about
situations in which the domain name (or URI) presented to the
application is already in Punycode form (i.e., it is a putative
A-label), and not something that is to be converted to an
A-label by that application.   I believe that the situation in
which a Unicode string is presented to the application is _very_
different.

That uncertainty is driven by two conclusions:

	* If we were to insist that the punycode form be checked
	to determine whether it contains unassigned (or
	DISALLOWED) code points, there is no possible way that
	IDNA-unaware applications could comply.  Those
	applications simply don't know that the punycode string
	is anything but an LDH domain name and there is no way
	that we can specify that a subset of those domain names
	be processed in some special way.
	* For IDNA-aware applications, I believe that,
	regardless of what we say, some application implementers
	are going to do what they think best for their users.
	There are strong arguments in both directions, driven by
	user safety, performance, code sequence patterns,
	assumptions about how frequently updates will occur, and
	a collection of other considerations, many (or most) of
	which have already been discussed on this list.
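To illustrate the first point, here is a minimal sketch in Python of what an IDNA-unaware application effectively sees. The helper names (`is_ldh_label`, `has_ace_prefix`) and the regular expression are assumptions made for illustration and are not taken from any RFC text or implementation; the point is only that a putative A-label passes an ordinary letters-digits-hyphen check like any other host label.

```python
import re

# Hypothetical helper: a conventional LDH (letters, digits, hyphen) label
# check of the kind an IDNA-unaware application might apply.
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_ldh_label(label: str) -> bool:
    """True if the label is 1-63 LDH characters with no leading/trailing hyphen."""
    return LDH_LABEL.fullmatch(label) is not None


def has_ace_prefix(label: str) -> bool:
    """True if the label starts with the IDNA ACE prefix "xn--" (any case)."""
    return label.lower().startswith("xn--")


# Labels taken from the examples quoted later in this message.
for label in ("xn--nza", "xn--strae-oqa", "xm--strae-oqa", "com"):
    # An IDNA-unaware application only ever evaluates the first column.
    print(label, is_ldh_label(label), has_ace_prefix(label))
```

Nothing in the LDH check distinguishes the "xn--" labels from "com", which is why such an application cannot be asked to treat them specially.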

Since I believe that facing reality is generally good, I suggest
that we:

	* Add a section to "protocol" that discusses this case.
	* Specify that the application MAY convert the putative
	A-label to a U-label, make the check, and reject if
	UNASSIGNED or DISALLOWED characters are found.
	* Discuss the tradeoffs as advice about how applications
	should make the decision.

Is that plausible?  I think it is consistent with several of the
suggestions that have been made, especially those that say that
this one is ultimately an implementation decision.
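
As a rough sketch of the optional check suggested above (convert the putative A-label to a U-label, then reject on UNASSIGNED or DISALLOWED characters), something like the following Python could serve. Several things here are assumptions: the function name and return values are hypothetical, "unassigned" is approximated by Unicode general category Cn in whatever Unicode version the Python interpreter ships with, and the DISALLOWED set is left as a caller-supplied placeholder because the real set would come from the IDNA tables, which are not reproduced here.

```python
import unicodedata

ACE_PREFIX = "xn--"

# Placeholder only: the real DISALLOWED set would be derived from the
# IDNA tables. It is empty here so the sketch stays self-contained.
EXAMPLE_DISALLOWED = frozenset()


def check_putative_a_label(label, disallowed=EXAMPLE_DISALLOWED):
    """Classify a label as 'not-a-label', 'invalid-punycode', 'reject' or 'ok'."""
    if not label.lower().startswith(ACE_PREFIX):
        # No ACE prefix: treated as an ordinary LDH label, not checked here.
        return "not-a-label"
    try:
        # Decode the part after the prefix with Python's built-in
        # "punycode" codec to obtain the candidate U-label.
        u_label = label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    except (UnicodeError, ValueError):
        return "invalid-punycode"
    for ch in u_label:
        # Category "Cn" means "not assigned" in the interpreter's Unicode data.
        if unicodedata.category(ch) == "Cn" or ch in disallowed:
            return "reject"
    return "ok"
```

Note that the result for a given label can change when the Unicode data underneath is updated, which is one of the tradeoffs (how frequently updates will occur) that the advice text would need to discuss.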


--On Saturday, May 10, 2008 8:22 AM -0700 Erik van der Poel
<erikv at google.com> wrote:

>> Given the security fuss with the introduction of IDNA2003,
>> the browsers opted to permit only the permitted names and
>> exclude the "illegal" ones, which seems like a sensible
>> approach given the negative feedback.
> When you say "the browsers", which ones do you mean? I tested
> IE7 and Firefox2 with the following domain names that are
> *already* in Punycode, and IE7 refused to look up the first 3
> (did not emit a DNS packet according to the sniffer), while
> Firefox2 looked up all of them:
> (1) <a href="http://xn--nza.com/">
> (2) <a href="http://xn--ngb7d.xn--mgbbgcw7khi2840d.xn--mgba3a4f16a.ir/">
> (3) <a href="http://xn--strae-oqa.com/">
> (4) <a href="http://xm--strae-oqa.com/">
> (1) has U+03F8 in it (a lower-case letter introduced in
> Unicode 4.0), (2) has U+200C (ZWNJ) in it and I found it in
> the lower left corner of http://www.nic.ir/List_of_Resellers
> (this character is being proposed for IDNA200X) and (3) has
> U+00DF (Eszett) in it (also discussed recently).
> (4) also has Eszett in it, but the prefix has been changed to
> "xm--". (I don't want to introduce another prefix, though.)
>> It's also completely unclear to me where the standard says
>> that one should assume Punycode is safe and just use it.  On
>> the contrary, I recall that there were words disallowing
>> illegal xn-- constructs that weren't valid punycode (granted
>> Punycode is a superset of IDNA, but still.)
> As far as I know, the only part of RFC 3490 that touches on
> anomalous xn-- constructs is steps 3 to 7 of section 4.2:
> http://www.ietf.org/rfc/rfc3490.txt
> Those steps are part of ToUnicode, which is about display, not
> lookup. Does anyone else know of a place in the IDNA2003 RFCs
> that specifies whether or not lookup of labels that are
> already in Punycode is allowed?
> I think IDNA200X should specify whether that is allowed, and
> give clear reasons for the choice, so that client implementors
> don't second-guess the RFC authors.
>>> Now, as we try to accommodate ZWJ and other characters in
>>> IDNA2008, we find that we can no longer assume that those
>>> LDH characters will guarantee that old software will look up
>>> the domain name. In a sense, IE7 missed one of the main
>>> points of the design of IDNA2003.
>> That's sort of irrelevant at this point :)  IE uses the
>> normalization component, which I expect to be updated fairly
>> soon after a new spec is written.  Unless the new standard
>> goes beyond the Punycode form and mapping/normalization steps
>> in 2003, I'm hoping that we can just swap out the component.
>> Of course some users won't get the benefit for some time,
>> but I'm hopeful that a large number of users can take
>> advantage of the new standard within a reasonable period
>> after its release.
> I guess time will tell how smooth the transition from IDNA2003
> to IDNA200X is.
> Erik