looking up domain names with unassigned code points

Mon May 12 21:23:43 CEST 2008

--On Monday, 12 May, 2008 09:30 -0700 Shawn Steele
<Shawn.Steele at microsoft.com> wrote:

>> No. For example, xn--en32g would produce U+110000, which is
>> outside the range of valid code points. (The highest code
>> point is U+10FFFF.)
> 
>> If an app receives such a punycode string, it should not
>> attempt to display the corresponding Unicode (since it is
>> invalid). I'm guessing that we can all agree on that. :-)
> 
> Well, it does indicate that *some* validation of the resulting
> Unicode string is necessary.  What happens if there's a U+0020
> or U+0007 embedded in it?

While I would prefer some validation if a putative A-label is
presented to the application (but continue to believe that
should be an implementation choice), I don't think this changes
the answer.  If the application gets a string that starts with
"xn--", it MAY just ignore IDNA and look the thing up and there
is no requirement that it ever convert it to a native character
("Unicode") form (or to try to do so).   If, on the other hand,
it intended to make the conversion at some point, I'd think it
would be lots better to try to make it before the lookup.  Doing
so effectively invokes IDNA and, if the string contains bad
characters  (U+0020 or U+0007 as well as DISALLOWED or
UNASSIGNED ones), then the user should get an error message and
the string not be looked up.
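
As a rough illustration of the convert-before-lookup check described
above, here is a minimal Python sketch.  The function name
decode_a_label is hypothetical, and the General_Category test is only
a crude stand-in for the real IDNA DISALLOWED/UNASSIGNED tables, which
it does not implement.  It relies on the standard library "punycode"
codec, which refuses output beyond U+10FFFF.

```python
import unicodedata

def decode_a_label(label):
    """Return the Unicode form of a putative A-label, or raise ValueError.

    Hypothetical helper: an application would call this before the DNS
    lookup and report an error to the user instead of looking up a bad name.
    """
    if not label.lower().startswith("xn--"):
        raise ValueError("not a putative A-label")
    # Raises UnicodeError (a ValueError subclass) for malformed input or
    # for code points above U+10FFFF, as with "xn--en32g".
    text = label[4:].encode("ascii").decode("punycode")
    for ch in text:
        category = unicodedata.category(ch)
        # Cc = control (U+0007), Zs = space (U+0020), Cn = unassigned.
        # Assumption: this approximates, but is not, the IDNA tables.
        if category in ("Cc", "Zs", "Cn"):
            raise ValueError("bad code point U+%04X" % ord(ch))
    return text
```

With this sketch, decode_a_label("xn--en32g") raises ValueError, so
the name would never reach the resolver.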

As I've said before, I don't see much choice.  If a U-label is
presented to a non-IDNA-aware application that does any checking
at all, it will be rejected as a syntax error.  If it is not so
rejected, it presumably won't be found on lookup since a
non-IDNA-aware application would not be able to perform the
U-label to A-label conversion.   However, if an A-label is
presented to that application, it is certainly going to be
treated as a conventional LDH label and looked up.  Since it is
a conventional LDH label, no one is going to try to convert it
to a native character string.
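
To make the asymmetry concrete, here is a small Python sketch of the
kind of syntax check a non-IDNA-aware application might apply.  The
pattern and the name is_ldh_label are assumptions for illustration,
not taken from any particular implementation.

```python
import re

# Hypothetical pre-IDNA hostname-label check: ASCII letters, digits and
# hyphen only, 1 to 63 characters, no leading or trailing hyphen.
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")

def is_ldh_label(label):
    """Return True if label is a conventional LDH label."""
    return LDH_LABEL.fullmatch(label) is not None
```

Under this check a U-label such as "münchen" is rejected as a syntax
error, while the A-label "xn--mnchen-3ya" passes and would simply be
looked up.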

> Note that on the client side it would be required to convert
> and display the Unicode string if lookup actually succeeds.
> xn--asdfasdf isn't acceptable from the "we want our users to
> know what they're seeing" crowd.
> 
> If the client is required to display a successfully resolved
> string, then there doesn't seem to be much point in
> disallowing smiley face at this (client) level, since anything
> with a smiley that resolves would be displayed.  That would
> put the disallowed character tests at the registration level.

There is no way to require a client to do that.  We know that
experimentally, since IDNA2003 contains exactly that requirement
(although not stated that way).  To say that requirement has
been widely ignored would be incorrect: it has been
deliberately and systematically violated in the interest of
protecting users.  We also know what clients do now.  For an
example with which you are presumably familiar, IE7 would
presumably convert the punycode form to a Unicode one, scan it,
and, upon discovering an unassigned character or smiley face,
would compare those code points to the code points associated
with the languages the user had configured and then decide to
display the punycode form (unless it permits smiley faces
regardless of the languages configured).   As I understand it,
the decision was that, even if 'xn--asdfasdf isn't acceptable
from the "we want our users to know what they're seeing" crowd',
neither a native-character string in a user-unknown script nor a
row of little boxes is acceptable to the "we have to offer
reasonable protection to our users" crowd.
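
The display decision sketched above might look roughly like the
following in Python.  This is a guess at the general shape of such
logic, not a description of what IE7 or any other browser actually
does.  The name display_form and the allowed_ranges parameter (code
point ranges standing in for "the languages the user had configured")
are hypothetical.

```python
import unicodedata

def display_form(a_label, allowed_ranges):
    """Choose which form of a resolved label to show the user.

    allowed_ranges: hypothetical list of (first, last) code point pairs
    covering the scripts of the user's configured languages.
    Returns the Unicode form if every character is acceptable, else the
    original punycode form.
    """
    if not a_label.lower().startswith("xn--"):
        return a_label
    try:
        text = a_label[4:].encode("ascii").decode("punycode")
    except ValueError:
        # Malformed punycode or a code point above U+10FFFF.
        return a_label
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            continue  # simplification: ASCII is always shown as-is
        if unicodedata.category(ch) == "Cn":
            return a_label  # unassigned: would render as a little box
        if not any(first <= cp <= last for first, last in allowed_ranges):
            return a_label  # script the user is not expected to know
    return text
```

For example, with allowed_ranges set to [(0x00C0, 0x024F)] (Latin
letters beyond ASCII), "xn--mnchen-3ya" is displayed as "münchen",
while a label containing a smiley face or a Cyrillic letter stays in
its punycode form.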

> I expected some disagreement with my assertion that some
> protocols/users will require the Unicode form, so therefore
> the benefit of looking up punycode is limited to some
> specific scenarios, probably leading to inconsistent
> experiences with "new" names.

I'm not sure I understand this.  Applications that have not been
upgraded to understand IDNA are going to look those punycode
forms up because they don't know any better _and_ because those
forms are perfectly good LDH labels.  That was a major part of
the IDNA design and some other alternatives were sacrificed to
get it.  And, of course, some protocols and users will "require
the Unicode form".  Regardless of what may be found embedded in
files or referrals of some sort, I'd certainly encourage a user
interface designer to consider whether direct typing of the
A-label form should be associated with warnings and/or "expert
user" configuration.

       john