looking up domain names with unassigned code points

Mon May 12 17:22:48 CEST 2008

On Sat, May 10, 2008 at 12:59 PM, John C Klensin <klensin at jck.com> wrote:
>
>
> --On Saturday, May 10, 2008 12:17 PM -0700 Erik van der Poel
> <erikv at google.com> wrote:
>
>> John,
>>
>> Thanks for responding. I'm not sure what the right answer is,
>> either. Yes, I was referring to domain names and URIs that are
>> already in Punycode form, and I agree that the situation in
>> which the app receives the Unicode form is very different
>> (primarily because of the unassigned code point issue).
>>
>> I also agree with your suggestions below. My main concern with
>> simply letting apps look up domain names that are already in
>> Punycode form is that some apps may also blindly convert the
>> Punycode to Unicode for display, without checking for
>> dangerous characters like U+2044 FRACTION SLASH. The TLD
>> registries are under a certain amount of pressure to only
>> register "safe" names, but at lower levels of the DNS, there
>> is very little pressure and practically zero enforcement.
>
> Yes, and that is a major concern. But see below.
>
>> However, I don't know how comfortable you and others in the
>> working group are about writing advice regarding display
>> issues in the IDNA200X RFCs.
>
> I think we get out into dangerous territory if we give more than
> general advice about display and I think some will argue that we
> should not do even that.   But I don't see that as an issue in
> this case.
>
> There will be an issue in getting the wording right (for which I
> will certainly need help).   But I think that a "MAY treat the
> putative A-label as opaque" rule can be written to give the
> implementation a choice between opaque or not.  So, e.g.,
>
>        * If you decide to treat it as opaque, you look it up
>        without inspecting its contents but don't, ever, convert
>        it to a U-label.
>
>        * If you do decide to convert it to a U-label, then it
>        isn't opaque, it must be valid as a U-label (and hence
>        as an A-label).  Obviously, if it contains DISALLOWED or
>        UNASSIGNED characters, or even CONTEXT-required
>        characters that don't follow whatever rules need to be
>        followed for lookup, then you need to treat it as
>        invalid for lookup and tell the user whatever you tell
>        the user under such circumstances.
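
As a rough illustration of the two choices in the quoted text above (this is
not taken from the drafts), here is a minimal Python sketch. The function
name classify_a_label and the set HYPOTHETICAL_DISALLOWED are made up for
this example; the set is only a stand-in for the real IDNA200X tables, which
are not reproduced here, and CONTEXT rules are not modelled at all. The
"unassigned" test relies on whatever Unicode version the running Python
ships in its unicodedata module.

```python
import unicodedata

ACE_PREFIX = "xn--"

# Hypothetical stand-in for the DISALLOWED category; the real tables are
# far larger. U+2044 FRACTION SLASH is the example from this thread.
HYPOTHETICAL_DISALLOWED = {"\u2044"}


def classify_a_label(label, treat_as_opaque):
    """Return (decision, u_label) for a putative A-label.

    decision is "lookup-opaque", "lookup", or "reject". u_label is only
    set when the label was converted and passed the checks.
    """
    if not label.lower().startswith(ACE_PREFIX):
        raise ValueError("not a putative A-label: %r" % label)

    if treat_as_opaque:
        # Opaque choice: look it up without inspecting its contents,
        # and never produce a U-label from it.
        return ("lookup-opaque", None)

    # Non-opaque choice: the label must convert to a valid U-label.
    try:
        u_label = label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    except UnicodeError:
        return ("reject", None)

    for ch in u_label:
        # General category "Cn" means unassigned in this Python's
        # Unicode database.
        if unicodedata.category(ch) == "Cn" or ch in HYPOTHETICAL_DISALLOWED:
            return ("reject", None)

    return ("lookup", u_label)


if __name__ == "__main__":
    print(classify_a_label("xn--bcher-kva", treat_as_opaque=False))
    print(classify_a_label("xn--bcher-kva", treat_as_opaque=True))
```

In this sketch an opaque label is looked up even if it encodes unassigned or
disallowed code points, while the converting path refuses both the lookup
and the conversion for the same label.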

Having re-read this proposal and thought about it some more, it
appears that it does not really allow for the easier transitions that
I have been talking about. If IDNA-aware clients are permitted to look
up labels that are already in Punycode and contain unassigned or
disallowed code points, then it would be somewhat easier to make
transitions in the future, e.g. from unassigned to pvalid or from
disallowed to pvalid. Such transitions would be easier because people
who want to start using the newly pvalidated characters can use them
in Punycode form and be assured that IDNA-aware clients will at least
look them up, thereby providing minimal functionality.

If I may argue the "other side" of this, one reason that we don't want
to allow clients to look up labels with disallowed characters is to
deter zone operators from registering such labels. One might also
argue that Unicode 5.1 is really quite mature now, and that future
assignments will be less and less interesting from the IDN
perspective, thereby making easy transitions less important. If the
yes/no/maybe discussion for historic scripts makes good progress, we
may also end up having fewer reasons to move characters from
disallowed to other categories.

In order to convince implementors that our rules are worth following,
I believe we will need reasons that are clear and convincing enough to
put such things as symbols in the disallowed set. Otherwise, implementors might
decide not to follow our rules for some scripts or character types,
and we would end up with less interoperability.

So, I'm going to try to find time to take another good look at the
long(!) drafts to see if our reasoning is convincing enough. It might
also be good to simplify and shorten the drafts somehow. Implementors
are less likely to read and understand our reasons if they are buried
in long and hard-to-understand documents. This is not to say that John
hasn't been doing a great job -- on the contrary, this is very hard
work because of all of the complexities, and these drafts tend to grow
"organically", sometimes without removing parts that need to be
removed.

One paragraph, in particular, probably needs attention:

5.  Domain Name Resolution (Lookup) Protocol

   Resolution is conceptually different from registration and different
   tests are applied on the client.  Although some validity checks are
   necessary to avoid serious problems with the protocol (see
   Section 5.4 ff.), the resolution-side tests are more permissive and
   rely heavily on the assumption that names that are present in the DNS
   are valid.  Among other things, this distinction, applied carefully,
   facilitates expansion of the permitted character lists to include new
   scripts and accommodate new versions of Unicode without introducing
   ambiguity into domain name processing.

Some might interpret this to mean that it is OK to look up labels with
unassigned characters. Or am I misunderstanding this?

Erik

> Note that the issue here isn't about display, it is about valid
> conversions between things that are supposed to be A-labels and
> the corresponding U-labels.
>
> Does that model help?
>
>    john
>
>
>