looking up XN-labels with unassigned characters

Mark Davis mark at macchiato.com
Mon Mar 23 18:43:34 CET 2009


I agree that it would be nice, and it would be equally nice to look up any
Unicode-Label*. That is, if there is any reason to be suspicious of a
Unicode label, then we should be equally suspicious of the corresponding
XN-Label. If XN-Labels are presumed to be safe, then we have no reason to
mistrust the corresponding Unicode label.

It may very well be that there is a particular set of cooperating
systems, with a gatekeeper, and that anything in XN-Label form has been
validated as being an A-Label. And in that case, it clearly doesn't make any
sense to validate over and over again. But you'd better validate on input to
that set of systems, otherwise you have no guarantee that it is in fact an
A-Label.

Similarly, you could have a particular set of cooperating systems, with a
gatekeeper, in which anything in Unicode-Label form has been validated as
being a U-Label. And in that case, it clearly doesn't make any sense to
validate over and over again. But you'd better validate on input to that set
of systems, otherwise you have no guarantee that it is in fact a U-Label.
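
The gatekeeper idea can be sketched in code. What follows is a minimal
Python sketch, not a full IDNA validator: it only checks that an XN-label
round-trips through Punycode, which is a necessary but not sufficient
condition for being an A-Label. The name is_plausible_a_label is
hypothetical, and the real protocol checks (code point categories,
contextual rules, bidi) are deliberately left out.

```python
def is_plausible_a_label(label: str) -> bool:
    """Partial gatekeeper check (assumption: round-trip only, no IDNA rules).

    Returns True if `label` starts with "xn--", decodes as Punycode to a
    string containing at least one non-ASCII character, and re-encodes to
    exactly the same XN-label.
    """
    lower = label.lower()
    if not lower.startswith("xn--"):
        return False
    try:
        unicode_form = lower[4:].encode("ascii").decode("punycode")
    except ValueError:  # UnicodeError is a subclass of ValueError
        return False
    if unicode_form.isascii():
        # An all-ASCII label has no business being in XN form.
        return False
    round_trip = "xn--" + unicode_form.encode("punycode").decode("ascii")
    return round_trip == lower


# Validate once, on input to the set of cooperating systems.
for candidate in ["xn--df-7ia", "xn--bc-iia", "example", "xn--abc-"]:
    print(candidate, is_plausible_a_label(candidate))
```

Inside the trusted set of systems the check would not be repeated; the
point is only that it has to happen somewhere on the way in.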

Mark


On Mon, Mar 23, 2009 at 10:09, Erik van der Poel <erikv at google.com> wrote:

> Hi Mark,
>
> I was referring to your comment:
>
> "if it is important to check those requirements, then it is important
> to test both A and U Labels; if it is not important to test them, then
> it should not be a requirement for either one."
>
> All I'm saying is that it would be nice if a client could look up any
> label that starts with "xn--". This would allow e.g. Google Search to
> convert legal U-labels to legal A-labels, and it would allow the
> browser to follow the links, even if the browser implemented an old
> version of IDNA.
>
> It would then be up to the browser to display the underlying Unicode
> string in a safe way. Note that browsers are already expected to apply
> certain restrictions when displaying the domain name.
>
> I do agree with your point that, now that we are at Unicode 5.1,
> future additions to Unicode are not quite as impactful to the DNS.
> We're in the "long tail".
>
> Erik
>
> On Mon, Mar 23, 2009 at 9:38 AM, Mark Davis <mark at macchiato.com> wrote:
> > First off, I have not been pushing for allowing UNASSIGNED on lookup in
> > IDNA2008. This is for two reasons:
> >
> > 1. We have had many Unicode versions since 3.2, so the urgency is not as
> >    prominent.
> > 2. Because IDNA2008 updates more regularly, there is less need.
> >
> > What I have been saying is that allowing UNASSIGNED on lookup wouldn't make
> > a difference, and that's the case even if a character maps to ".".
> >
> > Let's take a specific example: àbc͸dèf.com, where the middle character,
> > \u0378, is currently unassigned as far as the client is concerned (because
> > it is back-reved), while the registry is on Unicode 6.0. The XN form is
> > xn--bcdf-zna5c481a.com.
> >
> > Here's what happens when the client software (browser, emailer, etc.) looks
> > the domain name up, depending on what \u0378 turns into under 6.0.
> >
> > 1. \u0378 becomes DISALLOWED. No problem. No conformant registry can support
> >    it, even on Unicode 6.0; the lookup is denied.
> > 2. \u0378 becomes PVALID. No problem - the lookup works.
> > 3. \u0378 becomes mapped to X (assuming we allow mapping on lookup):
> >
> >    3.1 X is DISALLOWED, say "$". No problem. No conformant registry can
> >        support it, even on Unicode 6.0; the lookup is denied.
> >    3.2 X is PVALID, say "X". The lookup fails. The remapped domain name
> >        would work as xn--bcxdf-qqa4d.com, but the original URL would not
> >        work until the client is updated, or unless the user learns to type
> >        X instead until s/he updates his/er client.
> >    3.3 X is ".". The lookup fails. The remapped domain name would work as
> >        xn--bc-iia.xn--df-7ia.com, but the original URL would not work until
> >        the client is updated, or unless the user learns to type X instead
> >        until s/he updates his/er client.
> >
> > Whether the character maps to a dot or not in Unicode 6.0 doesn't make any
> > difference in the scenario. It just fails the lookup in a different way (3.3
> > instead of 3.2), but the lookup fails in either case.
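
The dot case can be made concrete with a short sketch. It assumes a
hypothetical encode_name helper built on Python's standard punycode codec;
the mapping table that sends \u0378 to "." is purely an assumption standing
in for what a future Unicode version might do, not an actual Unicode
mapping.

```python
from typing import Dict, Optional


def to_xn(label: str) -> str:
    """Return the XN form of one label; ASCII labels pass through unchanged."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


def encode_name(name: str, mapping: Optional[Dict[str, str]] = None) -> str:
    """Apply an optional character mapping, split on ".", encode each label.

    `mapping` is a hypothetical stand-in for whatever mapping step the
    client's IDNA/Unicode version would apply before lookup.
    """
    mapping = mapping or {}
    mapped = "".join(mapping.get(ch, ch) for ch in name)
    return ".".join(to_xn(label) for label in mapped.split("."))


NAME = "\u00e0bc\u0378d\u00e8f.com"

# Back-reved client: \u0378 is unassigned, so it stays inside one label.
old_client = encode_name(NAME)
# Updated client, under the assumed mapping of \u0378 to ".".
new_client = encode_name(NAME, {"\u0378": "."})

print(old_client)
print(new_client)
```

The two clients end up asking the DNS for different names, one with a
single long label and one with two shorter labels, so the old client's
lookup fails whichever way the new character is eventually handled.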
> >
> > Mark
> >
> > On Sun, Mar 22, 2009 at 17:00, Erik van der Poel <erikv at google.com> wrote:
> >>
> >> Hi again James, thank you for the email. I am quite aware of the dot
> >> issues in IDNA. I have first-hand experience with Japanese input
> >> methods and their modes, and I understand the motivation for the
> >> addition of non-ASCII dot processing in IDNA2003.
> >>
> >> The issue with U+2CFE COPTIC FULL STOP is a bit subtle, so let me
> >> explain. U+2CFE was added in Unicode 4.1. This means that, from the
> >> point of view of an IDNA2003 implementation, it is simply an
> >> unassigned character. Let's say we have a domain name like:
> >>
> >> aaa <U+2CFE> bbb . com
> >>
> >> Suppose that aaa and bbb are Coptic characters, and the typist
> >> happened to have a Coptic input method (though I have no idea whether
> >> such things exist!). Further, let's suppose that the client is using
> >> IDNA2003 with the flag "allow unassigned" set to true. If aaa and bbb
> >> are already lower-case, the client will do the right thing with them
> >> (leaving them as is). However, the client will not know that U+2CFE is
> >> a new dot-like character, so it will treat the entire sequence
> >> "aaa<U+2CFE>bbb" as a single label. It will then encode it in Punycode
> >> (including the dot-like character), and try to resolve that in DNS.
> >>
> >> Of course, this will not work because the intention was to resolve
> >> aaa.bbb.com, not aaa<U+2CFE>bbb.com. In other words, a new client and
> >> an old client would resolve this name differently.
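
The difference between the two clients comes down to which characters they
treat as label separators. The sketch below uses the four separators
defined by IDNA2003 (U+002E, U+3002, U+FF0E, U+FF61); the "new client" that
also splits on U+2CFE is hypothetical, as is the split_labels helper, and
the Coptic letters are arbitrary placeholders for aaa and bbb.

```python
import re

# Label separators recognized by IDNA2003.
IDNA2003_DOTS = "\u002e\u3002\uff0e\uff61"
# Added in Unicode 4.1, so unassigned from an IDNA2003 point of view.
COPTIC_FULL_STOP = "\u2cfe"


def split_labels(name, dots=IDNA2003_DOTS):
    """Split a domain name into labels on any character in `dots`."""
    return re.split("[" + re.escape(dots) + "]", name)


# "aaa <U+2CFE> bbb . com" with Coptic small letters standing in for a and b.
name = "\u2c81\u2c81\u2c81" + COPTIC_FULL_STOP + "\u2c83\u2c83\u2c83.com"

old_labels = split_labels(name)  # IDNA2003 client
new_labels = split_labels(name, IDNA2003_DOTS + COPTIC_FULL_STOP)  # hypothetical

print(len(old_labels), old_labels)
print(len(new_labels), new_labels)
```

The old client sees two labels and the hypothetical new client sees three,
which is exactly the mismatch described above.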
> >>
> >> I don't know how many IDNA2003 clients actually set the "allow
> >> unassigned" flag to true. It is obviously very dangerous, since the
> >> client cannot possibly know how to case-fold the new characters,
> >> including Coptic.
> >>
> >> (And this is also why Mark is wrong when he says that if clients are
> >> allowed to look up XN-labels with unassigned characters, then they
> >> should also be allowed to look up Unicode labels with unassigned
> >> characters.)
> >>
> >> Erik
> >>
> >> On Sun, Mar 22, 2009 at 2:33 PM, James Seng <james at seng.sg> wrote:
> >> > I think you misunderstood the "dot" problem. It is not that these
> >> > "dots" are allowed in domain names, but that they are identified as
> >> > separators, like ".".
> >> >
> >> > The main reason is that when a user switches to a CJK input method and
> >> > presses ".", most IMEs will emit U+3002 instead. If you do not
> >> > identify U+3002 as a separator, then a user will have to enter the CJK
> >> > IME, switch back to English, enter a ".", switch back to the CJK IME, etc.
> >> >
> >> > See http://tools.ietf.org/html/draft-jet-idnabis-cjk-localmapping-00
> >> >
> >> > -James Seng
> >> >
> >> > On Mon, Mar 23, 2009 at 1:51 AM, Erik van der Poel <erikv at google.com>
> >> > wrote:
> >> >> Another question from the summary:
> >> >>
> >> >>> A. Multiple characters are allowed as "dots" in domain names under
> >> >>> IDNA2003 and presumably under IDNAv2. This is a general problem for
> >> >>> all versions of IDNA but may be exacerbated by the variants for "dots"
> >> >>> that are permitted under IDNA2003 and IDNAv2. What is the WG view?
> >> >>
> >> >> In my view, non-ASCII dots should never have been allowed in IDNA2003.
> >> >> However, now that many IDNA2003 implementations have been distributed
> >> >> to users and a few stored domain names use these non-ASCII dots, some
> >> >> may feel that we have to support them (forever).
> >> >>
> >> >> Having said that, I am quite concerned about adding yet another
> >> >> non-ASCII dot in IDNAv2 (U+2CFE COPTIC FULL STOP) because IDNA2003
> >> >> includes a flag that allows for the lookup of unassigned (in Unicode
> >> >> 3.2) characters. Such applications would not only fail to case-fold
> >> >> post-Unicode-3.2 characters correctly, they would fail to divide the
> >> >> full domain name into individual labels, and since DNS labels are
> >> >> "owned" by different owners, this just seems like an invitation to
> >> >> further problems.
> >> >>
> >> >> In my view, the dot is a keyboard and UI issue. Of course, it would be
> >> >> nice if we could push ALL mappings out to the keyboard and UI, but, to
> >> >> use one of John's favorite words, this may be "unrealistic". ;-)
> >> >>
> >> >> Erik
> >> >> _______________________________________________
> >> >> Idna-update mailing list
> >> >> Idna-update at alvestrand.no
> >> >> http://www.alvestrand.no/mailman/listinfo/idna-update
> >> >>
> >> >
> >
> >
>