looking up XN-labels with unassigned characters

Mon Mar 23 19:15:26 CET 2009

Ah, OK, I see your point of view now.

I was thinking along the lines of using A-labels in outgoing HTML
because MSIE6 does not support U-labels. To me, it's somewhat
analogous to 7-bit email that could "pass through" any intermediary --
i.e. XN-labels should "pass through" any obstacle, unimpeded.

Display is a separate issue.

Erik
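
To illustrate the "pass through" idea above, here is a minimal sketch (not anything proposed on the list) of a purely syntactic test for an XN-label: plain ASCII letters, digits and hyphens, at most 63 octets, starting with "xn--". The name is_xn_label is made up for this example. Note that the check deliberately does not decode or validate the underlying Unicode string, which is exactly the point under debate.

```python
import re

# Letters, digits and hyphens only; no leading or trailing hyphen; 1 to 63 octets.
_LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")

def is_xn_label(label: str) -> bool:
    """Return True if the label is 7-bit LDH text carrying the "xn--" prefix.

    This is a syntactic test only: it does not decode the Punycode and
    does not check whether the result would be a valid A-label.
    """
    if _LDH_LABEL.fullmatch(label) is None:
        return False
    return label.lower().startswith("xn--")

checks = {
    "xn--bcdf-zna5c481a": is_xn_label("xn--bcdf-zna5c481a"),
    "example": is_xn_label("example"),
    "\u00e0bc": is_xn_label("\u00e0bc"),
}
```

Any label that passes this test can be carried by software that knows nothing about IDNA; how it is displayed is left to a later step.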

On Mon, Mar 23, 2009 at 10:43 AM, Mark Davis <mark at macchiato.com> wrote:
> I agree that it would be nice, and it would be equally nice to lookup any
> Unicode-Label*. That is, if there is any reason to be suspicious of a
> Unicode label, then we should be equally suspicious of the corresponding
> XN-Label. If XN-Labels are presumed to be safe, then we have no reason to
> mistrust the corresponding Unicode label.
>
> It may very well be that there is a particular set of cooperating
> systems, with a gatekeeper, and that anything in XN-Label form has been
> validated as being an A-Label. And in that case, it clearly doesn't make any
> sense to validate over and over again. But you'd better validate on input to
> that set of systems, otherwise you have no guarantee that it is in fact an
> A-Label.
>
> Similarly, you could have a particular set of cooperating systems, with a
> gatekeeper, and that anything in Unicode-Label form has been validated as
> being a U-Label. And in that case, it clearly doesn't make any sense to
> validate over and over again. But you'd better validate on input to that set
> of systems, otherwise you have no guarantee that it is in fact a U-Label.
>
> Mark
>
>
> On Mon, Mar 23, 2009 at 10:09, Erik van der Poel <erikv at google.com> wrote:
>>
>> Hi Mark,
>>
>> I was referring to your comment:
>>
>> "if it is important to check those requirements, then it is important
>> to test both A and U Labels; if it is not important to test them, then
>> it should not be a requirement for either one."
>>
>> All I'm saying is that it would be nice if a client could lookup any
>> label that starts with "xn--". This would allow e.g. Google Search to
>> convert legal U-labels to legal A-labels, and it would allow the
>> browser to follow the links, even if the browser implemented an old
>> version of IDNA.
>>
>> It would then be up to the browser to display the underlying Unicode
>> string in a safe way. Note that browsers are already expected to apply
>> certain restrictions when displaying the domain name.
>>
>> I do agree with your point that, now that we are at Unicode 5.1,
>> future additions to Unicode are not quite as impactful to the DNS.
>> We're in the "long tail".
>>
>> Erik
>>
>> On Mon, Mar 23, 2009 at 9:38 AM, Mark Davis <mark at macchiato.com> wrote:
>> > First off, I have not been pushing for allowing UNASSIGNED on lookup in
>> > IDNA2008. This is for two reasons:
>> >
>> >  1. We have had many Unicode versions since 3.2, so the urgency is not
>> >     as prominent.
>> >  2. Because IDNA2008 updates more regularly, there is less need.
>> >
>> > What I have been saying is that allowing UNASSIGNED on lookup wouldn't
>> > make a difference, and that's the case even if a character maps to ".".
>> >
>> > Let's take a specific example: àbc͸dèf.com, where the middle character,
>> > \u0378, is currently unassigned as far as the client is concerned
>> > (because it is back-reved), while the registry is on Unicode 6.0. The XN
>> > form is xn--bcdf-zna5c481a.com.
>> >
>> > Here's what happens when the client software (browser, emailer, etc.)
>> > looks the domain name up, depending on what \u0378 turns into under 6.0.
>> >
>> >  1. \u0378 becomes DISALLOWED. No problem. No conformant registry can
>> >     support it, even on Unicode 6.0; the lookup is denied.
>> >  2. \u0378 becomes PVALID. No problem - the lookup works.
>> >  3. \u0378 becomes mapped to X (assuming we allow mapping on lookup).
>> >
>> >     3.1 X is DISALLOWED, say "$". No problem. No conformant registry
>> >         can support it, even on Unicode 6.0; the lookup is denied.
>> >     3.2 X is PVALID, say "X". The lookup fails. The remapped domain
>> >         name would work as xn--bcxdf-qqa4d.com, but the original URL
>> >         would not work until the client is updated, or unless the user
>> >         learns to type X instead until s/he updates his/er client.
>> >     3.3 X is ".". The lookup fails. The remapped domain name would work
>> >         as xn--bc-iia.xn--df-7ia.com, but the original URL would not
>> >         work until the client is updated, or unless the user learns to
>> >         type X instead until s/he updates his/er client.
>> >
>> > Whether the character maps to a dot or not in Unicode 6.0 doesn't make
>> > any difference in the scenario. It just fails the lookup in a different
>> > way (3.3 instead of 3.2), but the lookup fails in either case.
>> >
>> > Mark
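
The scenario in the message above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: to_xn and lookup_name are made-up helper names, the mapping tables are hypothetical stand-ins for "what \u0378 turns into under 6.0", and the code does raw Punycode conversion with no IDNA validation at all. It uses Python's standard "punycode" codec.

```python
def to_xn(label: str) -> str:
    """Convert one label to its XN form using raw Punycode, with no validation."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")

def lookup_name(domain: str, mapping: dict) -> str:
    """Apply a per-character mapping, split on ".", and convert each label."""
    mapped = "".join(mapping.get(ch, ch) for ch in domain)
    return ".".join(to_xn(label) for label in mapped.split("."))

original = "\u00e0bc\u0378d\u00e8f.com"

# Back-reved client: \u0378 is unassigned, so it is passed through unchanged.
old_client = lookup_name(original, {})

# Hypothetical updated clients: \u0378 maps to a PVALID "x", or to ".".
new_client_x = lookup_name(original, {"\u0378": "x"})
new_client_dot = lookup_name(original, {"\u0378": "."})

for form in (old_client, new_client_x, new_client_dot):
    print(form)
```

The three printed names are all different, so the old client's query cannot match what a conformant registry would have registered, whether the character maps to another letter or to a dot.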
>> >
>> > On Sun, Mar 22, 2009 at 17:00, Erik van der Poel <erikv at google.com>
>> > wrote:
>> >>
>> >> Hi again James, thank you for the email. I am quite aware of the dot
>> >> issues in IDNA. I have first-hand experience with Japanese input
>> >> methods and their modes, and I understand the motivation for the
>> >> addition of non-ASCII dot processing in IDNA2003.
>> >>
>> >> The issue with U+2CFE COPTIC FULL STOP is a bit subtle, so let me
>> >> explain. U+2CFE was added in Unicode 4.1. This means that, from the
>> >> point of view of an IDNA2003 implementation, it is simply an
>> >> unassigned character. Let's say we have a domain name like:
>> >>
>> >> aaa <U+2CFE> bbb . com
>> >>
>> >> Suppose that aaa and bbb are Coptic characters, and the typist
>> >> happened to have a Coptic input method (though I have no idea whether
>> >> such things exist!). Further, let's suppose that the client is using
>> >> IDNA2003 with the flag "allow unassigned" set to true. If aaa and bbb
>> >> are already lower-case, the client will do the right thing with them
>> >> (leaving them as is). However, the client will not know that U+2CFE is
>> >> a new dot-like character, so it will treat the entire sequence
>> >> "aaa<U+2CFE>bbb" as a single label. It will then encode it in Punycode
>> >> (including the dot-like character), and try to resolve that in DNS.
>> >>
>> >> Of course, this will not work because the intention was to resolve
>> >> aaa.bbb.com, not aaa<U+2CFE>bbb.com. In other words, a new client and
>> >> an old client would resolve this name differently.
>> >>
>> >> I don't know how many IDNA2003 clients actually set the "allow
>> >> unassigned" flag to true. It is obviously very dangerous, since the
>> >> client cannot possibly know how to case-fold the new characters,
>> >> including Coptic.
>> >>
>> >> (And this is also why Mark is wrong when he says that if clients are
>> >> allowed to lookup XN-labels with unassigned characters, then they
>> >> should also be allowed to lookup Unicode labels with unassigned
>> >> characters.)
>> >>
>> >> Erik
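
The label-splitting difference described in the message above can be shown with a short sketch. The four IDNA2003 separators are those listed in RFC 3490; the "newer" table that adds U+2CFE is hypothetical and only models the proposal being discussed. split_labels is a made-up helper, and the Coptic letters stand in for "aaa" and "bbb".

```python
# Dot-like separators recognised by IDNA2003 (RFC 3490, section 3.1).
IDNA2003_DOTS = {"\u002e", "\u3002", "\uff0e", "\uff61"}

# Hypothetical newer table that also treats U+2CFE COPTIC FULL STOP as a dot.
NEWER_DOTS = IDNA2003_DOTS | {"\u2cfe"}

def split_labels(name: str, dots: set) -> list:
    """Split a domain name into labels at any character in the given dot set."""
    labels, current = [], []
    for ch in name:
        if ch in dots:
            labels.append("".join(current))
            current = []
        else:
            current.append(ch)
    labels.append("".join(current))
    return labels

aaa = "\u2c81\u2c81\u2c81"  # three COPTIC SMALL LETTER ALFA
bbb = "\u2c83\u2c83\u2c83"  # three COPTIC SMALL LETTER VIDA
name = aaa + "\u2cfe" + bbb + ".com"

old_labels = split_labels(name, IDNA2003_DOTS)  # U+2CFE stays inside a label
new_labels = split_labels(name, NEWER_DOTS)     # U+2CFE acts as a separator
```

The old table yields two labels and the newer one yields three, so the two clients would send different queries for the same typed name.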
>> >>
>> >> On Sun, Mar 22, 2009 at 2:33 PM, James Seng <james at seng.sg> wrote:
>> >> > I think you misunderstood the "dot" problem. It is not that these
>> >> > "dots" are allowed in domain names, but that they are identified as
>> >> > separators, like ".".
>> >> >
>> >> > The main reason is that when a user switches to CJK input and
>> >> > presses ".", most IMEs will emit U+3002 instead. If you do not
>> >> > identify U+3002 as a separator, then the user has to enter the CJK
>> >> > IME, switch back to English, enter a ".", switch back to the CJK IME,
>> >> > etc.
>> >> >
>> >> > See http://tools.ietf.org/html/draft-jet-idnabis-cjk-localmapping-00
>> >> >
>> >> > -James Seng
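
The separator behaviour described in the message above can be observed with Python's standard "idna" codec, which implements IDNA2003 (RFC 3490). This is only a small demonstration; the host name is an arbitrary example.

```python
# The three non-ASCII separators that IDNA2003 treats like ".".
NON_ASCII_DOTS = ("\u3002", "\uff0e", "\uff61")

results = []
for dot in NON_ASCII_DOTS:
    # Build the name as an IME might produce it, with a non-ASCII dot.
    typed = "www" + dot + "example" + dot + "com"
    # The codec splits on the IDNA2003 dots and rejoins labels with ".".
    results.append(typed.encode("idna"))

print(results)  # each entry is b'www.example.com'
```

The user never has to leave the CJK input mode: whichever of these dots the IME emits, the name that reaches the DNS uses U+002E.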
>> >> >
>> >> > On Mon, Mar 23, 2009 at 1:51 AM, Erik van der Poel <erikv at google.com>
>> >> > wrote:
>> >> >> Another question from the summary:
>> >> >>
>> >> >>> A. Multiple characters are allowed as "dots" in domain names under
>> >> >>> IDNA2003 and presumably under IDNAv2. This is a general problem for
>> >> >>> all versions of IDNA but may be exacerbated by the variants for
>> >> >>> "dots" that are permitted under IDNA2003 and IDNAv2. What is the WG
>> >> >>> view?
>> >> >>
>> >> >> In my view, non-ASCII dots should never have been allowed in
>> >> >> IDNA2003. However, now that many IDNA2003 implementations have been
>> >> >> distributed to users and a few stored domain names use these
>> >> >> non-ASCII dots, some may feel that we have to support them
>> >> >> (forever).
>> >> >>
>> >> >> Having said that, I am quite concerned about adding yet another
>> >> >> non-ASCII dot in IDNAv2 (U+2CFE COPTIC FULL STOP) because IDNA2003
>> >> >> includes a flag that allows for the lookup of unassigned (in Unicode
>> >> >> 3.2) characters. Such applications would not only fail to case-fold
>> >> >> post-Unicode-3.2 characters correctly, they would fail to divide the
>> >> >> full domain name into individual labels, and since DNS labels are
>> >> >> "owned" by different owners, this just seems like an invitation to
>> >> >> further problems.
>> >> >>
>> >> >> In my view, the dot is a keyboard and UI issue. Of course, it would
>> >> >> be nice if we could push ALL mappings out to the keyboard and UI,
>> >> >> but, to use one of John's favorite words, this may be
>> >> >> "unrealistic". ;-)
>> >> >>
>> >> >> Erik
>> >> >> _______________________________________________
>> >> >> Idna-update mailing list
>> >> >> Idna-update at alvestrand.no
>> >> >> http://www.alvestrand.no/mailman/listinfo/idna-update
>> >> >>
>> >> >