looking up XN-labels with unassigned characters

Erik van der Poel erikv at google.com
Mon Mar 23 18:09:26 CET 2009


Hi Mark,

I was referring to your comment:

"if it is important to check those requirements, then it is important
to test both A and U Labels; if it is not important to test them, then
it should not be a requirement for either one."

All I'm saying is that it would be nice if a client could look up any
label that starts with "xn--". This would allow e.g. Google Search to
convert legal U-labels to legal A-labels, and it would allow the
browser to follow the links, even if the browser implemented an old
version of IDNA.

It would then be up to the browser to display the underlying Unicode
string in a safe way. Note that browsers are already expected to apply
certain restrictions when displaying the domain name.
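
As a rough illustration of that split between lookup and display, here
is a minimal Python sketch. It is not taken from any browser: the
function names (decode_xn_label, unassigned_chars, display_form) are
made up for this example, and "safe display" is reduced to one assumed
rule, namely falling back to the raw XN-label when the decoded string
contains a code point that the local Unicode tables report as
unassigned (category Cn). The label itself would still be sent to the
DNS unchanged.

```python
import unicodedata

ACE_PREFIX = "xn--"


def decode_xn_label(label):
    """Decode an XN-label to Unicode with no validity checks at all."""
    if not label.lower().startswith(ACE_PREFIX):
        return label  # plain ASCII label, nothing to decode
    # Python's built-in "punycode" codec handles the part after the prefix.
    return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


def unassigned_chars(text):
    """List code points that this Python build's Unicode data marks as Cn."""
    return ["U+%04X" % ord(ch) for ch in text if unicodedata.category(ch) == "Cn"]


def display_form(label):
    """Assumed display policy: show Unicode only if every character is known."""
    decoded = decode_xn_label(label)
    return label if unassigned_chars(decoded) else decoded


if __name__ == "__main__":
    print(display_form("xn--bcher-kva"))  # all characters assigned
    # Build a label containing U+0378, which is unassigned at the time of writing.
    odd = ACE_PREFIX + "a\u0378".encode("punycode").decode("ascii")
    print(display_form(odd))  # stays in its xn-- form
```

On a client whose Unicode data predates a character, display_form keeps
the xn-- form visible while the lookup can still proceed; which display
policy a real browser applies is outside the scope of this sketch.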

I do agree with your point that, now that we are at Unicode 5.1,
future additions to Unicode are not quite as impactful to the DNS.
We're in the "long tail".

Erik

On Mon, Mar 23, 2009 at 9:38 AM, Mark Davis <mark at macchiato.com> wrote:
> First off, I have not been pushing for allowing UNASSIGNED on lookup in
> IDNA2008. This is for two reasons:
>
> 1. We have had many Unicode versions since 3.2, so the urgency is not as
>    prominent.
> 2. Because IDNA2008 updates more regularly, there is less need.
>
> What I have been saying is that allowing UNASSIGNED on lookup wouldn't make
> a difference, and that's the case even if a character maps to ".".
>
> Let's take a specific example: àbc͸dèf.com, where the middle character,
> \u0378, is currently unassigned as far as the client is concerned (because
> it is back-reved), while the registry is on Unicode 6.0. The XN form is
> xn--bcdf-zna5c481a.com.
>
> Here's what happens when the client software (browser, emailer, etc) looks
> the domain name up, depending on what \u0378 turns into under 6.0.
>
> 1. \u0378 becomes DISALLOWED. No problem. No conformant registry can support
>    it, even on Unicode 6.0; the lookup is denied.
> 2. \u0378 becomes PVALID. No problem - the lookup works.
> 3. \u0378 becomes mapped to X (assuming we allow mapping on lookup):
>
>    3.1. X is DISALLOWED, say "$". No problem. No conformant registry can
>         support it, even on Unicode 6.0; the lookup is denied.
>    3.2. X is PVALID, say "X". The lookup fails. The remapped domain name
>         would work as xn--bcxdf-qqa4d.com, but the original URL would not
>         work until the client is updated, or unless the user learns to
>         type X instead until s/he updates his/er client.
>    3.3. X is ".". The lookup fails. The remapped domain name would work as
>         xn--bc-iia.xn--df-7ia.com, but the original URL would not work
>         until the client is updated, or unless the user learns to type X
>         instead until s/he updates his/er client.
>
> Whether the character maps to a dot or not in Unicode 6.0 doesn't make any
> difference in the scenario. It just fails the lookup in a different way (3.3
> instead of 3.2), but the lookup fails in either case.
>
> Mark
>
> On Sun, Mar 22, 2009 at 17:00, Erik van der Poel <erikv at google.com> wrote:
>>
>> Hi again James, thank you for the email. I am quite aware of the dot
>> issues in IDNA. I have first-hand experience with Japanese input
>> methods and their modes, and I understand the motivation for the
>> addition of non-ASCII dot processing in IDNA2003.
>>
>> The issue with U+2CFE COPTIC FULL STOP is a bit subtle, so let me
>> explain. U+2CFE was added in Unicode 4.1. This means that, from the
>> point of view of an IDNA2003 implementation, it is simply an
>> unassigned character. Let's say we have a domain name like:
>>
>> aaa <U+2CFE> bbb . com
>>
>> Suppose that aaa and bbb are Coptic characters, and the typist
>> happened to have a Coptic input method (though I have no idea whether
>> such things exist!). Further, let's suppose that the client is using
>> IDNA2003 with the flag "allow unassigned" set to true. If aaa and bbb
>> are already lower-case, the client will do the right thing with them
>> (leaving them as is). However, the client will not know that U+2CFE is
>> a new dot-like character, so it will treat the entire sequence
>> "aaa<U+2CFE>bbb" as a single label. It will then encode it in Punycode
>> (including the dot-like character), and try to resolve that in DNS.
>>
>> Of course, this will not work because the intention was to resolve
>> aaa.bbb.com, not aaa<U+2CFE>bbb.com. In other words, a new client and
>> an old client would resolve this name differently.
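
The difference between the two clients can be sketched in a few lines
of Python. This is only an illustration: split_labels is a hypothetical
helper, the Coptic letters stand in for "aaa" and "bbb", and NEWER_DOTS
models a rule set that treats U+2CFE as a separator, which is an
assumption about how such a client would behave rather than a quote
from any specification.

```python
import re

# The four characters IDNA2003 (RFC 3490) recognizes as label separators.
IDNA2003_DOTS = "\u002E\u3002\uFF0E\uFF61"
# Assumed newer rule set that additionally treats U+2CFE as a dot.
NEWER_DOTS = IDNA2003_DOTS + "\u2CFE"


def split_labels(name, dots):
    """Split a domain name into labels on any of the given dot characters."""
    return re.split("[" + re.escape(dots) + "]", name)


# Stand-ins for "aaa" and "bbb": COPTIC SMALL LETTER ALFA and VIDA.
name = "\u2C81\u2C81\u2C81" + "\u2CFE" + "\u2C83\u2C83\u2C83" + ".com"

old_view = split_labels(name, IDNA2003_DOTS)  # U+2CFE stays inside a label
new_view = split_labels(name, NEWER_DOTS)     # U+2CFE separates two labels

if __name__ == "__main__":
    print(len(old_view), old_view)
    print(len(new_view), new_view)
```

The old client sees two labels and would Punycode-encode the first one
with U+2CFE still inside it, while the newer rule set yields three
labels, so the two clients end up querying different names.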
>>
>> I don't know how many IDNA2003 clients actually set the "allow
>> unassigned" flag to true. It is obviously very dangerous, since the
>> client cannot possibly know how to case-fold the new characters,
>> including Coptic.
>>
>> (And this is also why Mark is wrong when he says that if clients are
>> allowed to look up XN-labels with unassigned characters, then they
>> should also be allowed to look up Unicode labels with unassigned
>> characters.)
>>
>> Erik
>>
>> On Sun, Mar 22, 2009 at 2:33 PM, James Seng <james at seng.sg> wrote:
>> > I think you misunderstood the "dot" problem. It is not that these
>> > "dots" are allowed in domain names, but that they are identified as
>> > separators, like ".".
>> >
>> > The main reason is that when a user switches to CJK input and
>> > presses ".", most IMEs will spit out U+3002 instead. If you do not
>> > identify U+3002 as a separator, then a user will have to enter the CJK
>> > IME, switch back to English, enter a ".", switch back to the CJK IME, etc.
>> >
>> > See http://tools.ietf.org/html/draft-jet-idnabis-cjk-localmapping-00
>> >
>> > -James Seng
>> >
>> > On Mon, Mar 23, 2009 at 1:51 AM, Erik van der Poel <erikv at google.com>
>> > wrote:
>> >> Another question from the summary:
>> >>
>> >>> A. Multiple characters are allowed as "dots" in domain names under
>> >>> IDNA2003 and presumably under IDNAv2. This is a general problem for
>> >>> all versions of IDNA but may be exacerbated by the variants for "dots"
>> >>> that are permitted under IDNA2003 and IDNAv2. What is the WG view?
>> >>
>> >> In my view, non-ASCII dots should never have been allowed in IDNA2003.
>> >> However, now that many IDNA2003 implementations have been distributed
>> >> to users and a few stored domain names use these non-ASCII dots, some
>> >> may feel that we have to support them (forever).
>> >>
>> >> Having said that, I am quite concerned about adding yet another
>> >> non-ASCII dot in IDNAv2 (U+2CFE COPTIC FULL STOP) because IDNA2003
>> >> includes a flag that allows for the lookup of unassigned (in Unicode
>> >> 3.2) characters. Such applications would not only fail to case-fold
>> >> post-Unicode-3.2 characters correctly, they would fail to divide the
>> >> full domain name into individual labels, and since DNS labels are
>> >> "owned" by different owners, this just seems like an invitation to
>> >> further problems.
>> >>
>> >> In my view, the dot is a keyboard and UI issue. Of course, it would be
>> >> nice if we could push ALL mappings out to the keyboard and UI, but, to
>> >> use one of John's favorite words, this may be "unrealistic". ;-)
>> >>
>> >> Erik
>> >> _______________________________________________
>> >> Idna-update mailing list
>> >> Idna-update at alvestrand.no
>> >> http://www.alvestrand.no/mailman/listinfo/idna-update
>> >>
>> >
>
>

