non-ASCII dots

Mon Mar 23 01:00:15 CET 2009

Hi again James, thank you for the email. I am quite aware of the dot
issues in IDNA. I have first-hand experience with Japanese input
methods and their modes, and I understand the motivation for the
addition of non-ASCII dot processing in IDNA2003.

The issue with U+2CFE COPTIC FULL STOP is a bit subtle, so let me
explain. U+2CFE was added in Unicode 4.1. This means that, from the
point of view of an IDNA2003 implementation, it is simply an
unassigned character. Let's say we have a domain name like:

aaa<U+2CFE>bbb.com

Suppose that aaa and bbb are Coptic characters, and the typist
happened to have a Coptic input method (though I have no idea whether
such things exist!). Further, let's suppose that the client is using
IDNA2003 with the flag "allow unassigned" set to true. If aaa and bbb
are already lower-case, the client will do the right thing with them
(leaving them as is). However, the client will not know that U+2CFE is
a new dot-like character, so it will treat the entire sequence
"aaa<U+2CFE>bbb" as a single label. It will then encode it in Punycode
(including the dot-like character), and try to resolve that in DNS.

Of course, this will not work because the intention was to resolve
aaa.bbb.com, not aaa<U+2CFE>bbb.com. In other words, a new client and
an old client would resolve this name differently.
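
The divergence can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (split_labels, to_ace) are hypothetical, Nameprep is skipped entirely, the "newer" separator set simply adds U+2CFE to the four IDNA2003 dots, and two Coptic small letters stand in for "aaa" and "bbb".

```python
import re

# Label separators recognized by IDNA2003 (RFC 3490, section 3.1).
IDNA2003_DOTS = "\u002E\u3002\uFF0E\uFF61"
# Assumption: a newer client that additionally treats U+2CFE as a dot.
NEWER_DOTS = IDNA2003_DOTS + "\u2CFE"

def split_labels(name, dots):
    """Split a domain name on any character in `dots` (hypothetical helper)."""
    return re.split("[" + re.escape(dots) + "]", name)

def to_ace(label):
    """Punycode-encode a non-ASCII label; Nameprep is deliberately omitted."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")

# U+2C81 and U+2C83 are Coptic small letters standing in for "aaa" and "bbb".
name = "\u2C81\u2CFE\u2C83.com"

old_client = [to_ace(label) for label in split_labels(name, IDNA2003_DOTS)]
new_client = [to_ace(label) for label in split_labels(name, NEWER_DOTS)]

print(".".join(old_client))  # two labels: one ACE label containing U+2CFE, then "com"
print(".".join(new_client))  # three labels: two ACE labels, then "com"
```

The old client ends up querying a two-label name whose first label still contains the dot-like character, while the newer client queries a three-label name.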

I don't know how many IDNA2003 clients actually set the "allow
unassigned" flag to true. It is obviously very dangerous, since the
client cannot possibly know how to case-fold the new characters,
including Coptic.
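
As a rough check of the "unassigned" claim, Python's standard stringprep module exposes Table A.1 of RFC 3454 (code points unassigned in Unicode 3.2). The sketch below assumes a reasonably recent Python whose own Unicode database already knows the Coptic block; the helper name is hypothetical.

```python
import stringprep
import unicodedata

def unassigned_in_unicode_3_2(ch):
    """True if `ch` is in RFC 3454 Table A.1 (unassigned in Unicode 3.2)."""
    return stringprep.in_table_a1(ch)

# U+2C80 COPTIC CAPITAL LETTER ALFA and U+2CFE COPTIC FULL STOP were both
# added after Unicode 3.2, so an IDNA2003 client has no data for them.
for cp in (0x2C80, 0x2CFE):
    ch = chr(cp)
    print(f"U+{cp:04X}", unicodedata.name(ch), unassigned_in_unicode_3_2(ch))

# With current Unicode data the capital letter folds to its small form;
# a client limited to Unicode 3.2 tables has no such mapping available.
print("\u2C80".lower() == "\u2C81")
```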

(And this is also why Mark is wrong when he says that if clients are
allowed to look up XN-labels with unassigned characters, then they
should also be allowed to look up Unicode labels with unassigned
characters.)

Erik

On Sun, Mar 22, 2009 at 2:33 PM, James Seng <james at seng.sg> wrote:
> I think you misunderstood the "dot" problem. It is not that these
> "dots" are allowed in domain names, but that they are recognized as
> separators, like ".".
>
> The main reason is that when a user switches to a CJK input method
> and presses ".", most IMEs will emit U+3002 instead. If U+3002 is not
> recognized as a separator, the user has to enter the CJK IME, switch
> back to English, type ".", switch back to the CJK IME, and so on.
>
> See http://tools.ietf.org/html/draft-jet-idnabis-cjk-localmapping-00
>
> -James Seng
>
> On Mon, Mar 23, 2009 at 1:51 AM, Erik van der Poel <erikv at google.com> wrote:
>> Another question from the summary:
>>
>>> A. Multiple characters are allowed as "dots" in domain names under
>>> IDNA2003 and presumably under IDNAv2. This is a general problem for
>>> all versions of IDNA but may be exacerbated by the variants for "dots"
>>> that are permitted under IDNA2003 and IDNAv2. What is the WG view?
>>
>> In my view, non-ASCII dots should never have been allowed in IDNA2003.
>> However, now that many IDNA2003 implementations have been distributed
>> to users and a few stored domain names use these non-ASCII dots, some
>> may feel that we have to support them (forever).
>>
>> Having said that, I am quite concerned about adding yet another
>> non-ASCII dot in IDNAv2 (U+2CFE COPTIC FULL STOP) because IDNA2003
>> includes a flag that allows for the lookup of unassigned (in Unicode
>> 3.2) characters. Such applications would not only fail to case-fold
>> post-Unicode-3.2 characters correctly, they would fail to divide the
>> full domain name into individual labels, and since DNS labels are
>> "owned" by different owners, this just seems like an invitation to
>> further problems.
>>
>> In my view, the dot is a keyboard and UI issue. Of course, it would be
>> nice if we could push ALL mappings out to the keyboard and UI, but, to
>> use one of John's favorite words, this may be "unrealistic". ;-)
>>
>> Erik