Standards and localization (was Dot-mapping)

Wed Dec 12 22:02:02 CET 2007

--On Tuesday, 11 December, 2007 16:59 +0000 Gervase Markham
<gerv at mozilla.org> wrote:

> Erik van der Poel wrote:
>> Maybe I should not have focussed on the spoofing examples in
>> my previous email. This is not only a security issue. It is an
>> interoperability issue too. We have a number of possibilities
>> for IDNA200X:
>> 
>> (1) make the mappings (dots, case, nfkc) part of the protocol
>> (2) make them a normative reference
>> (3) make them an informative reference
>> (4) don't reference them at all
> 
> Although surely it's possible to make this decision
> differently for the case of dots and for everything else?

I have a deep distaste for special cases when they can be
avoided, so the first proposal you will always see from me is
for no special cases.  But, that said, if there is consensus
that any given special case is worth making and living with
forever, we can do it.  I think the argument for it would need
to be especially strong if the special case is either unstable (i.e.,
it might need to be updated with future versions of Unicode) or,
worse, if it left us needing to defend a position like "their
dots get special treatment because their script was coded by
Unicode 3.2 (or Unicode 5.0) and yours don't because your script
was coded later".  In addition to simple matters of cultural
equity, that would be a nasty position to be caught in given the
current politicized environment surrounding IDNs.

That comment applies very generally.  In principle, we could
preserve all of the mappings of IDNA2003 (with the possible
exception of the dot one, see below) and apply this new work
only for Unicode 4.0 and beyond.  That would still give us the
all-important version agility.  It would simply require that
applications (both registration and lookup) look each character
up in a table to determine whether it existed in Unicode 3.2 and
then to apply different rules depending on the answer.  It isn't
clear to me at the moment what we would do with a label that
contained a mix of Unicode 3.2 and later characters but, given
that such labels should be fairly rare (since characters are
normally added by script, rather than individually), I'm sure we
could sort something out if we were motivated.  As a
sometime-implementer, especially when I consider embedded
implementations on small devices, I find the complexity of such
a model fairly frightening, but I want to stress that we could
do it if it were generally judged to be important enough (just
as we could have coded language information into IDNs if we
thought that was sufficiently important).
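
A minimal sketch of the per-character lookup described above, in
Python.  It assumes that the frozen Unicode 3.2 database shipped in
the Python standard library (unicodedata.ucd_3_2_0) is an acceptable
stand-in for "the table", and that an unassigned code point (general
category Cn) in that database means the character did not exist in
Unicode 3.2.  The function names and the three result strings are
hypothetical, chosen only for illustration.

```python
import unicodedata

# The standard library ships a frozen copy of the Unicode 3.2 database.
UCD_3_2 = unicodedata.ucd_3_2_0


def in_unicode_3_2(ch: str) -> bool:
    """True if the code point was assigned in Unicode 3.2.

    Assumption: "unassigned" is detected as general category 'Cn'.
    """
    return UCD_3_2.category(ch) != "Cn"


def classify_label(label: str) -> str:
    """Hypothetical helper: report which rule set a label would fall under."""
    if not label:
        raise ValueError("empty label")
    flags = {in_unicode_3_2(ch) for ch in label}
    if flags == {True}:
        return "unicode-3.2"   # the IDNA2003 mappings would apply
    if flags == {False}:
        return "post-3.2"      # the new rules would apply
    return "mixed"             # the unresolved case discussed above
```

Even in this toy form, every lookup and every registration has to
carry the version table around, which is the complexity that worries
me for small devices.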

> As John says, dot is special, because it's the delimiter.

But that is the problem.  If we could somehow write the
specification precisely enough that dots other than ASCII period
(full stop) would never appear except in contexts in which IDNA
processing would be applied, if we could accept an "old dots get
special treatment and new ones don't" rule, and if we could
guarantee that they would never leak, we might be able to get
away with dot-variants as part of the protocol.  But the
existing IDNA2003 text is not well enough written.  It specifies
how those dots are handled during IDNA processing, but not how
they would be handled if no ACE ("punycode" with prefix) or
non-ASCII labels appear, much less how they would be handled by
domain name processing that does not use IDNA at all.  Worse, we
know that leaks of strings from one environment to another
cannot be prevented and we know that from far more years of
experience than we have with the much less important requirement
that ACE strings not be displayed to users except under special
circumstances.
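
To make the leak concrete, here is a small Python sketch.  It assumes
only the dot list from RFC 3490 (U+3002, U+FF0E and U+FF61 in addition
to U+002E); the two function names are hypothetical.  One function
splits a name the way IDNA-aware code would, the other the way code
that knows nothing about IDNA would.

```python
# Dots that RFC 3490 (IDNA2003) says must be recognized as label
# separators, in addition to U+002E FULL STOP.
IDNA2003_DOT_VARIANTS = "\u3002\uff0e\uff61"
_TO_ASCII_DOT = {ord(c): "." for c in IDNA2003_DOT_VARIANTS}


def split_idna_aware(name):
    """Split into labels after mapping the IDNA2003 dot variants to '.'."""
    return name.translate(_TO_ASCII_DOT).split(".")


def split_ascii_only(name):
    """Split the way software with no knowledge of IDNA would."""
    return name.split(".")


name = "www\u3002example\uff0ecom"
print(split_idna_aware(name))   # three labels
print(split_ascii_only(name))   # one label containing the odd dots
```

The same string is three labels in one environment and a single label
in the other, and nothing in IDNA2003 says which answer is right once
the string has left the IDNA-aware environment.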

> Could IDNA200x specify a list of dot-like codepoints which
> MUST be mapped to dot, but not say anything about case and so
> on?

Could we?  Yes, see above.  

Should we?  I'd still argue that we should not.  Martin has
pointed out that these dot-variants cannot appear in IRIs, so
that is one important context in which they are not permitted
regardless of what we do.  If the list is not extensible, we run
into problems with scripts that have not been coded but whose
users believe that their dots are equally important.   We can't
impose rules on applications that do not support IDNA no matter
how strongly we write those rules.

And, even if those were not issues, we would need to do a lot of
work to make sure the rules about processing and processing
order are very, very precise.   To take just one example,
service location records use a format in their left-most label
or two that excludes them from IDNA interpretation or
processing.  To use some crude transliterations for illustration
purposes, 
    _xn--mxa1be.example.com
is manifestly invalid for several reasons, but it is not clear
whether
   _tcp.xn--mxahbxey0c.com
is valid as a representation of _tcp.εχαμπλε.com, or if
the second label is just a string that cannot be interpreted as
an IDN.  If one picks the latter (and I read the "slot" text in
RFC 3490 to say exactly that), one is led into a variety of
strange states, but you can perhaps see what the use of
non-ASCII dots in either would lead to.

> I must confess that my attention to IDN topics has wandered of
> late, so in diving back in, I want to issue a pre-emptive
> apology if I suggest something which has already been rejected
> for good reason.

No, this dot issue is fairly new.    I blundered into it while
trying to think through something else.

My guess is that what is needed here is a more explicit version
of something we've had for email local parts for years.  In that
case, the standard is extremely permissive. Quoted strings with
embedded blanks and control characters are permitted, addresses
are case-sensitive, etc.  But we tell people that, in the
interest of interoperability, they had better not depend on any
of those things and should design things at the receiving server
so that addresses are case-insensitive and do not contain /
depend on anything tricky.   I think what is needed here is not
a part of the IDNA standard but a separate document that says, 

	If you are an implementer of something that calls a
	resolver and are in your right mind, you will accept
	anything plausible, including these explicit
	recommendations (the IDNA2003 list goes here), as a dot,
	noting that they are all prohibited (as punctuation) in
	IDN labels.  You will do case mapping for any script for
	which that is defined (with a pointer to the Unicode
	mapping table), and so forth.  You will immediately map
	those to the target characters and make whatever
	decisions seem sensible to you about warning the user
	about what you have done (such decisions might range
	from silence to 'do you want me to do this' messages).

	And any registry in its right mind (including zones deep
	in the tree) should accept only those strings for
	registration that can be reverse-mapped from A-labels
	back to the U-labels, i.e., with no mappings from one
	character to another at all.

I don't know what is in "and so forth", but I imagine that other
cases will be discovered.  
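
The registry half of that advice can be sketched in Python as a
stability check.  This is an illustration under stated assumptions,
not a proposal for the actual rules: the mapping used here (case
folding followed by NFKC) is only a stand-in for whatever mappings
applications end up doing and is not nameprep, the function names are
hypothetical, and the sketch only makes sense for labels that contain
at least one non-ASCII character.

```python
import unicodedata


def map_for_lookup(label: str) -> str:
    """Stand-in for application-side mapping (assumption): casefold + NFKC."""
    return unicodedata.normalize("NFKC", label.casefold())


def to_a_label(u_label: str) -> str:
    """Punycode-encode a non-ASCII label and add the ACE prefix."""
    return "xn--" + u_label.encode("punycode").decode("ascii")


def to_u_label(a_label: str) -> str:
    """Strip the ACE prefix and punycode-decode."""
    return a_label[len("xn--"):].encode("ascii").decode("punycode")


def registry_should_accept(u_label: str) -> bool:
    """Accept only labels that no mapping would change and that
    survive the trip U-label -> A-label -> U-label unchanged."""
    unchanged = map_for_lookup(u_label) == u_label
    round_trips = to_u_label(to_a_label(u_label)) == u_label
    return unchanged and round_trips
```

Under this stand-in mapping a lowercase Greek label is accepted, while
the same label with an initial capital, or with a fullwidth Latin
letter in it, is refused because mapping would change it.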

I'd be happy to work with others on such a document but don't
want to take the lead on it myself.  The reason for that is not
substantive but is just because my doing so would result in too
much delay.

     john