Making progress on the mapping question

Mon Mar 30 17:19:55 CEST 2009

--On Monday, March 30, 2009 07:41 -0400 Vint Cerf
<vint at google.com> wrote:

> There has not been any significant objection to the proposals
> made during the IETF 74 meeting to apply some form of
> mapping during lookup. The two questions outstanding are:
> 
> 1. what mapping function should be used?
> 2. how should it be used?
> 
> As Harald and others have observed, if it is applied before an
> IDNA2008-style lookup, we will not find new characters
> permitted under IDNA2008 if they happen to be mapped under
> IDNA2003. This seems to argue for:
> 
> 1. first look up under IDNA2008 rules
> 2. If a domain name is found, return the corresponding results
> 3. If a domain name is not found, apply IDNA2003 mapping
> 4. If a domain name is found, return the results
> 5. If a domain name is not found, report that no such domain
> name exists
> 
> One final point. It seems to me that we should put the
> IDNA2003 mapping function into stasis, making no future
> changes to it, and use the IDNA2008 framework to accommodate
> any new additions into Unicode versions as they are
> released. Assuming we have ample warning of a new version,
> we can even prepare tables suited to the new release ahead of
> time so as to have them available at the point where a new
> version of Unicode is adopted.
> 
> Could the WG please analyze this proposition, point out flaws
> and suggested corrections for them?

Vint,

Sorry for my silence since Tuesday.  IAB and related activities
took up my time during the last half of last week and I needed
to spend the weekend reading all of the list traffic and trying
to work through case analyses, especially those implied by (c)
below.

I believe that the proposition above is correct and that almost
any other arrangement would take us back to IDNA2003, wiping out
most of the other substantive changes we've made with
IDNA2008.

Some additional clarifications and questions:

(a) I believe we should adopt the M-label terminology that
Patrik suggested some time ago.  If there are no objections,
I'll work it into the next version of Defs.

(b) "If a domain name is found" turns out to be slightly
ambiguous because one can have either "no label at that node" or
"label found, but no record of the type requested".  I believe
that the first interpretation is the correct one because, if the
two A-labels are different and the 2008-related one appears, any
appearance of that label would indicate that the zone, or at
least that label, is IDNA2008-capable.  That would mean that
finding a label at the node, regardless of whether the
particular QTYPE has any value, would stop the search.  In
either event, I need advice from those who are more immersed in
current preferred DNS terminology than I am to suggest the exact
words I should use in Protocol.
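
To make the two readings of "found" in (b) concrete, here is a
minimal Python sketch of the lookup order under the first reading,
in which only "no label at that node" (NXDOMAIN) triggers the
IDNA2003 fallback.  Everything named here is hypothetical: Status,
lookup_with_fallback, the query callback, and the placeholder names
stand in for a real resolver and real A-labels, and none of it is
proposed text for Protocol.

```python
from enum import Enum
from typing import Callable, List, Optional, Tuple


class Status(Enum):
    NXDOMAIN = 1  # no label at that node
    NODATA = 2    # label found, but no record of the requested type
    ANSWER = 3    # label found, with records of the requested type


Result = Tuple[Status, List[str]]
Query = Callable[[str, str], Result]  # hypothetical resolver: (name, qtype) -> Result


def lookup_with_fallback(a_label_2008: Optional[str],
                         a_label_2003: Optional[str],
                         qtype: str,
                         query: Query) -> Result:
    """Try the IDNA2008 A-label first; fall back only on NXDOMAIN."""
    if a_label_2008 is not None:
        status, records = query(a_label_2008, qtype)
        if status is not Status.NXDOMAIN:
            # The label exists, so the search stops here even if the
            # requested QTYPE has no data (first reading of "found").
            return status, records
    if a_label_2003 is None or a_label_2003 == a_label_2008:
        return Status.NXDOMAIN, []
    return query(a_label_2003, qtype)


def make_query(zone: dict) -> Query:
    """Toy in-memory resolver, an assumption for illustration only."""
    def query(name: str, qtype: str) -> Result:
        if name not in zone:
            return Status.NXDOMAIN, []
        records = zone[name].get(qtype, [])
        return (Status.ANSWER, records) if records else (Status.NODATA, [])
    return query


# Placeholder names, not real A-labels.
zone = {
    "xn--new.example": {"MX": ["10 mail.example."]},
    "xn--old.example": {"A": ["192.0.2.1"]},
}
query = make_query(zone)
print(lookup_with_fallback("xn--new.example", "xn--old.example", "A", query))
print(lookup_with_fallback("xn--gone.example", "xn--old.example", "A", query))
```

In the first call the 2008-style name exists but has no A record, so
the search stops with NODATA and the 2003-style name is never
queried.  Under the second reading, that same case would fall through
to the IDNA2003 lookup instead.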

(c) The above would imply that we apply _all_ IDNA2003 mappings
and lookups if the IDNA2008 lookup of (1) fails.  I do not
believe that is actually our intent.  Consider the string
"┌┐└┘" (U+250C U+2510 U+2514 U+2518).  An IDNA2008
conversion fails completely because all four characters are
DISALLOWED.  If we then apply the IDNA2003 mappings and Punycode
conversion, we get "xn--lwhimq", which could be looked up.  An
example that is graphically even more interesting would be
"□□□" (U+25A1 U+25A1 U+25A1), which looks suspiciously
like the "no available font/graphic" indicator in many systems.
It, too, is DISALLOWED by IDNA2008 but, for IDNA2003, is
successfully converted by ToASCII to "xn--u0haa".
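
The two IDNA2003 conversions can be reproduced with the "idna" codec
in Python's standard library, which implements the IDNA2003 ToASCII
operation (Nameprep followed by Punycode).  This is offered only as a
way to check the two A-labels quoted above; it says nothing about
how any particular resolver behaves.

```python
# Python's built-in "idna" codec follows IDNA2003 (RFC 3490), so it
# accepts symbol characters that IDNA2008 classifies as DISALLOWED.
labels = ("\u250c\u2510\u2514\u2518", "\u25a1\u25a1\u25a1")

for label in labels:
    a_label = label.encode("idna").decode("ascii")
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in label)
    print(code_points, "->", a_label)
# prints:
# U+250C U+2510 U+2514 U+2518 -> xn--lwhimq
# U+25A1 U+25A1 U+25A1 -> xn--u0haa
```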

One might assume that the proper test is to determine whether
any of the characters in the string is DISALLOWED under
IDNA2008, but that would reject, e.g., "АБВ" (U+0410 U+0411
U+0412), which we presumably want to permit for compatibility.
While that case is easily discovered and permitted, it is less
obvious to me how to differentiate between those characters
(such as the punctuation and symbols) that IDNA2008 intends to
disallow and those that are appropriately mapped.  It appears to
me that, if one applied the IDNA2003 mapping of your Step 3 and
then tested to be sure that no character in the resulting string
was DISALLOWED or failed a required CONTEXT test under IDNA2008,
we would be most of the way there, but I've been unable to
convince myself that check is both necessary and sufficient.
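
The check described in the last sentence above could be sketched in
Python as follows.  Nameprep from the standard library
(encodings.idna.nameprep) serves as the IDNA2003 mapping.  The
function roughly_disallowed_2008 is a hypothetical stand-in that
guesses the IDNA2008 category from the Unicode general category; it
is not the table-driven rule set from Tables, and the CONTEXT tests
are ignored entirely.  The sketch shows only the shape of the test
and does not settle whether it is necessary or sufficient.

```python
import unicodedata
from encodings.idna import nameprep  # IDNA2003 mapping (Nameprep)

# Assumption: treat lowercase/caseless letters, marks, and decimal digits
# as acceptable. The real IDNA2008 rules are table-driven and more subtle.
_ROUGHLY_VALID = {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}


def roughly_disallowed_2008(ch: str) -> bool:
    """Hypothetical approximation of the IDNA2008 DISALLOWED category."""
    if ch == "-":
        return False
    return unicodedata.category(ch) not in _ROUGHLY_VALID


def acceptable_after_mapping(label: str) -> bool:
    """Apply the IDNA2003 mapping, then test every resulting character."""
    try:
        mapped = nameprep(label)
    except UnicodeError:
        return False  # prohibited by Nameprep itself
    return not any(roughly_disallowed_2008(ch) for ch in mapped)


cyrillic = "\u0410\u0411\u0412"  # uppercase letters, rejected before mapping
squares = "\u25a1\u25a1\u25a1"   # symbols, left unchanged by the mapping

print(any(roughly_disallowed_2008(ch) for ch in cyrillic))  # True
print(acceptable_after_mapping(cyrillic))                   # True
print(acceptable_after_mapping(squares))                    # False
```

The uppercase Cyrillic string fails a character-by-character test on
the raw input but passes once the mapping has lowercased it, while
the squares fail either way because the mapping leaves them alone.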

Finally, if we are going to start making specific tests for what
can be mapped and what cannot, it might be worth making a pass
through the various characters mapped by NFKC and their other
properties to see if some filtering would be reasonable.  I
don't know whether we need all of them or whether some might
remain disallowed (on the other hand, I don't know which of them
might be sufficiently harmful or confusing to be worth extra
trouble to ban).

thanks,
    john