Changing the xn-- prefix (was: Re: Which RFCs the new work would obsolete, vs update or leave alone)

John C Klensin klensin at jck.com
Tue Mar 18 21:04:33 CET 2008



--On Tuesday, 18 March, 2008 11:15 -0700 Mark Davis
<mark.davis at icu-project.org> wrote:

> Changing the prefix would be really nasty. For folks like us
> at Google, it is important to have a canonical form for URLs,
> and then map to the on-the-wire form. When we have a domain
> name with Unicode characters in it, which do we pick when we
> want to go out on the wire? You would really have to have very
> strict enforcement of a policy on registries that, wherever
> both the IDNA2003 and IDNA200x forms were valid, both lead to
> the same location. Is that really practical?

Mark, one would "merely" need to have a strict order of testing
and to be very careful. Domains that wanted to maximize
efficiency would create new ACEs to correspond to the old ones,
but, by and large, I'd expect registrations of old ACE-prefix
forms to stop once registrations of new ones started and there
was any plausible degree of support for the new protocol.
Applications doing lookups would need to support both prefix
forms, probably for a rather long time.  Beyond that, the
problems are not much worse than the updating lags that are a
normal part of DNS operation.

That does not describe a conversion that is easy, or efficient,
or quick, but it does describe one that is feasible without flag
days and without unenforceable (and unsupportable) rules about
target identities. 

It also seems to me that your question isn't quite right because
you have two choices.   One is to do what everyone else would
have to do, which is to try the "new" ACE translation and lookup
first and, if it failed, to try the "old" ACE translation and
lookup.  If one did that, there would be considerable advantage
in retaining the ACE form under which a document was found
(instead of, or in addition to, the native character form).
And the other would be, as I think Simon was suggesting, to
simply use the ACE form throughout, translating back to native
form only for presentation.
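
To make the first of those choices concrete, here is a rough
sketch in Python of the lookup order I have in mind.  The helper
names and the "zz--" prefix are purely illustrative placeholders
(no new prefix has been chosen), and Python's built-in codecs
only approximate the IDNA2003 operations:

    import socket

    def to_old_ace(label):
        # IDNA2003-style ToASCII; Python's built-in "idna" codec
        # is only a rough approximation of it.
        return label.encode("idna").decode("ascii")

    def to_new_ace(label):
        # Stand-in for an IDNA200x conversion under a changed
        # prefix; "zz--" is purely illustrative.
        return "zz--" + label.encode("punycode").decode("ascii")

    def lookup(native_label, rest_of_name):
        # Try the new ACE form first, then fall back to the old one.
        for to_ace in (to_new_ace, to_old_ace):
            fqdn = to_ace(native_label) + "." + rest_of_name
            try:
                addrs = socket.getaddrinfo(fqdn, None)
            except (socket.gaierror, UnicodeError):
                continue
            # Retain the ACE form that actually resolved, instead
            # of (or in addition to) the native-character form.
            return fqdn, addrs
        raise LookupError("not found under either prefix")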

To me, the issue here is very similar to whether or not mapping
should be required, and I think we may have agreed to disagree
about that.  I think we are better off with an IDN model that is
as clean, and variation-free, as possible.  If we don't go all
the way to Simon's solution of always transmitting the ACE form
(and that implies "transmitting in files", not just
"transmitting for lookup"), then I believe that the only
native-character forms that should be passed around should be
U-labels or, in IDNA2003 terminology, ones that can be obtained
by applying ToUnicode to the ACE form, not the forms that can be
mapped out.
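
In other words (again only a sketch, using the built-in idna
codec as a stand-in for ToASCII/ToUnicode), a native-character
label is fit to pass around only if it round-trips through its
own ACE form unchanged:

    def is_u_label(label):
        # Accept a native-character form only if it is exactly what
        # ToUnicode gives back for its own ACE form, i.e. it is
        # already a U-label rather than a form that merely maps to one.
        try:
            ace = label.encode("idna")           # ToASCII (approximation)
            return ace.decode("idna") == label   # ToUnicode round trip
        except UnicodeError:
            return False

    # is_u_label("b\u00fccher") is True; is_u_label("B\u00fccher") is
    # False, because the upper-case form only reaches the same ACE
    # via mapping.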

The amount of energy that has gone into debating Eszett (and
Final Sigma before it, and Eszett before that) is part of the
motivation for the "no mapping" approach, an approach that I
believe should extend all the way out to the user (with mappings
supported for compatibility purposes only).  If those
characters, and others, had merely been banned in IDNA2003, then
we would be having a relatively simple discussion now about
whether to (continue to) disallow them or to permit them as
regular characters.  That discussion would not necessarily be an
easy one, since it involves all of the issues about what is
equivalent and what is not and so on.  But the discussion would
not be complicated by the compatibility issues associated with a
change in interpretation of characters brought on by the
IDNA2003 theory of trying to give an interpretation to almost
everything (including unassigned code points... and I know I
still owe the list an explanation about that).

So, to me, while I favor having a clear explanation of what
mappings are needed to assure maximum IDNA2003 compatibility,
I'd prefer to see those mappings applied on a "try the name
unmapped and then, if that lookup fails, try mapping" basis,
rather than mapping always.  The former permits us to move
forward (and to find registrations of characters that IDNA2003
maps out if we decide to make them Protocol-Valid).  The latter,
it seems to me, could easily lock us into most of the problems
of IDNA2003 through the back door of making it impossible to
find newly-allocated characters (whether allocated via a changed
decision or via a new version of Unicode).
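
Again as a rough sketch only: the NFKC-plus-lowercase mapping
below is a stand-in for whatever compatibility mapping is finally
specified, and the Punycode conversion deliberately does no
mapping of its own, so that "unmapped" really means unmapped:

    import socket
    import unicodedata

    def to_ace_unmapped(label):
        # Punycode plus the xn-- prefix, with no mapping step.
        if all(ord(c) < 128 for c in label):
            return label
        return "xn--" + label.encode("punycode").decode("ascii")

    def lookup_map_on_failure(label, rest_of_name):
        mapped = unicodedata.normalize("NFKC", label).lower()
        for candidate in (label, mapped):        # unmapped form first
            fqdn = to_ace_unmapped(candidate) + "." + rest_of_name
            try:
                return fqdn, socket.getaddrinfo(fqdn, None)
            except (socket.gaierror, UnicodeError):
                continue
        raise LookupError("not found, with or without mapping")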
 
I am violently opposed to changing the prefix if we can possibly
avoid it (and am working on better text for "Rationale"), but I
don't think that exaggerating the problem, intentionally or
otherwise, helps us move forward.   And it appears to me that we
are likely to need at least some "try this, then try that"
logic, no matter how trivial, if there are any substantive
changes between IDNA2003 and IDNA200X.   The issue is getting
that logic right and keeping it as limited as possible, rather
than whether we have it or not.

      john


