Changing the xn-- prefix

Mon Mar 24 21:49:33 CET 2008

--On Monday, 24 March, 2008 10:48 -0700 Shawn Steele
<Shawn.Steele at microsoft.com> wrote:

> John Wrote:
> 
>> I certainly agree that changing the prefix would be a terrible
>> idea...
> 
>> In particular, I would welcome suggestions from either of you
>> (or others) about how the "prefix change" material in Section
>> 9.3 of draft-klensin-idnabis-issues-07 can be improved.
> 
> The last sentence, "Consequently, a prefix change is to be
> avoided if at all possible, even if it means accepting some
> IDNA2003 decisions about character distinctions as
> irreversible," should be a statement in the first part of the
> section :) e.g., "A prefix change MUST be avoided if at all
> possible."

"MUST" is a term of art around the IETF and cannot be used in a
conditional statement like that.  But I can try to strengthen
that sentence.

> The "conditions requiring a prefix change" and the
> "implications of prefix changes" should perhaps be correlated
> with each other.  I would also go farther and come up with
> requirements based on these implications.
> 
> 9.3.1 1) & 2) The conversion of an A-label to Unicode (i.e., a
> U-label) yields one string under IDNA2003 (RFC3490) and a
> different string under IDNA200X.
> - Therefore the punycode conversion MUST remain the same for
> all code point sequences that aren't prohibited in the updated
> version.
> 
> 3) A fundamental change is made to the semantics of the string
> that is inserted in the DNS, e.g., if a decision were made to
> try to include language or specific script information in that
> string, rather than having it be just a string of characters.
> - Therefore such semantics can't change.
> 
> 4) A sufficiently large number of characters is added to
> Unicode so that the Punycode mechanism for block offsets no
> longer has enough capacity to reference the higher-numbered
> planes and blocks.  This condition is unlikely even in the long
> term and certain not to arise in the next few years.
> - Therefore those characters, should they be added, will be
> illegal in the punycode forms of the names.
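
To make the A-label/U-label round trip in conditions 1) and 2)
concrete, here is a minimal sketch.  It assumes Python's built-in
"punycode" codec, which implements the RFC 3492 algorithm; the
helper names to_a_label and to_u_label are hypothetical, and no
validation or mapping of the input is attempted.

```python
# Illustration only: helper names are hypothetical, and the input
# is assumed to be an already-validated, lower-case label.
ACE_PREFIX = "xn--"

def to_a_label(u_label: str) -> str:
    """Encode a Unicode label as prefix + punycode."""
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")

def to_u_label(a_label: str) -> str:
    """Strip the prefix and decode the punycode back to Unicode."""
    if not a_label.lower().startswith(ACE_PREFIX):
        raise ValueError("not an A-label")
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")

print(to_a_label("bücher"))         # xn--bcher-kva
print(to_u_label("xn--bcher-kva"))  # bücher
```

As long as the encoding step is unchanged, the same prefix keeps
decoding to the same string for every label that remains valid.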

These conditions are, IMO, a bit too strong.  IDNA2003 made the
unfortunate decision, at least in retrospect, of requiring
lookups for unassigned codepoints, codepoints that might, in
principle, be banned by the logic of IDNA200X (either because
they are assigned to compatibility characters or, more
important, because they are assigned to upper-case characters
that would be mapped to lower-case ones (or something else) by
the new rules).  More important, it discarded the two "joiner"
characters which appear to be absolutely critical for use with a
few scripts and languages.

We believe that the changes needed to adapt to both of those
issues -- restricting lookups to assigned codepoints and
treating ZWJ and ZWNJ as valid characters (carried forward into
the punycode) -- can be accomplished without changing the
prefix.  Zone administrators ("registries") who wish to use
those characters will have to think carefully about transition
strategies, but that is nothing new.
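
A small sketch of the joiner point, assuming Python's built-in
"idna" codec (which follows IDNA2003/RFC 3490) and its "punycode"
codec.  The label is artificial and chosen only for illustration;
real uses of ZWNJ occur in specific scripts.

```python
ZWNJ = "\u200c"            # ZERO WIDTH NON-JOINER
label = "a" + ZWNJ + "b"   # artificial label, illustration only

# IDNA2003 processing maps the joiner to nothing, so it is lost.
print(label.encode("idna"))   # b'ab'

# The punycode encoding itself can carry the joiner unchanged,
# with the same "xn--" prefix in front.
ace = "xn--" + label.encode("punycode").decode("ascii")
restored = ace[len("xn--"):].encode("ascii").decode("punycode")
print(restored == label)      # True
```

In other words, what discards the joiner is the IDNA2003 mapping
step, not the encoding or the prefix.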

Similar comments apply to the three European characters (German
Sharp S and Greek Final Sigma and Tonos) that have caused such
extensive discussion on this list.   While the transition would
be difficult -- difficult enough that I'd expect some registries
would refuse to accept registrations in those characters whose
interpretation had changed -- it is a transition that could be
handled without a prefix change if we decided that the
advantages of accommodating those characters outweighed the
disadvantages (including the possible pain of transition).
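
To illustrate why that transition would be painful, the sketch
below shows how IDNA2003 treats two of those characters today.
It assumes Python's built-in "idna" codec, which implements the
IDNA2003 rules; the sample labels are only examples.

```python
# Under IDNA2003, Sharp S and Final Sigma are mapped away before
# encoding, so each pair below ends up as the same DNS label.
pairs = [("straße", "strasse"), ("ς", "σ")]

for original, mapped in pairs:
    same = original.encode("idna") == mapped.encode("idna")
    print(original, mapped, same)   # True for both pairs
```

If those characters were treated as valid in their own right, the
first member of each pair would start producing a different label
than it does now, which is the change registries would have to
manage.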

So, while we agree about the prefix issue, I'd hate to write
rules, intentionally or not, that would prevent necessary
changes that did not require a prefix change (even if a prefix
change would make those changes locally easier while making
everything else harder).

> Additionally I should note that I consider the entire punycode
> behavior to be a hack to make a Unicode name be usable in a
> non-Unicode enabled DNS system.  Additional prefixes or
> encodings make it much more difficult for the clients to
> figure out what the "best" behavior is for a particular
> scenario.  Should they try older prefixes?  Will that cause
> security problems such as spoofing and phishing?  If they
> don't, what about the mom & pop domains that "stop working"
> because they didn't know to update their registrations?  What
> about the mom & pop shops that are too slow and someone else
> registers their updated name first?
> 
> I would prefer that any updates to this hack be along the
> lines of allowing UTF-8 in the DNS system as that is certainly
> a long term complete solution.  UTF-8 is additionally much more
> in alignment with RFC 2277 which states that "UTF-8 support
> MUST be possible" (for all protocols).  IDN would then
> "merely" be the stringprep normalization, filtering and
> matching rules for such names and perhaps security guidelines
> for using such names.

Well, the easy answer to your suggestion is that the horse has
long since left the barn.  Others can explain this better than I
can (and will, I hope, contribute text for the "alternatives"
draft I started a month ago), but note that:

(i) The phishing and race condition problems with the DNS are
well-known today.  Switching from punycode to UTF-8, or having
made a UTF-8 choice in the first place, would not change them in
any qualitative way.

(ii) The fact that the DNS has different case-matching rules for
octets whose first bit is zero than for octets whose first
bit is one would yield some user-astonishing matching, where
ASCII characters in a mixed ASCII/non-ASCII label would be
matched case independently while the non-ASCII characters would
be matched only exactly. There are ways around that, but they
appear to require a DNS extension (or perhaps a new Class) and
significant work on the server.

(iii)  While it is efficient for ASCII and most western/northern
alphabets, UTF-8 is arguably pathological for East Asian scripts
and any character codes outside Plane 0 (the BMP) just because
of the number of octets required to encode a given character.
When this was discussed several years ago, the greater
information encoding efficiency of ideographic scripts was
argued to offset the difference in per-character lengths, but
the WG was not persuaded by that analysis.  Had it been
persuaded, we presumably would still have ended up with an ACE,
but it almost certainly would have been a straight hexification
encoding of UTF-8.

(iv) While the IDNA200X proposal, with its "no mapping" rules,
helps move us closer to what you are looking for, there are
tricky issues about mapping, case-folding, and normalization
(including the side-effects of systems that normalize
automatically and others that don't) that would require either
server-side validation or just about the same lookup
restrictions and handling rules that are built into the IDNA200X
proposal, so just changing the encoding from punycode to UTF-8
really would not accomplish much.
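
Points (ii) and (iii) can be sketched as follows.  This is a
simplified model, not server code: the function dns_label_equal
is hypothetical and only imitates the classic rule of folding
case for ASCII letters while comparing all other octets exactly.

```python
def dns_label_equal(a: bytes, b: bytes) -> bool:
    """Compare labels, folding case only for ASCII A-Z octets."""
    def fold(octets: bytes) -> bytes:
        return bytes(o + 32 if 0x41 <= o <= 0x5A else o for o in octets)
    return fold(a) == fold(b)

# (ii) In raw UTF-8, the ASCII letters match case-independently
# but the non-ASCII letter must match exactly.
print(dns_label_equal("Bücher".encode("utf-8"),
                      "bücher".encode("utf-8")))   # True
print(dns_label_equal("bÜcher".encode("utf-8"),
                      "bücher".encode("utf-8")))   # False

# (iii) Octets per label in UTF-8: characters vs. encoded length.
for text in ["example", "bücher", "日本語"]:
    print(text, len(text), len(text.encode("utf-8")))
```

The last loop prints 7/7, 6/7 and 3/9 (characters/octets) for the
three sample labels, which is the per-character cost referred to
in (iii).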

Finally, if your argument against a change in prefix is to avoid
heuristics about what to do to find a given string, changing to
UTF-8 encoding would be at least as bad, and probably worse,
than going to a different ACE prefix.

    john

> 
> - Shawn
>