Tonus

Thu Jan 31 21:08:11 CET 2008

Simon,

The accident of a few transcontinental plane flights and an
intense meeting has mostly kept me out of this discussion, but
your comment deserves some followup that I haven't seen...

--On Thursday, 31 January, 2008 13:55 +0100 Simon Josefsson
<simon at josefsson.org> wrote:

> What isn't clear in this thread is that the _reason_ IDNA
> works the way it does is because it chose to use Unicode NFKC
> for normalization.  That isn't something that the Unicode
> specifications required IDNA to do.  I recall discussions of
> which Unicode normalization form to use in the IPR WG, and the
> eventual choice of NFKC was deliberate.  That may or may not
> have been the right choice, but that's water under the bridge.
> So if I understand correctly, to fix this issue, we would need
> to replace NFKC with something else in IDNAbis.

Well, actually the effect on tonus is an NFKC issue and the final
sigma one is a casefolding issue, but both are driven, as you
indicate, by very explicit decisions.  Those particular decisions
were made because of the problem that Patrik describes in
terms of three alternatives.  While I would have described
things differently, a DNS lookup ultimately requires a decision
as to whether two strings match.  To make that decision, the
server can either do a bit-identity comparison on the strings or
needs enough knowledge of the characters to make a more subtle
comparison.  

For ASCII strings, but only for ASCII strings, the DNS does one
of those more subtle comparisons -- a case-insensitive match --
and has done so since the DNS was designed.
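
As a rough illustration of that ASCII-only behavior, here is a small
Python sketch.  The function names are invented for this note and are
not taken from any DNS implementation; the only assumption is that
octets for A-Z are folded and every other octet is compared as-is.

```python
def ascii_fold(label: bytes) -> bytes:
    # Fold only the octets 0x41-0x5A (A-Z); all other octets pass through.
    return bytes(b + 0x20 if 0x41 <= b <= 0x5A else b for b in label)


def dns_labels_match(a: bytes, b: bytes) -> bool:
    # Hypothetical helper: case-insensitive for ASCII letters,
    # bit-identity for everything else.
    return ascii_fold(a) == ascii_fold(b)
```

With this, b"Example" and b"eXAMPLE" match, but the UTF-8 encodings of
an upper case and a lower case E-acute do not, because nothing outside
the ASCII letter range is folded.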

If one is going to rely on that ASCII case-insensitive
comparison on the servers and not require server changes for
IDNs, as IDNA does, then one has to rely on mapping operations
on the client to simulate what one would otherwise expect the
server to do by matching.  And that leads to a decision point,
which is really nothing new for us:

	* One can decide that every character, and every
	character representation, is itself and that, e.g.,
	upper and lower case forms of the same character are
	treated as not matching.  Outside the DNS context, UNIX
	and its direct descendants do that. Windows does not.
	Both have their supporters and advantages and
	disadvantages.

	* One can decide that some pairs of characters are
	really "the same" and map them together.
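
The two choices can be sketched in Python as follows.  This is only an
illustration: map_label is a hypothetical stand-in for a client-side
mapping step (casefolding followed by NFKC), not the actual IDNA
procedure, and the helper names are made up for this note.

```python
import unicodedata


def map_label(label: str) -> str:
    # Assumption: casefold + NFKC stands in for the client-side mapping.
    return unicodedata.normalize("NFKC", label.casefold())


def match_identity(a: str, b: str) -> bool:
    # First choice: every character is only itself.
    return a == b


def match_mapped(a: str, b: str) -> bool:
    # Second choice: characters deemed "the same" are mapped together.
    return map_label(a) == map_label(b)


with_final_sigma = "\u03bf\u03b4\u03bf\u03c2"  # ends in U+03C2, final sigma
with_plain_sigma = "\u03bf\u03b4\u03bf\u03c3"  # ends in U+03C3, sigma
```

Under the first choice the two strings are different names; under the
second they match, because casefolding sends the final sigma to the
ordinary one.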

The difficulty with the second is the information loss about
what the original character was, which is what I was trying to
describe in my earlier, too-hastily-written, note.  But the only
alternative to that loss is matching on the server (you will
recall that the WG explicitly rejected at least one proposal,
from me, that would have permitted server-side matching,
partially for reasons similar to this one).  Unless one matches
on the server, either a pair of code points are the same or they
are different.  They can't be different and still match.
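
To make the information loss concrete, the following Python sketch
groups a few inputs by their mapped form.  As before, map_label is a
hypothetical casefold-plus-NFKC mapping used only for illustration.

```python
import unicodedata
from collections import defaultdict


def map_label(label: str) -> str:
    # Assumption: casefold + NFKC stands in for the client-side mapping.
    return unicodedata.normalize("NFKC", label.casefold())


inputs = ["fi", "FI", "\ufb01", "\u03c3", "\u03c2", "\u03a3"]

# Group each original string under the single form it maps to.
groups = defaultdict(list)
for s in inputs:
    groups[map_label(s)].append(s)
```

Six distinct inputs collapse into two mapped forms.  Given only the
mapped form, there is no way to tell which of the originals the user
typed.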

Certainly, the WG could have taken a middle position, e.g.,
applied NFC or NFD but not NFKC and casefolding.  As you say,
they explicitly decided to do what they did.  The other choice
would have had some advantages but would have caused a different
set of problems than the ones we are now identifying.  The WG
examined the tradeoffs and made a fairly informed decision about
which set of problems to live with.   Making the change to a
different model would be massively disruptive of existing
registrations, but moving the mappings out of the protocol in
IDNA200X is very much intended to permit (and encourage) more
flexibility on the part of implementations to do mappings that
are appropriate locally.
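
For readers who want to see how the middle position differs from what
was chosen, here is a short Python comparison using the standard
unicodedata module.  The sample characters are my own picks for
illustration and do not come from the WG discussion.

```python
import unicodedata


def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)


def nfkc(s: str) -> str:
    return unicodedata.normalize("NFKC", s)


decomposed_e_acute = "e\u0301"  # "e" followed by combining acute accent
fi_ligature = "\ufb01"          # LATIN SMALL LIGATURE FI
fullwidth_a = "\uff21"          # FULLWIDTH LATIN CAPITAL LETTER A

# Both forms compose the accent into a single code point.
composed_by_both = nfc(decomposed_e_acute) == nfkc(decomposed_e_acute) == "\u00e9"

# Only NFKC replaces compatibility characters.
ligature_kept_by_nfc = nfc(fi_ligature) == fi_ligature
ligature_split_by_nfkc = nfkc(fi_ligature) == "fi"

# Fullwidth A survives NFC; NFKC plus casefolding takes it to "a".
fullwidth_after_both = nfkc(fullwidth_a).casefold()
```

NFC alone would have kept the ligature and the fullwidth letter as
distinct characters, which is exactly the kind of different problem
set mentioned above.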

> (Fwiw, for non-DNS purposes of string preparation, the choice
> of NFKC is not so clearly the best choice.)

Of course.  If NFKC were always appropriate, there would have
been no reason to have the compatibility characters in Unicode
in the first place.   You will notice that IDNA200X uses NFKC
only to classify characters but not in the protocol, that
draft-klensin-net-utf8 does not mandate its use at all, and so
on.  It is a pretty drastic mapping in many ways, but --at least
in the view of the original IDN WG-- appropriate in the context
of a "should these characters match" question.
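
One way to picture "classify but not map" is a stability test: a
character counts as stable if normalizing and casefolding leave it
unchanged.  The Python sketch below assumes that particular rule
(NFKC, then casefold, then NFKC again) purely for illustration; the
function name is hypothetical and the drafts themselves are the
authority on the actual classification.

```python
import unicodedata


def is_stable(ch: str) -> bool:
    # Assumed rule: ch is stable if NFKC(casefold(NFKC(ch))) == ch.
    once = unicodedata.normalize("NFKC", ch)
    folded = once.casefold()
    return unicodedata.normalize("NFKC", folded) == ch


stable = [c for c in ["a", "A", "\u03c3", "\u03c2", "\ufb01"] if is_stable(c)]
```

Here only "a" and the ordinary sigma (U+03C3) come out stable.  The
classifier never rewrites anything the user typed; it only reports
which characters a mapping would have changed.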

best,
     john