U-labels, NFC, and symmetry

Fri Apr 15 18:01:34 CEST 2011

--On Friday, April 15, 2011 08:33 -0600 Peter Saint-Andre
<stpeter at stpeter.im> wrote:

>> This is saying that conversion always produces output that is
>> in NFC. If your question is why you can't replace NFC by NFD
>> in the text above then that is because conversion can produce
>> output that is in NFC but not in NFD. So, "because IDNA
>> requires NFC".
> 
> Right. It's not that all internationalization technologies
> require NFC based on some abstract symmetry requirements, only
> that IDNA2008 requires NFC. (However, other technologies might
> want to use NFC for consistency with IDNA2008 and RFC 5198 --
> that's a separate question.)

Peter,

This will be an incomplete answer -- more when I have more
screen time, or maybe others can fill in, but...

First, the decision to use NFC, rather than NFD, in IDNA and
elsewhere is largely arbitrary.  Not quite arbitrary: the
compactness issues count (although more in some cases than
others) and the "similar to what comes off keyboards" argument
counts, IMO, a bit more.  But, in the last analysis, a choice
had to be made and the choice was NFC.

A better way to say "IDNA requires NFC" is that conversion of a
U-label to a standard Unicode encoding form produces an NFC
string.  It does that because the Punycode algorithm doesn't
normalize, i.e., Punycode(NFC-string) will not be
bit-string-equal to Punycode(NFD-string) unless NFC-string is
bit-string-equal to NFD-string.  I can't remember whether it was
discussed explicitly or not, but therein lies one of the other
reasons for choosing NFC for IDNA: part of the purpose for
developing Punycode was to have an encoding form that was
compact (relative to, e.g., UTF-8) for strings of characters
that were close together in the Unicode table.  Characters built
up from base and combining forms are intuitively less likely to
be compact than precomposed ones (I haven't done distance
calculations on the code point value relationships between sets
of base characters and relevant combining forms, but obviously
single code point characters are at lower risk).  Of course, as
you point out, none of that has much to do with XMPP.
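
As a rough illustration of the "Punycode doesn't normalize" point,
here is a minimal Python sketch.  It assumes Python's built-in
"punycode" codec and the standard unicodedata module; the sample
label is a hypothetical choice, not one taken from this thread.

```python
import unicodedata

# Hypothetical sample label: "cafe" with an acute accent on the "e".
label = "caf\u00e9"

# The same abstract text in the two canonical normalization forms.
nfc = unicodedata.normalize("NFC", label)  # precomposed U+00E9
nfd = unicodedata.normalize("NFD", label)  # "e" followed by U+0301

# Punycode encodes code points as given; it performs no normalization.
puny_nfc = nfc.encode("punycode")
puny_nfd = nfd.encode("punycode")

print(len(nfc), len(nfd))      # code point counts differ
print(puny_nfc, puny_nfd)      # so the encoded forms differ too
print(puny_nfc == puny_nfd)
```

Because the two inputs are different code point sequences, the two
encodings differ, so a label has to be brought into one agreed form
before it is encoded.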

However, I don't understand your argument about comparisons.
You need to compare and need to compare frequently.  Both NFD
and NFC are canonical forms.  Comparing a pair of NFC-strings is
no more expensive or complex than comparing a pair of
NFD-strings (actually, because of the length issue, the NFC
comparisons might be a tad cheaper, but the difference is
presumably small).  If you have a stored string in NFD, an
input string needs to be converted to NFD to be compared with
it. If you have a stored string in NFC, an input string needs to
be converted to NFC to be compared with it.  No significant
difference there.  I do remember Mark Davis telling the IDNABIS
WG that testing for NFC conformance was appreciably less costly
than converting to NFC.  I don't know if the same relationship
would hold for NFD but, again, if the string to be compared to a
stored one comes from a keyboard, it is more likely to be in NFC
form than in NFD form (if an operating system decides to
normalize keyboard input before delivering it to an application,
all bets are off).
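
The comparison pattern described above can be sketched as follows.
This is only a sketch under stated assumptions: it uses Python 3.8 or
later for unicodedata.is_normalized, and the helper name equal_under
is hypothetical.  It says nothing about the actual relative cost of
the conformance check versus the conversion in any given
implementation.

```python
import unicodedata

def equal_under(form, stored, incoming):
    """Compare a stored string, already in `form`, with fresh input.

    `form` is "NFC" or "NFD".  The helper name is hypothetical.
    """
    # Test for conformance first and convert only when the input
    # is not already in the chosen form.
    if not unicodedata.is_normalized(form, incoming):
        incoming = unicodedata.normalize(form, incoming)
    return stored == incoming

# The logic is the same whichever form is used for storage.
stored_nfc = unicodedata.normalize("NFC", "caf\u00e9")
stored_nfd = unicodedata.normalize("NFD", "caf\u00e9")

print(equal_under("NFC", stored_nfc, "cafe\u0301"))  # decomposed input
print(equal_under("NFD", stored_nfd, "caf\u00e9"))   # precomposed input
```

The code is symmetric in the choice of form; the only asymmetry is
how often the conversion step is actually needed, which depends on
what form the input usually arrives in.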

So, while at least some of the particular concerns that drove
the NFC decision for IDNA don't apply to XMPP, it still seems to
me that you haven't made any real case for NFD rather than NFC.
If the choice really is arbitrary, then being different is not
an advantage.

   john