U-labels, NFC, and symmetry
John C Klensin
klensin at jck.com
Fri Apr 8 14:37:09 CEST 2011
--On Friday, April 08, 2011 08:07 -0400 Andrew Sullivan
<ajs at shinkuro.com> wrote:
> On Thu, Apr 07, 2011 at 03:59:05PM -0600, Peter Saint-Andre wrote:
>> to do so without requiring a trip through NFC. However, it
>> appears that we can do this only by using a term other than
>> U-label, since that is tied to NFC.
> Yes. U
>> Indeed, it seems that a string in Unicode NFD normalized
>> form is not an IDN label at all.
>> This strikes me as unfortunate (I
>> thought that normalization was handled only in RFC 5895 along
>> with other such mapping issues), but probably because I do
>> not understand how the symmetry requirement expressed in RFC
>> 5890 necessitates the use of NFC.
> I suppose in principle it could have used NFD instead, but it
> needed to be one or the other, because the result in principle
> ought to be binary equivalent. I forget, however, exactly why
> we decided to prefer NFC (if I ever knew).
Peter, borrowing from a note I sent you privately yesterday (now
that Andrew, Vint, and Martin have laid the foundation)...
NFC was chosen early in the IDNA2003 cycle because it generally
produces results that are more compact and more intuitive for
people using systems within their own environment. Using the
notorious "o with diaeresis" as an example, NFC produces the
single-character form (U+00F6), which also appears as F6 in ISO
8859-1/Latin-1, etc. NFD, by contrast, produces "o plus
combining diaeresis" and, in general (probably without
exception, but I can't trust my memory and can't check right
now) the length of NFD strings is equal to or longer than their
NFC equivalents. As you presumably know, NFD has some
processing and predictability advantages, but naturalness in
normal environments is not one of them.
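
To make the "o with diaeresis" comparison concrete, here is a minimal
sketch using Python's standard unicodedata module. The variable names
are purely illustrative, and the single example does not prove the
general length claim; it only shows the one case discussed above.

```python
import unicodedata

composed = "\u00f6"      # o with diaeresis as a single precomposed character
decomposed = "o\u0308"   # o followed by U+0308 COMBINING DIAERESIS

# Normalize each form into the other.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

print(len(nfc), len(nfd))                  # 1 2
print(nfc == composed, nfd == decomposed)  # True True

# The NFC form maps directly onto the single Latin-1 byte F6.
print(nfc.encode("latin-1"))               # b'\xf6'
```

The NFD form cannot be encoded in Latin-1 at all, since the combining
diaeresis has no code point there.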
>> In the meantime, I shall pursue a way to specify XMPP
>> domainparts independently of the term U-label.
> Hrm. Don't the domainparts have to be usable in domain name
> slots? If so, then specifying as NFD means that they must
> _always_ be transformed to be used as part of an IDNA lookup
> (or to go into the IDNA2008 transformation, because before you
> get to that you have to have a U-label). Are you sure that's
> what you want?
Yes. To repeat what was said in the other notes, whether the
original decision to go with NFC was optimal or not, there are
disadvantages of being different that I'd hope XMPP would review
carefully. U-labels and IDN domain slots are NFC (as were
IDNA2003 labels), RFC 5198 recommends NFC and, as we know,
strings get copied and moved around by dumb software; NFC is
what one gets from most raw typing that goes directly to
Unicode; and, because of history and the preference for
precomposed characters in most of the 8bit CCSs (ISO 8859 and
friends), most conversions from other systems to Unicode tend to
yield NFC. So, if it isn't too late, please review this and
decide whether you really need NFD.
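
As a rough illustration of the conversion point, the sketch below
decodes a Latin-1 byte string and checks whether the result is already
in NFC. The helper is_nfc and the sample string are assumptions made
for this example; neither comes from any IDNA or XMPP specification,
and this is not a U-label validity check.

```python
import unicodedata

def is_nfc(label: str) -> bool:
    """Return True if normalizing to NFC leaves the string unchanged."""
    return unicodedata.normalize("NFC", label) == label

# A string converted from an 8-bit CCS: Latin-1 byte F6 becomes U+00F6.
from_latin1 = b"b\xf6se".decode("latin-1")

# The same string as an NFD-based specification would carry it.
nfd_form = unicodedata.normalize("NFD", from_latin1)

print(is_nfc(from_latin1))   # True
print(is_nfc(nfd_form))      # False

# An NFD string must be converted back before it can match the NFC form.
print(unicodedata.normalize("NFC", nfd_form) == from_latin1)   # True
```

The last line is the extra transformation step Andrew describes: a
string held in NFD has to be renormalized every time it goes into a
domain name slot.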
If there is real conformity with the rules, an XMPP standard
that designates NFD should have that explanation anyway because
use of NFD violates a SHOULD in RFC 5198.