U-labels, NFC, and symmetry

Peter Saint-Andre stpeter at stpeter.im
Fri Apr 15 15:19:34 CEST 2011


On 4/8/11 6:37 AM, John C Klensin wrote:
> 
> 
> --On Friday, April 08, 2011 08:07 -0400 Andrew Sullivan
> <ajs at shinkuro.com> wrote:
> 
>> On Thu, Apr 07, 2011 at 03:59:05PM -0600, Peter Saint-Andre
>> wrote:
>>
>>> to do so without requiring a trip through NFC. However, it
>>> appears that we can do this only by using a term other than
>>> U-label, since that is tied to NFC.
>>
>> Yes.  U
>>
>>> Indeed, it seems that a string in Unicode NFD normalized
>>> form is not an IDN label at all.
>>
>> Correct.
>>
>>> This strikes me as unfortunate (I
>>> thought that normalization was handled only in RFC 5895 along
>>> with other such mapping issues), but probably because I do
>>> not understand how the symmetry requirement expressed in RFC
>>> 5890 necessitates the use of NFC.
>>
>> I suppose in principle it could have used NFD instead, but it
>> needed to be one or the other, because the result in principle
>> ought to be binary equivalent.  I forget, however, exactly why
>> we decided to prefer NFC (if I ever knew).
> 
> Peter, borrowing from a note I sent you privately yesterday (now
> that Andrew, Vint, and Martin have laid the foundation)...
> 
> NFC was chosen early in the IDNA2003 cycle because it generally
> produces results that are more compact and more intuitive for
> people using systems within their own environment. Using the
> notorious "o with diaeresis" as an example, NFC produces the
> single-character form (U+00F6), which also appears as F6 in ISO
> 8859-1/Latin-1, etc.  NFD, by contrast, produces "o plus
> combining diaeresis" and, in general (probably without
> exception, but I can't trust my memory and can't check right
> now) the length of NFD strings is equal to or longer than their
> NFC equivalents.   As you presumably know, NFD has some
> processing and predictability advantages, but naturalness in
> normal environments is not one of them.   
> 
>>> In the meantime, I shall pursue a way to specify XMPP
>>> domainparts independently of the term U-label.
>>
>> Hrm.  Don't the domainparts have to be usable in domain name
>> slots? If so, then specifying as NFD means that they must
>> _always_ be transformed to be used as part of an IDNA lookup
>> (or to go into the IDNA2008 transformation, because before you
>> get to that you have to have a U-label).  Are you sure that's
>> what you want?
> 
> Yes.  To repeat what was said in the other notes, whether the
> original decision to go with NFC was optimal or not, there are
> disadvantages of being different that I'd hope XMPP would review
> carefully.  U-labels and IDN domain slots are NFC (as were
> IDNA2003 labels), RFC 5198 recommends NFC and, as we know,
> strings get copied and moved around by dumb software; NFC is
> what one gets from most raw typing that goes directly to
> Unicode; and, because of history and the preference for
> precomposed characters in most of the 8bit CCSs (ISO 8859 and
> friends), most conversions from other systems to Unicode tend to
> yield NFC.  So, if it isn't too late, please review this and
> decide whether you really need NFD.
> 
> If there is real conformity with the rules, an XMPP standard
> that designates NFD should have that explanation anyway because
> use of NFD violates a SHOULD in RFC 5198.

It's certainly not too late. Given the SHOULD in RFC 5198, the XMPP
community might well end up conforming to the NFC recommendation.
However, in the interest of understanding the problem space, please
allow me to explain the reasoning a bit further.

In XMPP, we compare identifiers a lot more often than we perform DNS
lookups. NFC results in strings that are slightly more compact than
those produced by NFD, and I can see why that's useful for DNS lookups,
but the space constraints of DNS are mostly immaterial for XMPP. What we
care about most is processing requirements, and in current XMPP servers
stringprep processing is a huge hotspot. The main reason is that an XMPP
server needs to compare identifiers all the time. For example, a server
is required to check every "stanza" (message) that it receives from a
peer server to ensure that the 'from' address is valid. It does this by,
among other things, comparing the domainpart of the 'from' address with
the domainpart it has validated for the long-lived XML stream over which
it received the stanza. The server needs to perform a DNS lookup only
once (at the time the stream is established), whereas it needs to
perform a string-compare for every incoming stanza to prevent address
spoofing. (I would think that email servers, or associated processes
such as procmail, need to perform similar operations, but perhaps that
hasn't hit home yet in the email community because i18n addresses are
not yet widespread.) Given that a long-lived XML stream in XMPP might be
up for days or weeks or months in the case of server-to-server
federation, the ratio of string compares (potentially using NFD, but
right now using NFKC in our stringprep approach from RFC 3920) to DNS
lookups (the only time NFC would be required) approaches infinity.
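
To make the comparison pattern concrete, here is a minimal Python sketch.
It is an illustration only, not how any XMPP server is actually
implemented: the Stream class, its accepts() method, and the example
domainparts are all hypothetical, and the sketch applies plain Unicode
normalization rather than the full stringprep profile from RFC 3920.

```python
import unicodedata

# "ö" written two canonically equivalent ways (example strings are made up).
composed = "b\u00f6se.example"      # U+00F6, the precomposed form NFC yields
decomposed = "bo\u0308se.example"   # "o" + U+0308, the form NFD yields


class Stream:
    """Hypothetical stand-in for a validated server-to-server XML stream."""

    def __init__(self, validated_domainpart: str, form: str = "NFC") -> None:
        self.form = form
        # Normalize once, when the stream is established.
        self.domainpart = unicodedata.normalize(form, validated_domainpart)

    def accepts(self, from_domainpart: str) -> bool:
        # Per-stanza check: normalize the incoming value, then compare.
        candidate = unicodedata.normalize(self.form, from_domainpart)
        return candidate == self.domainpart


print(len(composed), len(decomposed))   # code point counts of the two forms
print(composed == decomposed)           # raw comparison, no normalization
for form in ("NFC", "NFD"):
    print(form, Stream(composed, form).accepts(decomposed))
```

The two spellings differ by one code point and do not compare equal as
raw strings, but they match under either form as long as both sides use
the same one. The sketch says nothing about which form is cheaper to
compute; that is the open question for us.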

As you can see, being able to reduce the processing requirements for
string compares turns out to matter quite a bit in the XMPP world. That
doesn't mean we are wedded to NFD (and all of this discussion is
exploratory right now anyway), but I wanted to gain a better
understanding of why NFC is considered necessary. Unfortunately I still
don't have that understanding because I don't know what the symmetry
argument really means in RFC 5890:

   To be valid, U-labels and A-labels must obey an important symmetry
   constraint.  While that constraint may be tested in any of several
   ways, an A-label A1 must be capable of being produced by conversion
   from a U-label U1, and that U-label U1 must be capable of being
   produced by conversion from A-label A1.  Among other things, this
   implies that both U-labels and A-labels must be strings in Unicode
   NFC [Unicode-UAX15] normalized form.

How is it that NFC meets the symmetry requirement, but NFD does not?
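
To show where the question comes from, here is a small Python sketch
using the raw Punycode codec from the standard library. The helper names
to_a_label() and to_u_label() are hypothetical, and the sketch performs
none of the validation that RFC 5890 and the other IDNA2008 documents
require, so its output is not necessarily a valid A-label or U-label. It
only looks at the conversion step in isolation.

```python
import unicodedata


def to_a_label(label: str) -> str:
    # Raw Punycode plus the ACE prefix; no IDNA2008 checks are applied.
    return "xn--" + label.encode("punycode").decode("ascii")


def to_u_label(a_label: str) -> str:
    # Strip the "xn--" prefix and decode the rest as Punycode.
    return a_label[4:].encode("ascii").decode("punycode")


nfc = unicodedata.normalize("NFC", "b\u00fccher")   # "ü" as U+00FC
nfd = unicodedata.normalize("NFD", "b\u00fccher")   # "u" + U+0308

for name, label in (("NFC", nfc), ("NFD", nfd)):
    a_label = to_a_label(label)
    round_trip_ok = to_u_label(a_label) == label
    print(name, a_label, round_trip_ok)
```

In this sketch both forms survive the trip to ASCII and back unchanged,
although they produce different ASCII strings. That suggests the
conversion step alone does not single out NFC, which is why I am asking
what else the symmetry constraint depends on.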

Thanks for your patience. :)

Peter

-- 
Peter Saint-Andre
https://stpeter.im/


