Nameprep and NFKC

Thu Oct 14 00:40:04 CEST 2010

------------- Begin Forwarded Message -------------

Date: Wed, 13 Oct 2010 15:38:15 -0700 (PDT)
From: Kenneth Whistler <kenw at atlantis-new.sybase.com>
Subject: RE: Nameprep and NFKC
To: klensin at jck.com
Cc: idna-update at lavestrand.no, aghadir at citc.gov.sa, kenw at birdie.sybase.com
Content-MD5: /NzXA/Q8BXFRVZTVsbaI/g==

John Klensin responded to Shawn Steele's comment:

> > Many people would use Unicode UTS#46 in addition to the
> > IDNA2008 RFCs:
> > 
> > http://www.unicode.org/reports/tr46/ 
> >...
> 
> Yes, and some of the tables Abdulrahman's note referred to are
> dependent on UTR 46.  But this is exactly where mapping gets us
> into trouble:
> 
> -- Use of UTR 46 with IDNA2008 reduces some incompatibilities
> with IDNA2003 and may cause a few others.   If I correctly
> understand Abdulrahman's example, it may be one of those
> incompatibilities.

It is not.

Abdulrahman's example is a straightforward case of NFC
normalization.

0623 is canonically equivalent to the sequence <0627, 0654>
and by the rules of Unicode normalization, a string <0623>
is in NFC form, but the canonically equivalent string 
<0627, 0654> is not.
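
As an illustration, the equivalence can be checked with a minimal
sketch using Python's standard unicodedata module. The helper name
is_nfc and the variable names are assumptions made for this example,
not part of any RFC or of UTS #46:

```python
import unicodedata

composed = "\u0623"          # ARABIC LETTER ALEF WITH HAMZA ABOVE
decomposed = "\u0627\u0654"  # ARABIC LETTER ALEF + ARABIC HAMZA ABOVE

def is_nfc(s: str) -> bool:
    """Hypothetical helper: true if s is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", s) == s

composed_is_nfc = is_nfc(composed)      # <0623> is already in NFC form
decomposed_is_nfc = is_nfc(decomposed)  # <0627, 0654> is not
# Normalizing the sequence to NFC composes it into the single code point.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
```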

As Patrik pointed out, "IDNA2008 require the string to be in NFC form".
More specifically, Section 5.2 says that the result of the processing
described there MUST be a Unicode string in NFC form. And John also
echoed this:

> -- IDNA2008 itself (RFC 5890-5893) are very clear that input
> (and U-labels) must already be in NFC form.  

So Abdulrahman's example and question indicate to me a
problem with an implementation -- not with either the
RFCs or mapping.

Using a string <..., 0627, 0654, ...> as input to IDNA2008
processing should simply result in an error, not in a different
(but presumed valid) Punycode string from the one produced for
the same string mapped to <..., 0623, ...>.
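
A rough sketch of the difference, assuming Python's built-in
"punycode" codec (which produces the raw encoding, without the
"xn--" prefix) and a hypothetical check_nfc function standing in for
the NFC requirement. The two-letter labels are invented for
illustration and are not Abdulrahman's actual example:

```python
import unicodedata

def check_nfc(label: str) -> str:
    """Hypothetical strict check: reject any label not already in NFC."""
    if unicodedata.normalize("NFC", label) != label:
        raise ValueError("label is not in NFC form")
    return label

composed = "\u0623\u0628"          # <0623, 0628>
decomposed = "\u0627\u0654\u0628"  # <0627, 0654, 0628>

# An implementation that skips the check encodes the raw code points,
# so the two canonically equivalent inputs give two different results.
puny_composed = composed.encode("punycode")
puny_decomposed = decomposed.encode("punycode")
outputs_differ = puny_composed != puny_decomposed
```

With the check in place, check_nfc(decomposed) raises ValueError
instead of yielding a second, unrelated encoding.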

But as a response to Shawn's comment, the following seems
to me to be a non sequitur:

> -- While "many people" would use UTR 46, "many people" may use
> RFC 5895 instead, or no mapping at all.  They are not compatible.

The way I think Abdulrahman's example should be taken is
that local expectation (and every Unicode implementer's
expectation, for that matter) is that an input string
<..., 0627, 0654, ...> and an input string <..., 0623, ...>
SHOULD be taken as behaving equivalently, because they
are canonically equivalent, after all. You cannot satisfy
that local expectation by simply feeding both strings to
a conformant implementation of RFC 5891 unchanged, as noted
above. You would get an error for the input which is not
in NFC form.

But the answer here is to be found in the rationale document,
RFC 5894:

"In principle, an application ought to take user input of a
domain name and convert it to the set of Unicode code points
that represent the domain name the user intends. As a practical
matter, of course, determining user intent is a tricky business,
so an application needs to choose a reasonable mapping from
user input. That may differ based on the particular
circumstances of a user, depending on locale, language, type
of input method, etc. It is up to the application to make
a reasonable choice."

In the case of canonically equivalent strings, however,
determining user intent is *NOT* a tricky business, and
the "reasonable choice" that every application should make
is to normalize the input strings ("map", if you like) to
NFC form, *before* processing according to RFC 5891. To
do otherwise would be to guarantee the kind of
head-scratching query that Abdulrahman's example presented.
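
That pre-processing step might be sketched as follows. This is only
an illustration under stated assumptions: map_to_nfc is a
hypothetical name, the labels are invented, and Python's "punycode"
codec (raw encoding, no "xn--" prefix) stands in for the later
protocol steps:

```python
import unicodedata

def map_to_nfc(user_input: str) -> str:
    """Hypothetical pre-processing step: normalize user input to NFC."""
    return unicodedata.normalize("NFC", user_input)

# Two canonically equivalent inputs, as a user might type them.
inputs = ["\u0623\u0628", "\u0627\u0654\u0628"]

# Normalize first, then encode; both inputs now follow the same path.
mapped = [map_to_nfc(s) for s in inputs]
encoded = [s.encode("punycode") for s in mapped]
both_agree = mapped[0] == mapped[1] and encoded[0] == encoded[1]
```

Normalizing up front means neither input is rejected later for not
being in NFC form, and both lead to the same encoded label.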

That is part of the pre-processing specified in UTS #46,
and it is the non-tricky, obvious, consensual part of
that pre-processing. It doesn't even touch upon the
more difficult issue of the transitional incompatibilities
between IDNA2003 and IDNA2008.

Incidentally, the mapping recommended in RFC 5895, while
not dealing with any of the tricky transition issues,
also clearly recommends:

  "3. All characters are mapped using Unicode Normalization
      Form C (NFC). ..."

As I said, this is the obvious, consensual part of the
problem -- and doesn't venture at all into the murkier
waters of language- or locale-specific DWIM issues.

--Ken

------------- End Forwarded Message -------------