Nameprep and NFKC

Sat Oct 16 11:55:11 CEST 2010

So basically, is that a flaw in the IDNA2008 demo, is it the normal
behavior of IDNA2008, or is it a normalization (NFC) problem?

-----Original Message-----
From: Kenneth Whistler [mailto:kenw at sybase.com] 
Sent: 14/Oct/2010 1:38 AM
To: klensin at jck.com
Cc: idna-update at lavestrand.no; Abdulrahman I. ALGhadir; kenw at sybase.com
Subject: RE: Nameprep and NFKC

John Klensin responded to Shawn Steele's comment:

> > Many people would use Unicode UTS#46 in addition to the
> > IDNA2008 RFCs:
> > 
> > http://www.unicode.org/reports/tr46/ 
> >...
> 
> Yes, and some of the tables Abdulrahman's note referred to are
> dependent on UTR 46.  But this is exactly where mapping gets us
> into trouble:
> 
> -- Use of UTR 46 with IDNA2008 reduces some incompatibilities
> with IDNA2003 and may cause a few others.   If I correctly
> understand Abdulrahman's example, it may be one of those
> incompatibilities.

It is not.

Abdulrahman's example is a straightforward case of NFC
normalization.

0623 is canonically equivalent to the sequence <0627, 0654>
and by the rules of Unicode normalization, a string <0623>
is in NFC form, but the canonically equivalent string 
<0627, 0654> is not.
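
The equivalence can be checked with a minimal Python sketch using the
standard unicodedata module. The variable names are illustrative only
and do not come from any IDNA implementation:

```python
import unicodedata

composed = "\u0623"          # ARABIC LETTER ALEF WITH HAMZA ABOVE
decomposed = "\u0627\u0654"  # ARABIC LETTER ALEF + ARABIC HAMZA ABOVE

# NFC composes the two-code-point sequence into the single code point.
nfc_of_decomposed = unicodedata.normalize("NFC", decomposed)

# A string is in NFC form exactly when NFC normalization leaves it unchanged.
composed_is_nfc = unicodedata.normalize("NFC", composed) == composed
decomposed_is_nfc = nfc_of_decomposed == decomposed
```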

As Patrik pointed out, "IDNA2008 require the string to be in NFC form".
More specifically, Section 5.2 of RFC 5891 says that the result of the
processing in that section MUST be a Unicode string in NFC form. And
John also echoed this:

> -- IDNA2008 itself (RFC 5890-5893) are very clear that input
> (and U-labels) must already be in NFC form.  

So Abdulrahman's example and question indicate to me a
problem with an implementation -- not with either the
RFCs or mapping.

Using a string <..., 0627, 0654, ...> as input to IDNA2008
processing should simply result in an error, not in a Punycode
output that is different from (but presumed as valid as) the
output for the same string mapped to <..., 0623, ...>.
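
As a rough illustration of that behavior, here is a minimal Python
sketch. The helper names encode_label_strict and rejects are
hypothetical, and only the NFC requirement is checked; the remaining
validity checks of RFC 5891 are deliberately left out. It relies on
Python's built-in "punycode" codec.

```python
import unicodedata

def encode_label_strict(label: str) -> str:
    """Hypothetical helper: refuse non-NFC input, then Punycode-encode it."""
    if unicodedata.normalize("NFC", label) != label:
        raise ValueError("label is not in NFC form")
    return "xn--" + label.encode("punycode").decode("ascii")

def rejects(label: str) -> bool:
    """Report whether the strict helper raises an error for this label."""
    try:
        encode_label_strict(label)
    except ValueError:
        return True
    return False

# An encoder that skips the NFC check produces two different outputs
# for the two canonically equivalent inputs.
naive_composed = "\u0623".encode("punycode")
naive_decomposed = "\u0627\u0654".encode("punycode")
```

With this sketch, the sequence <0627, 0654> is rejected while <0623>
is encoded, which is the outcome described above.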

But as a response to Shawn's comment, the following seems
to me to be a non sequitur:

> -- While "many people" would use UTR 46, "many people" may use
> RFC 5895 instead, or no mapping at all.  They are not compatible.

The way I think Abdulrahman's example should be taken is
that local expectation (and every Unicode implementer's
expectation, for that matter) is that an input string
<..., 0627, 0654, ...> and an input string <..., 0623, ...>
SHOULD be taken as behaving equivalently, because they
are canonically equivalent, after all. You cannot satisfy
that local expectation by simply feeding both strings to
a conformant implementation of RFC 5891 unchanged, as noted
above. You would get an error for the input which is not
in NFC form.

But the answer here is to be found in the rationale document,
RFC 5894:

"In principle, an application ought to take user input of a
domain name and convert it to the set of Unicode code points
that represent the domain name the user intends. As a practical
matter, of course, determining user intent is a tricky business,
so an application needs to choose a reasonable mapping from
user input. That may differ based on the particular
circumstances of a user, depending on locale, language, type
of input method, etc. It is up to the application to make
a reasonable choice."

In the case of canonically equivalent strings, however,
determining user intent is *NOT* a tricky business, and
the "reasonable choice" that every application should make
is to normalize the input strings ("map", if you like) to
NFC form, *before* processing according to RFC 5891. To
do otherwise would be to guarantee the kind of
head-scratching query that Abdulrahman's example presented.
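
A minimal Python sketch of that "normalize first" choice follows. The
function names to_nfc and punycode_after_nfc are hypothetical, and the
Punycode step merely stands in for the full RFC 5891 processing, which
is not implemented here:

```python
import unicodedata

def to_nfc(user_input: str) -> str:
    """Map user input to NFC before any IDNA2008 processing."""
    return unicodedata.normalize("NFC", user_input)

def punycode_after_nfc(user_input: str) -> str:
    """Normalize to NFC, then Punycode-encode (stand-in for RFC 5891 steps)."""
    return to_nfc(user_input).encode("punycode").decode("ascii")

# Both canonically equivalent inputs now lead to the same encoded form.
from_composed = punycode_after_nfc("\u0623")
from_decomposed = punycode_after_nfc("\u0627\u0654")
```

Under this assumption the two inputs are indistinguishable after
pre-processing, which matches the local expectation described earlier.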

That is part of the pre-processing specified in UTS #46,
and it is the non-tricky, obvious, consensual part of
that pre-processing. It doesn't even touch upon the
more difficult issue of the transitional incompatibilities
between IDNA2003 and IDNA2008.

Incidentally, the mapping recommended in RFC 5895, while
not dealing with any of the tricky transition issues,
also clearly recommends:

  "3. All characters are mapped using Unicode Normalization
      Form C (NFC). ..."

As I said, this is the obvious, consensual part of the
problem -- and doesn't venture at all into the murkier
waters of language- or locale-specific DWIM issues.

--Ken