Nameprep and NFKC

Vint Cerf vint at google.com
Sat Oct 16 14:58:10 CEST 2010


I think one has to view this from the perspective of trying to bring
coherence and stability to an inherently ambiguous situation. Keep in
mind that the essence of the DNS is matching of a registered string
with an input string. Normalization removes some degree of ambiguity,
making the matching process less computationally expensive. If I have
understood your example correctly, the variation goes away if you
first apply NFC to the various possible cases. IDNA2008 relies on each
candidate label being transformed into its NFC equivalent before
conversion to Punycode for lookup or registration. Failing to put the
two different strings through NFC before further processing is an
error.
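
To make the point concrete, here is a minimal Python sketch using the
code points from the example under discussion (U+0623 versus the
sequence U+0627, U+0654). The function names are hypothetical, and the
standard library "punycode" codec performs only the raw Punycode
encoding: it adds no "xn--" prefix and applies none of the IDNA2008
validity rules, so this illustrates the normalization step only.

```python
import unicodedata

# Two canonically equivalent spellings of the same letter.
PRECOMPOSED = "\u0623"        # single code point, already in NFC
DECOMPOSED = "\u0627\u0654"   # base letter + combining mark, not in NFC


def to_punycode(label: str) -> bytes:
    """Normalize a label to NFC, then Punycode-encode it.

    Illustrative only: no IDNA2008 validity checks are performed.
    """
    nfc = unicodedata.normalize("NFC", label)
    return nfc.encode("punycode")


def to_punycode_unnormalized(label: str) -> bytes:
    """Punycode-encode a label as-is, skipping normalization (the error case)."""
    return label.encode("punycode")


if __name__ == "__main__":
    # With NFC applied first, both spellings yield the same encoding.
    print(to_punycode(PRECOMPOSED) == to_punycode(DECOMPOSED))
    # Without it, the two spellings yield different encodings.
    print(to_punycode_unnormalized(PRECOMPOSED)
          == to_punycode_unnormalized(DECOMPOSED))
```

With NFC applied first, the two inputs become the same string before
encoding, so the variation disappears at the matching step.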

vint

On Sat, Oct 16, 2010 at 5:55 AM, Abdulrahman I. ALGhadir
<aghadir at citc.gov.sa> wrote:
> So basically, is that a flaw in the IDNA2008 demo, is it the normal
> behavior of IDNA2008, or is it a normalization (NFC) problem?
>
> -----Original Message-----
> From: Kenneth Whistler [mailto:kenw at sybase.com]
> Sent: 14/Oct/2010 1:38 AM
> To: klensin at jck.com
> Cc: idna-update at alvestrand.no; Abdulrahman I. ALGhadir; kenw at sybase.com
> Subject: RE: Nameprep and NFKC
>
> John Klensin responded to Shawn Steele's comment:
>
>> > Many people would use Unicode UTS#46 in addition to the
>> > IDNA2008 RFCs:
>> >
>> > http://www.unicode.org/reports/tr46/
>> >...
>>
>> Yes, and some of the tables Abdulrahman's note referred to are
>> dependent on UTR 46.  But this is exactly where mapping gets us
>> into trouble:
>>
>> -- Use of UTR 46 with IDNA2008 reduces some incompatibilities
>> with IDNA2003 and may cause a few others.   If I correctly
>> understand Abdulrahman's example, it may be one of those
>> incompatibilities.
>
> It is not.
>
> Abdulrahman's example is a straightforward case of NFC
> normalization.
>
> 0623 is canonically equivalent to the sequence <0627, 0654>
> and by the rules of Unicode normalization, a string <0623>
> is in NFC form, but the canonically equivalent string
> <0627, 0654> is not.
>
> As Patrik pointed out, "IDNA2008 require the string to be in NFC form".
> More specifically, Section 5.2 says that the result of the processing
> in that section MUST be a Unicode string in NFC form. And John also
> echoed this:
>
>> -- IDNA2008 itself (RFCs 5890-5893) is very clear that input
>> (and U-labels) must already be in NFC form.
>
> So Abdulrahman's example and question indicate to me a
> problem with an implementation -- not with either the
> RFCs or the mapping.
>
> Using a string <..., 0627, 0654, ...> as input to IDNA2008
> processing should simply result in an error, not in a different
> (but presumed valid) Punycode string than the one produced for the
> same string mapped to <..., 0623, ...>.
>
> But as a response to Shawn's comment, the following seems
> to me to be a non sequitur:
>
>> -- While "many people" would use UTR 46, "many people" may use
>> RFC 5895 instead, or no mapping at all.  They are not compatible.
>
> The way I think Abdulrahman's example should be taken is
> that local expectation (and every Unicode implementer's
> expectation, for that matter) is that an input string
> <..., 0627, 0654, ...> and an input string <..., 0623, ...>
> SHOULD be taken as behaving equivalently, because they
> are canonically equivalent, after all. You cannot satisfy
> that local expectation by simply feeding both strings to
> a conformant implementation of RFC 5891 unchanged, as noted
> above. You would get an error for the input which is not
> in NFC form.
>
> But the answer here is to be found in the rationale document,
> RFC 5894:
>
> "In principle, an application ought to take user input of a
> domain name and convert it to the set of Unicode code points
> that represent the domain name the user intends. As a practical
> matter, of course, determining user intent is a tricky business,
> so an application needs to choose a reasonable mapping from
> user input. That may differ based on the particular
> circumstances of a user, depending on locale, language, type
> of input method, etc. It is up to the application to make
> a reasonable choice."
>
> In the case of canonically equivalent strings, however,
> determining user intent is *NOT* a tricky business, and
> the "reasonable choice" that every application should make
> is to normalize the input strings ("map", if you like) to
> NFC form, *before* processing according to RFC 5891. To
> do otherwise would be to guarantee the kind of
> head-scratching query that Abdulrahman's example presented.
>
> That is part of the pre-processing specified in UTS #46,
> and it is the non-tricky, obvious, consensual part of
> that pre-processing. It doesn't even touch upon the
> more difficult issue of the transitional incompatibilities
> between IDNA2003 and IDNA2008.
>
> Incidentally, the mapping recommended in RFC 5895, while
> not dealing with any of the tricky transition issues,
> also clearly recommends:
>
>  "3. All characters are mapped using Unicode Normalization
>      Form C (NFC). ..."
>
> As I said, this is the obvious, consensual part of the
> problem -- and doesn't venture at all into the murkier
> waters of language- or locale-specific DWIM issues.
>
> --Ken
>
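
The two behaviours described in the quoted message (a strict
implementation that rejects input not in NFC form, and an application
that maps its input to NFC before handing it over) can be sketched in
Python as follows. This is only an illustration under stated
assumptions: NotNFCError, strict_check and map_then_check are
hypothetical names, and the NFC test stands in for the full RFC 5891
processing, which involves many more checks.

```python
import unicodedata


class NotNFCError(ValueError):
    """Raised when a label is not already in NFC form (hypothetical name)."""


def strict_check(label: str) -> str:
    """Stand-in for a strict protocol step: accept only NFC input.

    Only the NFC requirement is modelled here; real IDNA2008
    processing applies many further rules.
    """
    if unicodedata.normalize("NFC", label) != label:
        raise NotNFCError("label is not in NFC form")
    return label


def map_then_check(label: str) -> str:
    """Application-level choice: map to NFC first, then run the strict step."""
    return strict_check(unicodedata.normalize("NFC", label))


if __name__ == "__main__":
    for candidate in ("\u0623", "\u0627\u0654"):
        try:
            strict_check(candidate)
            print("strict: accepted")
        except NotNFCError:
            print("strict: rejected")
        # After mapping, both spellings are the same NFC string.
        print(map_then_check(candidate) == "\u0623")
```

Fed unchanged to the strict step, the decomposed spelling is rejected;
mapped to NFC first, both spellings are accepted as the same label.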

