toNFKC(toCaseFold(toNFKC(cp))) != cp and toNFKC failures

Simon Josefsson simon at josefsson.org
Fri May 27 11:43:46 CEST 2011


I'm looking at RFC 5892 section 2.2 which says:

   2.2.  Unstable (B)

   B: toNFKC(toCaseFold(toNFKC(cp))) != cp

   This category is used to group the characters that are not stable
   under Normalization Form K (NFKC) and case folding.  In general,
   these code points are not suitable for use for IDN.

   The toCaseFold() operation is defined in Section 3.13 of The Unicode
   Standard [Unicode].

   The toNFKC() operation returns the code point in normalization form
   KC.  For more information, see Section 5 of Unicode Standard Annex
   #15 [TR15].

   It should be noted that NFKC is used, although Normalization Form C
   (NFC) is used in the "IDNA Protocol" document [RFC5891].

The toNFKC operation fails for some code points that aren't characters.
For example, U+D800 is a surrogate code point rather than a character,
and normalization fails on it:

http://demo.icu-project.org/icu-bin/nbrowser?t=&s=D800&uv=0

How should the "Unstable" property be evaluated when toNFKC fails?

Am I correct in defining toNFKC(cp) = UNDEFINED for this situation,
specifying that toCaseFold(UNDEFINED) = UNDEFINED and toNFKC(UNDEFINED) =
UNDEFINED, and treating UNDEFINED as never equal to any code point?
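
To make the proposal concrete, here is a minimal Python sketch of that
interpretation. It is only an illustration, not anything RFC 5892 specifies.
The names UNDEFINED, to_nfkc, to_case_fold and is_unstable are hypothetical.
It assumes that "toNFKC fails" means "the input contains a surrogate code
point" (Python's own unicodedata.normalize does not raise on lone surrogates,
so the failure is modelled explicitly), and that str.casefold() is an
acceptable stand-in for toCaseFold(). Results depend on the Unicode version
shipped with the Python interpreter, which may differ from the version an
IDNA table is generated for.

```python
import unicodedata

# Hypothetical sentinel standing in for "the operation failed".
UNDEFINED = None


def to_nfkc(s):
    """NFKC-normalize s, propagating UNDEFINED and failing on surrogates."""
    if s is UNDEFINED:
        return UNDEFINED
    # Assumption: surrogate code points (U+D800..U+DFFF) are the inputs
    # for which normalization is treated as failing.
    if any(0xD800 <= ord(ch) <= 0xDFFF for ch in s):
        return UNDEFINED
    return unicodedata.normalize("NFKC", s)


def to_case_fold(s):
    """Case-fold s, propagating UNDEFINED."""
    if s is UNDEFINED:
        return UNDEFINED
    return s.casefold()


def is_unstable(cp):
    """Evaluate B: toNFKC(toCaseFold(toNFKC(cp))) != cp for an integer cp."""
    original = chr(cp)
    result = to_nfkc(to_case_fold(to_nfkc(original)))
    if result is UNDEFINED:
        # UNDEFINED is never equal to any code point, so "!=" holds.
        return True
    return result != original
```

Under these assumptions is_unstable(0xD800) is True, because UNDEFINED
propagates through all three operations and never compares equal to the
input. is_unstable(ord("A")) is also True (case folding yields "a"), while
is_unstable(ord("a")) is False.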

/Simon

