toNFKC(toCaseFold(toNFKC(cp))) != cp and toNFKC failures

Mark Davis ☕ mark at macchiato.com
Fri May 27 18:26:28 CEST 2011


For ICU I filed a bug.

Mark

*— Il meglio è l’inimico del bene —*


On Fri, May 27, 2011 at 08:40, Simon Josefsson <simon at josefsson.org> wrote:

> Mark Davis ☕ <mark at macchiato.com> writes:
>
> > toNFKC is defined over all code points, including U+D800.
>
> At least two common NFKC implementations, Libunistring and ICU, appear
> to reject U+D800.  For ICU, the link below illustrates this.
>
> > The condition is moot anyway, since D800 is excluded by other clauses.
>
> Yes, the final 'Else DISALLOWED' clause, according to my code.
>
> /Simon
>
> >
> > Mark
> >
> > *— Il meglio è l’inimico del bene —*
> >
> >
> > On Fri, May 27, 2011 at 02:43, Simon Josefsson <simon at josefsson.org> wrote:
> >
> >> I'm looking at RFC 5892 section 2.2 which says:
> >>
> >>   2.2.  Unstable (B)
> >>
> >>   B: toNFKC(toCaseFold(toNFKC(cp))) != cp
> >>
> >>   This category is used to group the characters that are not stable
> >>   under Normalization Form K (NFKC) and case folding.  In general,
> >>   these code points are not suitable for use for IDN.
> >>
> >>   The toCaseFold() operation is defined in Section 3.13 of The Unicode
> >>   Standard [Unicode].
> >>
> >>   The toNFKC() operation returns the code point in normalization form
> >>   KC.  For more information, see Section 5 of Unicode Standard Annex
> >>   #15 [TR15].
> >>
> >>   It should be noted that NFKC is used, although Normalization Form C
> >>   (NFC) is used in the "IDNA Protocol" document [RFC5891].
> >>
> >> The toNFKC operation fails for some code points that aren't characters.
> >> For example U+D800 is not a character, and normalization will fail:
> >>
> >> http://demo.icu-project.org/icu-bin/nbrowser?t=&s=D800&uv=0
> >>
> >> How should the "Unstable" property be evaluated when toNFKC fails?
> >>
> >> Am I correct in using toNFKC(cp) = UNDEFINED for this situation, and
> >> specifying that toCaseFold(UNDEFINED) = UNDEFINED and toNFKC(UNDEFINED) =
> >> UNDEFINED, and also that UNDEFINED is never equal to any code point?
> >>
> >> /Simon
> >> _______________________________________________
> >> Idna-update mailing list
> >> Idna-update at alvestrand.no
> >> http://www.alvestrand.no/mailman/listinfo/idna-update
> >>
>
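
The two readings discussed above can be compared with a minimal Python
sketch. It assumes Python's standard unicodedata.normalize() and
str.casefold() as stand-ins for toNFKC and toCaseFold; the helper names
and the "strict" flag are hypothetical and only illustrate the thread.
With strict=False, toNFKC is treated as defined over all code points.
With strict=True, surrogates yield UNDEFINED (modelled as None), which
propagates and never equals any code point, as proposed in the original
question.

```python
import unicodedata
from typing import Optional


def to_nfkc(s: Optional[str], strict: bool = False) -> Optional[str]:
    """toNFKC stand-in; None models UNDEFINED (hypothetical convention)."""
    if s is None:
        return None  # toNFKC(UNDEFINED) = UNDEFINED
    if strict and any(0xD800 <= ord(c) <= 0xDFFF for c in s):
        return None  # mimic an implementation that rejects surrogates
    return unicodedata.normalize("NFKC", s)


def to_casefold(s: Optional[str]) -> Optional[str]:
    """toCaseFold stand-in; propagates UNDEFINED."""
    return None if s is None else s.casefold()


def is_unstable(cp: int, strict: bool = False) -> bool:
    """RFC 5892 section 2.2: toNFKC(toCaseFold(toNFKC(cp))) != cp."""
    ch = chr(cp)
    result = to_nfkc(to_casefold(to_nfkc(ch, strict)), strict)
    # None (UNDEFINED) is never equal to a code point, so it counts as unstable.
    return result != ch
```

Under this sketch, U+D800 is stable when toNFKC is total and unstable
when it is UNDEFINED. As noted in the thread, the difference is moot for
the final result, since U+D800 ends up DISALLOWED by other clauses either
way.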