plea for NFKC & case-folding, suggestions for definitions

Tue Mar 3 07:28:33 CET 2009

Harald Alvestrand <harald at alvestrand.no> wrote:

> do you think that the canonicalization function you're positing can be
> described in a Unicode-version-independent way?

I think it can, but I'll need help from our resident Unicode experts to
provide the definition.

The Unicode standard promises normalization stability from
version 4.1 onward, and case-folding stability from version
5.0 onward.  Section 3.13 of version 5.0 suggests this for
compatibility-normalize-and-case-fold:

     NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))

I don't know why so many iterations are needed, and I don't know why the
first step is NFD rather than NFKD.

We would of course want to change the last NFKD (or both of them) to
NFKC.
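
As a rough sketch only, the composition with the final step changed to
NFKC might look like this in Python.  The name compat_casefold is made
up for illustration, and I am assuming that Python's str.casefold() is
an acceptable stand-in for toCasefold.  Python's unicodedata module uses
whatever Unicode version the interpreter was built with, not 5.0, so
this shows the shape of the function rather than a version-pinned
definition.

```python
import unicodedata


def compat_casefold(x: str) -> str:
    """Sketch of NFKC(toCasefold(NFKD(toCasefold(NFD(x))))).

    Assumption: str.casefold() stands in for toCasefold, and the
    Unicode data is whatever version this Python build ships with.
    """
    step = unicodedata.normalize("NFD", x)      # canonical decomposition
    step = step.casefold()                      # first case fold
    step = unicodedata.normalize("NFKD", step)  # compatibility decomposition
    step = step.casefold()                      # fold anything NFKD exposed
    return unicodedata.normalize("NFKC", step)  # final step: NFKC, not NFKD


if __name__ == "__main__":
    # U+3392 SQUARE MHZ has no case folding of its own, but NFKD turns
    # it into "MHz", which only the second fold lowercases.
    print(compat_casefold("\u3392"))  # mhz
    print(compat_casefold("\ufb01"))  # fi  (U+FB01 LATIN SMALL LIGATURE FI)
```

The U+3392 case is at least one input where the second toCasefold does
real work, which may be part of the reason for the repeated steps.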

After applying this function, we would then check the result for
disallowed code points and other violations (described in the existing
drafts), and either return it or return an error.
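
The canonicalize-then-check flow could be sketched as follows, again in
Python.  The names canonicalize, prepare, and DISALLOWED_CATEGORIES are
hypothetical, and the category set is only a placeholder: the real
prohibitions would come from the existing drafts, not from this list.

```python
import unicodedata

# Placeholder policy, NOT the rules from the drafts: reject control,
# private-use, surrogate, and unassigned code points.
DISALLOWED_CATEGORIES = {"Cc", "Co", "Cs", "Cn"}


def canonicalize(x: str) -> str:
    """Compatibility-normalize and case-fold, ending in NFKC (sketch)."""
    step = unicodedata.normalize("NFD", x).casefold()
    step = unicodedata.normalize("NFKD", step).casefold()
    return unicodedata.normalize("NFKC", step)


def prepare(x: str) -> str:
    """Return the canonical form of x, or raise ValueError on a violation."""
    result = canonicalize(x)
    for ch in result:
        if unicodedata.category(ch) in DISALLOWED_CATEGORIES:
            raise ValueError(f"disallowed code point U+{ord(ch):04X}")
    return result
```

Note that the check runs on the output of the canonicalization, not on
the input, so a code point that is mapped away is never seen by it.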

We'd have to consider & specify what to do about the handful of obscure
incompatibilities between Unicode 3.2 and 5.0, but we should never have
to do that again because of the newer Unicode stability policies.

Unicode experts, please comment on the feasibility of this approach, and
suggest ways to simplify it if you can think of any.

Thanks,
AMC