Normalization of Hangul
Kenneth Whistler
kenw at sybase.com
Wed Feb 20 21:30:03 CET 2008
Harald asked:
> for those of us who don't have the whole Unicode standard in their
> brains at once:
> The NFKC and NFC algorithms depend on:
>
> - Decomposition as described in the standard section 3.5
> - UnicodeData.txt for its values
> - section 3.11 for the canonical reordering of combining marks
> - section 3.12 for decomposition of Hangul
> - Composition as described in UAX#15 section 5
> - UnicodeData.txt for its values
> - CompositionExclusions.txt for exceptions
> - section 3.12 of the standard for Hangul composition
>
> Is that the complete set of what one needs to read to implement NFKC and
> NFC,
Yes.
> or is there Yet Another Data File Or Algorithm we have overlooked?
No.
Another way of factoring this is as follows:
To implement Unicode Normalization, you must implement:
A. Decomposition (canonical and compatibility)
B. Calculation of starter and blocked status for each character
C. Recomposition
To implement canonical decomposition (D68, p. 96) and
compatibility decomposition (D65, p. 95), you must implement:
D. Recursive application of canonical and compatibility
mappings from UnicodeData.txt. (Field 5 of that file.)
E. Hangul decomposition as defined in Section 3.12, p. 122.
(And as Kent pointed out, the arithmetic algorithm there
is simply an optimization for what would be the logical
equivalent: a list of 11172 explicit canonical mappings
for Hangul syllables.)
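(As a sketch of that arithmetic, with the constants named as in
Section 3.12 -- this is illustrative, not a drop-in implementation:)

```python
# Arithmetic Hangul decomposition per The Unicode Standard, Section 3.12.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT          # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT          # 11172 precomposed syllables

def decompose_hangul(ch):
    """Return the full canonical decomposition of a precomposed Hangul
    syllable, or the character itself if it is not one."""
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return ch
    l = chr(L_BASE + s_index // N_COUNT)
    v = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    return l + v + (chr(T_BASE + t_index) if t_index else "")
```

For example, U+D55C (HAN) decomposes to the jamo sequence
U+1112 U+1161 U+11AB, exactly as a 11172-entry mapping table would say.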
F. Canonical Ordering, as defined in Section 3.11, p. 115,
and which depends only on the Canonical_Combining_Class
property from UnicodeData.txt. (Field 3 of that file.)
Calculation of starter and blocked status for each character
in the decomposed string requires only the Canonical_Combining_Class
property, which you already have in hand to implement Canonical Ordering.
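(Canonical Ordering itself amounts to a stable sort of each run of
non-starters by combining class. A sketch, leaning on Python's
unicodedata.combining() for the Canonical_Combining_Class values:)

```python
import unicodedata

def canonical_order(s):
    """Canonical Ordering (TUS Section 3.11): stable-sort each run of
    characters with ccc > 0 by Canonical_Combining_Class; starters
    (ccc == 0) act as run boundaries."""
    out, run = [], []
    for ch in s:
        if unicodedata.combining(ch) == 0:
            run.sort(key=unicodedata.combining)  # Python's sort is stable
            out += run + [ch]
            run = []
        else:
            run.append(ch)
    run.sort(key=unicodedata.combining)
    return "".join(out + run)
```

For instance, "a" + U+0301 (ccc 230) + U+0323 (ccc 220) reorders to
"a" + U+0323 + U+0301; marks with equal ccc keep their relative order.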
To implement recomposition, you must implement:
G. Pairwise testing of starter + character pairs against
canonical decompositions, the data for which you already
have from UnicodeData.txt to implement decomposition.
H. Hangul composition as defined in Section 3.12, p. 121.
I. Exception testing for pairs, based on the exceptions
listed in CompositionExclusions.txt.
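(Step H is again just arithmetic, the mirror image of Hangul
decomposition. A sketch, with the same constants as in Section 3.12:)

```python
# Arithmetic Hangul pair composition per TUS Section 3.12:
# <L, V> -> LV syllable, <LV, T> -> LVT syllable.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT
S_COUNT = L_COUNT * N_COUNT

def compose_hangul_pair(a, b):
    """Compose an <L, V> or <LV, T> pair arithmetically; return None
    when the pair does not compose."""
    x, y = ord(a), ord(b)
    if 0 <= x - L_BASE < L_COUNT and 0 <= y - V_BASE < V_COUNT:
        return chr(S_BASE + ((x - L_BASE) * V_COUNT + (y - V_BASE)) * T_COUNT)
    if (0 <= x - S_BASE < S_COUNT and (x - S_BASE) % T_COUNT == 0
            and 0 < y - T_BASE < T_COUNT):
        return chr(x + (y - T_BASE))
    return None
```

So U+1112 + U+1161 composes to U+D558, and U+D558 + U+11AB to U+D55C,
with no table lookup at all.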
That's it. (And yes it is complicated, and yes, UAX #15 is
unnecessarily obscure about laying it all out clearly -- but
the UTC is already on notice to address that in the future.)
If you are reimplementing Unicode Normalization from scratch,
there are some useful hints:
1. Implementing decomposition with runtime recursion through
the decomposition mappings is not very efficient. It
is easier and more efficient to pre-calculate and store
the full decompositions and then just look up the results
at runtime.
2. Likewise, the recomposition and handling of exception testing
from CompositionExclusions.txt can be handled most
efficiently if the entire recomposition is handled by
a precalculated pairwise composition table that incorporates
all the exceptions. This way you get the required answer
with a single lookup.
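(A sketch of building such a pair table by inverting the
two-character canonical decompositions. Rather than parsing
CompositionExclusions.txt and computing the full exclusions, this
shortcut simply asks NFC whether each pair really recomposes --
in a real build you would apply the exclusion data directly:)

```python
import sys
import unicodedata

def build_pairwise_table():
    """Map (starter, second char) -> composite, inverted from the
    two-character canonical decompositions in UnicodeData.txt.
    Excluded pairs are filtered by checking that NFC actually maps
    the decomposed pair back to the composite."""
    table = {}
    for cp in range(sys.maxunicode + 1):
        d = unicodedata.decomposition(chr(cp))
        if d and not d.startswith('<'):       # canonical mappings only
            parts = d.split()
            if len(parts) == 2:               # singletons never recompose
                a, b = (chr(int(h, 16)) for h in parts)
                if unicodedata.normalize('NFC', a + b) == chr(cp):
                    table[(a, b)] = chr(cp)
    return table
```

A single lookup then answers step G and step I at once: excluded
pairs such as U+0915 + U+093C (whose composite U+0958 is in
CompositionExclusions.txt) are simply absent from the table.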
3. Implementations concerned with speed may make further
optimizations -- in particular to avoid engaging the
step for Canonical Ordering when the outcome would be
unchanged. For that, there are precalculated property
values in DerivedNormalizationProps.txt which can be
used to speed things up. However, those properties and
that data file are not a *required* part of a conformant
implementation of Unicode Normalization.
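(For what it's worth, Python 3.8+ exposes exactly this kind of quick
check as unicodedata.is_normalized(), built on those derived
Quick_Check properties:)

```python
import unicodedata

def to_nfc(s):
    """Normalize to NFC, skipping the full pass when the quick check
    already says the string is normalized."""
    if unicodedata.is_normalized('NFC', s):
        return s
    return unicodedata.normalize('NFC', s)
```

In the common case of already-normalized input this returns without
decomposing, reordering, or recomposing anything.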
--Ken
More information about the Idna-update mailing list