Normalization of Hangul
Kenneth Whistler
kenw at sybase.com
Wed Feb 20 21:30:03 CET 2008
Harald asked:
> for those of us who don't have the whole Unicode standard in their
> brains at once:
> The NFKC and NFC algorithms depend on:
>
> - Decomposition as described in the standard section 3.5
> - UnicodeData.txt for its values
> - section 3.11 for the canonical reordering of combining marks
> - section 3.12 for decomposition of Hangul
> - Composition as described in UAX#15 section 5
> - UnicodeData.txt for its values
> - CompositionExclusions.txt for exceptions
> - section 3.12 of the standard for Hangul composition
>
> Is that the complete set of what one needs to read to implement NFKC and
> NFC,
Yes.
> or is there Yet Another Data File Or Algorithm we have overlooked?
No.
Another way of factoring this is as follows:
To implement Unicode Normalization, you must implement:
A. Decomposition (canonical and compatibility)
B. Calculation of starter and blocked status for each character
C. Recomposition
To implement canonical decomposition (D68, p. 96) and
compatibility decomposition (D65, p. 95), you must implement:
D. Recursive application of canonical and compatibility
mappings from UnicodeData.txt. (Field 5 of that file.)
E. Hangul decomposition as defined in Section 3.12, p. 122.
(And as Kent pointed out, the arithmetic algorithm there
is simply an optimization for what would be the logical
equivalent: a list of 11172 explicit canonical mappings
for Hangul syllables.)
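(As a sketch of that arithmetic, with the constants named as in
Section 3.12 -- this is illustrative, not a drop-in implementation:)

```python
# Arithmetic Hangul decomposition per The Unicode Standard, Section 3.12.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT          # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT          # 11172 precomposed syllables

def decompose_hangul(ch):
    """Return the full canonical decomposition of a precomposed Hangul
    syllable, or the character itself if it is not one."""
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return ch
    l = chr(L_BASE + s_index // N_COUNT)
    v = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    return l + v + (chr(T_BASE + t_index) if t_index else "")
```

For example, U+D55C (HAN) decomposes to the jamo sequence
U+1112 U+1161 U+11AB, exactly as a 11172-entry mapping table would say.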
F. Canonical Ordering, as defined in Section 3.11, p. 115,
and which depends only on the Canonical_Combining_Class
property from UnicodeData.txt. (Field 3 of that file.)
Calculation of starter and blocked status for each character
in the decomposed string requires only the Canonical_Combining_Class
property, which you already have in hand to implement Canonical Ordering.
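(Canonical Ordering itself amounts to a stable sort of each run of
non-starters by combining class. A sketch, leaning on Python's
unicodedata.combining() for the Canonical_Combining_Class values:)

```python
import unicodedata

def canonical_order(s):
    """Canonical Ordering (TUS Section 3.11): stable-sort each run of
    characters with ccc > 0 by Canonical_Combining_Class; starters
    (ccc == 0) act as run boundaries."""
    out, run = [], []
    for ch in s:
        if unicodedata.combining(ch) == 0:
            run.sort(key=unicodedata.combining)  # Python's sort is stable
            out += run + [ch]
            run = []
        else:
            run.append(ch)
    run.sort(key=unicodedata.combining)
    return "".join(out + run)
```

For instance, "a" + U+0301 (ccc 230) + U+0323 (ccc 220) reorders to
"a" + U+0323 + U+0301; marks with equal ccc keep their relative order.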
To implement recomposition, you must implement:
G. Pairwise testing of starter + character pairs against
canonical decompositions, the data for which you already
have from UnicodeData.txt to implement decomposition.
H. Hangul composition as defined in Section 3.12, p. 121.
I. Exception testing for pairs, based on the exceptions
listed in CompositionExclusions.txt.
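(Step H is again just arithmetic, the mirror image of Hangul
decomposition. A sketch, with the same constants as in Section 3.12:)

```python
# Arithmetic Hangul pair composition per TUS Section 3.12:
# <L, V> -> LV syllable, <LV, T> -> LVT syllable.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT
S_COUNT = L_COUNT * N_COUNT

def compose_hangul_pair(a, b):
    """Compose an <L, V> or <LV, T> pair arithmetically; return None
    when the pair does not compose."""
    x, y = ord(a), ord(b)
    if 0 <= x - L_BASE < L_COUNT and 0 <= y - V_BASE < V_COUNT:
        return chr(S_BASE + ((x - L_BASE) * V_COUNT + (y - V_BASE)) * T_COUNT)
    if (0 <= x - S_BASE < S_COUNT and (x - S_BASE) % T_COUNT == 0
            and 0 < y - T_BASE < T_COUNT):
        return chr(x + (y - T_BASE))
    return None
```

So U+1112 + U+1161 composes to U+D558, and U+D558 + U+11AB to U+D55C,
with no table lookup at all.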
That's it. (And yes it is complicated, and yes, UAX #15 is
unnecessarily obscure about laying it all out clearly -- but
the UTC is already on notice to address that in the future.)
If you are reimplementing Unicode Normalization from scratch,
there are some useful hints:
1. Implementing decomposition with runtime recursion through
the decomposition mappings is not very efficient. It
is easier and more efficient to pre-calculate and store
the full decompositions and then just look up the results
at runtime.
2. Likewise, the recomposition and handling of exception testing
from CompositionExclusions.txt can be handled most
efficiently if the entire recomposition is handled by
a precalculated pairwise composition table that incorporates
all the exceptions. This way you get the required answer
with a single lookup.
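(A sketch of building such a pair table by inverting the
two-character canonical decompositions. Rather than parsing
CompositionExclusions.txt and computing the full exclusions, this
shortcut simply asks NFC whether each pair really recomposes --
in a real build you would apply the exclusion data directly:)

```python
import sys
import unicodedata

def build_pairwise_table():
    """Map (starter, second char) -> composite, inverted from the
    two-character canonical decompositions in UnicodeData.txt.
    Excluded pairs are filtered by checking that NFC actually maps
    the decomposed pair back to the composite."""
    table = {}
    for cp in range(sys.maxunicode + 1):
        d = unicodedata.decomposition(chr(cp))
        if d and not d.startswith('<'):       # canonical mappings only
            parts = d.split()
            if len(parts) == 2:               # singletons never recompose
                a, b = (chr(int(h, 16)) for h in parts)
                if unicodedata.normalize('NFC', a + b) == chr(cp):
                    table[(a, b)] = chr(cp)
    return table
```

A single lookup then answers step G and step I at once: excluded
pairs such as U+0915 + U+093C (whose composite U+0958 is in
CompositionExclusions.txt) are simply absent from the table.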
3. Implementations concerned with speed may make further
optimizations -- in particular to avoid engaging the
step for Canonical Ordering when the outcome would be
unchanged. For that, there are precalculated property
values in DerivedNormalizationProps.txt which can be
used to speed things up. However, those properties and
that data file are not a *required* part of a conformant
implementation of Unicode Normalization.
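(For what it's worth, Python 3.8+ exposes exactly this kind of quick
check as unicodedata.is_normalized(), built on those derived
Quick_Check properties:)

```python
import unicodedata

def to_nfc(s):
    """Normalize to NFC, skipping the full pass when the quick check
    already says the string is normalized."""
    if unicodedata.is_normalized('NFC', s):
        return s
    return unicodedata.normalize('NFC', s)
```

In the common case of already-normalized input this returns without
decomposing, reordering, or recomposing anything.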
--Ken
More information about the Idna-update mailing list