Normalization of Hangul

Wed Feb 20 21:29:37 CET 2008

This may not be all that significant, but I would like to share my personal
experience. I recently had a chance to implement all of the Unicode
Normalization Forms, plus simple case conversions, from scratch.

The work was intricate in places but fairly straightforward overall, and
the only documents I needed to read were what Mark wrote in his email plus
some of the documents listed in the References section of UAX #15.

The normalization test data, NormalizationTest.txt, provided by the Unicode
Consortium was also quite useful in verifying the implementation.
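
As an illustration of how that file can drive verification, here is a
minimal sketch in C. It is not taken from the OpenSolaris sources; the names
nt_record, nt_parse_line, nt_check and nt_normalize_fn, and the 32 code
point limit per column, are assumptions made for this example. Each data
line holds five semicolon-separated columns (c1 to c5) of hexadecimal code
points, and the header of the file states the invariants, for example
c2 == toNFC(c1) == toNFC(c2) == toNFC(c3) and c4 == toNFC(c4) == toNFC(c5).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NT_FIELDS 5
#define NT_MAX_CP 32u   /* assumed upper bound on code points per column */

typedef struct {
    uint32_t cp[NT_FIELDS][NT_MAX_CP];  /* columns c1..c5 */
    size_t   len[NT_FIELDS];            /* code points used in each column */
} nt_record;

/* Parse one line of NormalizationTest.txt.
 * Returns 1 for a data line, 0 for a comment, "@Part" header or blank
 * line, and -1 if the line is malformed. */
int nt_parse_line(const char *line, nt_record *rec)
{
    assert(line != NULL && rec != NULL);
    if (line[0] == '#' || line[0] == '@' || line[0] == '\n' || line[0] == '\0')
        return 0;

    const char *p = line;
    for (int f = 0; f < NT_FIELDS; f++) {
        rec->len[f] = 0;
        while (*p != ';') {
            char *end;
            if (*p == '\0')
                return -1;              /* line ended before the ';' */
            unsigned long v = strtoul(p, &end, 16);
            if (end == p || v > 0x10FFFFul || rec->len[f] == NT_MAX_CP)
                return -1;
            rec->cp[f][rec->len[f]++] = (uint32_t)v;
            p = end;
            while (*p == ' ')
                p++;
        }
        p++;                            /* skip the ';' */
    }
    return 1;
}

/* Hypothetical normalizer signature: writes at most cap code points to
 * out and returns the length of the normalized sequence. */
typedef size_t (*nt_normalize_fn)(const uint32_t *in, size_t n,
                                  uint32_t *out, size_t cap);

/* Check that fn(c1), fn(c2), fn(c3) equal column want_a and that fn(c4),
 * fn(c5) equal column want_b (0-based column indexes).
 * NFC: (1, 3)   NFD: (2, 4)   NFKC: (3, 3)   NFKD: (4, 4) */
int nt_check(const nt_record *rec, nt_normalize_fn fn, int want_a, int want_b)
{
    uint32_t out[NT_MAX_CP * 4];
    for (int f = 0; f < NT_FIELDS; f++) {
        int want = (f < 3) ? want_a : want_b;
        size_t n = fn(rec->cp[f], rec->len[f], out, sizeof out / sizeof out[0]);
        if (n != rec->len[want] ||
            memcmp(out, rec->cp[want], n * sizeof out[0]) != 0)
            return 0;
    }
    return 1;
}
```

A test driver would read the file line by line, skip the lines for which
nt_parse_line returns 0, and call nt_check once per normalization form with
the normalizer under test.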

FWIW, and in case someone wants to use it, the source code and related
information are available at the following locations:

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/common/unicode/u8_textprep.c
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/sys/u8_textprep.h
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/sys/u8_textprep_data.h

http://www.opensolaris.org/os/community/arc/caselog/2007/149/
http://www.opensolaris.org/os/community/arc/caselog/2007/458/

The source code is in C; it should be quite portable and can easily be
extracted as standalone code. The fourth URL from the top will open up
within a couple of days. The last URL has pointers to the manual pages for
the functions.

Ienup

Mark Davis wrote at 02/20/08 07:16:
> Yes, those sections are what is required. All told:
> 
>     * NFC and NFKC are defined in Section 5 of UAX#15
>       (http://unicode.org/reports/tr15/#Specification), which:
>           o references Canonical Decomposition
>           o defines the composition processes
>     * Canonical Decomposition is defined on p 96 of U5.0
>           o /D68 Canonical decomposition: The decomposition of a
>             character that results from recursively applying the
>             canonical mappings found in the Unicode Character Database
>             and those described in Section 3.12, Conjoining Jamo
>             Behavior, until no characters can be further decomposed, and
>             then reordering nonspacing marks according to Section 3.11,
>             Canonical Ordering Behavior./
>     * Data
>           o UnicodeData.txt
>           o CompositionExclusions.txt
> 
> Mark
> 
> Note: There is still final editorial work being done on 
> http://www.unicode.org/reports/tr15/tr15-28.html, so if you have any 
> suggestions for editorial clarifications, now's the time!
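
The Hangul part of D68 needs no table data: Section 3.12, Conjoining Jamo
Behavior, defines it arithmetically. The following is a minimal sketch in C
of that arithmetic, using the constants from Section 3.12. It is not taken
from u8_textprep.c; the function names hangul_decompose and
hangul_compose_pair are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Constants from Section 3.12, Conjoining Jamo Behavior. */
#define S_BASE  0xAC00u   /* first precomposed Hangul syllable */
#define L_BASE  0x1100u   /* first leading consonant */
#define V_BASE  0x1161u   /* first vowel */
#define T_BASE  0x11A7u   /* one before the first trailing consonant */
#define L_COUNT 19u
#define V_COUNT 21u
#define T_COUNT 28u
#define N_COUNT (V_COUNT * T_COUNT)   /* 588 */
#define S_COUNT (L_COUNT * N_COUNT)   /* 11172 */

/* Decompose a precomposed Hangul syllable into its conjoining jamo.
 * Writes 2 (LV) or 3 (LVT) code points to out and returns that count,
 * or returns 0 if s is not a precomposed Hangul syllable. */
size_t hangul_decompose(uint32_t s, uint32_t out[3])
{
    assert(out != NULL);
    if (s < S_BASE || s >= S_BASE + S_COUNT)
        return 0;

    uint32_t s_index = s - S_BASE;
    uint32_t t_index = s_index % T_COUNT;

    out[0] = L_BASE + s_index / N_COUNT;
    out[1] = V_BASE + (s_index % N_COUNT) / T_COUNT;
    if (t_index == 0)
        return 2;                      /* LV syllable, no trailing consonant */
    out[2] = T_BASE + t_index;
    return 3;
}

/* Compose one adjacent pair: L + V gives LV, and LV + T gives LVT.
 * Returns the composed syllable, or 0 if the pair does not compose. */
uint32_t hangul_compose_pair(uint32_t a, uint32_t b)
{
    if (a >= L_BASE && a < L_BASE + L_COUNT &&
        b >= V_BASE && b < V_BASE + V_COUNT)
        return S_BASE + ((a - L_BASE) * V_COUNT + (b - V_BASE)) * T_COUNT;

    if (a >= S_BASE && a < S_BASE + S_COUNT &&
        (a - S_BASE) % T_COUNT == 0 &&          /* a must be an LV syllable */
        b > T_BASE && b < T_BASE + T_COUNT)
        return a + (b - T_BASE);

    return 0;
}
```

For example, U+D55C decomposes to U+1112 U+1161 U+11AB, and composing those
three jamo pairwise from the left gives back U+D55C. A full normalizer
would call hangul_compose_pair from inside its general composition loop.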