Normalization of Hangul

Patrik Fältström patrik at frobbit.se
Wed Feb 20 23:20:02 CET 2008


The files I am reading to be able to create the tables document are:

CompositionExclusions.txt
UnicodeData.txt
CaseFolding.txt
Blocks.txt
Scripts.txt
PropList.txt
DerivedCoreProperties.txt

After moving from using Other_Default_Ignorable... to
Default_Ignorable, I do not think I need PropList.txt anymore, but I
am still reading it "just in case".
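For concreteness, here is a minimal sketch of how the canonical decomposition mappings can be read out of UnicodeData.txt: each row is semicolon-separated, field 5 holds the decomposition, and compatibility mappings carry a <tag> prefix and are skipped here. The function name is mine, not from any existing tool:

```python
def parse_canonical_decompositions(lines):
    """Map code point -> canonical decomposition from UnicodeData.txt rows.

    Hypothetical helper for illustration; field 5 of each semicolon-
    separated row is the decomposition mapping, and compatibility
    mappings (prefixed with a <tag> such as <noBreak>) are ignored.
    """
    mappings = {}
    for line in lines:
        fields = line.split(";")
        if len(fields) < 6:
            continue                      # skip malformed/blank rows
        decomp = fields[5]
        if decomp and not decomp.startswith("<"):
            mappings[int(fields[0], 16)] = [int(c, 16) for c in decomp.split()]
    return mappings
```

Usage would be something like `parse_canonical_decompositions(open("UnicodeData.txt"))`, assuming the file sits in the working directory.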

In later versions of the tables document I will include clearer
references to the files that are needed.

I have tried to use a taxonomy for the various properties so that the
aliases files are not needed. Those aliases do not make the Unicode
Standard easier to read, by the way.

   Patrik

On 20 feb 2008, at 16.16, Mark Davis wrote:

> Yes, those sections are what is required. All told:
>
>  - NFC and NFKC are defined in Section 5 of UAX #15
>    (http://unicode.org/reports/tr15/#Specification), which:
>     - references Canonical Decomposition
>     - defines the composition processes
>  - Canonical Decomposition is defined on p. 96 of Unicode 5.0:
>     - *D68 Canonical decomposition: The decomposition of a character
>       that results from recursively applying the canonical mappings
>       found in the Unicode Character Database and those described in
>       Section 3.12, Conjoining Jamo Behavior, until no characters can
>       be further decomposed, and then reordering nonspacing marks
>       according to Section 3.11, Canonical Ordering Behavior.*
>  - Data:
>     - UnicodeData.txt
>     - CompositionExclusions.txt
>
> Mark
>
> Note: There is still final editorial work being done on
> http://www.unicode.org/reports/tr15/tr15-28.html, so if you have any
> suggestions for editorial clarifications, now's the time!
>
>
>
> On Feb 20, 2008 12:04 AM, Harald Alvestrand <harald at alvestrand.no>  
> wrote:
>
>> Kenneth Whistler wrote:
>>> Patrik asked:
>>>
>>>
>>>> Is there a different specification of the normalization algorithm
>>>> for Hangul than what now exists, which is an algorithm specified
>>>> on the assumption that one knows how integer arithmetic in Java
>>>> works?
>>>>
>>>
>>> Well, the specification of exactly how Hangul decomposition
>>> and (re)composition works is in Section 3.12 of the standard,
>>> pp. 121-123. That doesn't depend on integer arithmetic in Java.
>>> All you do is plug the decomposition and composition rules
>>> for Hangul into the relevant part of the UAX #15 normalization
>>> that requires decomposition or composition of strings.
>>>
>> Ken,
>>
>> for those of us who don't have the whole Unicode standard in their
>> brains at once:
>> The NFKC and NFC algorithms depend on:
>>
>> - Decomposition, as described in Section 3.5 of the standard
>>    - UnicodeData.txt for its values
>>    - Section 3.11 for the canonical reordering of combining marks
>>    - Section 3.12 for the decomposition of Hangul
>> - Composition, as described in UAX #15 Section 5
>>    - UnicodeData.txt for its values
>>    - CompositionExclusions.txt for exceptions
>>    - Section 3.12 of the standard for Hangul composition
>>
>> Is that the complete set of what one needs to read to implement  
>> NFKC and
>> NFC, or is there Yet Another Data File Or Algorithm we have  
>> overlooked?
>>> Normally, of course, you just depend on a library API that
>>> does the normalization for you (including the proper handling
>>> of Hangul).
>> Irrelevant for the purpose of writing a standard. Very useful for
>> testing it.
>>
>>              Harald
>> _______________________________________________
>> Idna-update mailing list
>> Idna-update at alvestrand.no
>> http://www.alvestrand.no/mailman/listinfo/idna-update
>>
>
>
>
> -- 
> Mark
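The Section 3.12 arithmetic the thread keeps pointing at is small enough to sketch in full. This is an illustration in Python using the constants from the standard (SBase = AC00, LBase = 1100, VBase = 1161, TBase = 11A7, and the L/V/T counts 19/21/28); the function names are mine, not the standard's:

```python
# Constants from The Unicode Standard, Section 3.12, Conjoining Jamo Behavior.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT      # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT      # 11172 precomposed syllables in total

def decompose_hangul(s):
    """Arithmetically decompose a precomposed Hangul syllable into jamo."""
    s_index = s - S_BASE
    if not 0 <= s_index < S_COUNT:
        return [s]                              # not a precomposed syllable
    l = L_BASE + s_index // N_COUNT             # leading consonant
    v = V_BASE + (s_index % N_COUNT) // T_COUNT # vowel
    t = T_BASE + s_index % T_COUNT              # trailing consonant, if any
    return [l, v] if t == T_BASE else [l, v, t]

def compose_hangul(first, second):
    """Compose an <L, V> or <LV, T> jamo pair, or return None otherwise."""
    l_index = first - L_BASE
    if 0 <= l_index < L_COUNT:
        v_index = second - V_BASE
        if 0 <= v_index < V_COUNT:              # L + V -> LV syllable
            return S_BASE + (l_index * V_COUNT + v_index) * T_COUNT
    s_index = first - S_BASE
    if 0 <= s_index < S_COUNT and s_index % T_COUNT == 0:
        t_index = second - T_BASE
        if 0 < t_index < T_COUNT:               # LV + T -> LVT syllable
            return first + t_index
    return None
```

As the quoted messages note, these two functions only plug into the general UAX #15 machinery; the recursive application of UnicodeData.txt mappings and the Section 3.11 canonical reordering still come from the rest of the algorithm.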
