Normative data files (Re: Normalization of Hangul)

Harald Alvestrand harald at alvestrand.no
Thu Feb 21 11:10:47 CET 2008


Kent Karlsson wrote:
> Yangwoo Ko wrote:
>   
>> As described in Section 3.12 of the Unicode Standard, Hangul syllable
>> code points are obtained by indexing through a 3-dimensional table, and
>> decomposition is just the reverse of that operation. I don't know how to
>> describe that process by an algorithm other than the one that is
>> given in UAX #15.
>>     
>
> Hangul syllable canonical decompositions can be handled
> **like all other canonical decompositions** by using
> a table of 11172 entries that begins like this:
>
> AC00; 1100 1161 # HANGUL SYLLABLE GA
> AC01; AC00 11A8 # HANGUL SYLLABLE GAG
> AC02; AC00 11A9 # HANGUL SYLLABLE GAGG
> AC03; AC00 11AA # HANGUL SYLLABLE GAGS
> AC04; AC00 11AB # HANGUL SYLLABLE GAN
> ...
> And the last entry is:
> D7A3; D788 11C2 # HANGUL SYLLABLE HIH
>
> Using arithmetic is just an optimisation of that table.
>   
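
For illustration, the relationship between the table and the arithmetic can be sketched in code. This is a minimal sketch, not an official Unicode data file or tool: it assumes the Hangul constants given in Section 3.12 of the Unicode Standard (SBase = AC00, LBase = 1100, VBase = 1161, TBase = 11A7, with 19 x 21 x 28 combinations), and the function names are hypothetical. It emits entries of the same shape as the ones quoted above, without the character names.

```python
# Constants as given in Section 3.12 of the Unicode Standard (assumed here).
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 precomposed syllables in total


def hangul_decomposition_mapping(cp):
    """Return the one-step canonical decomposition of a Hangul syllable.

    LV syllables map to <L, V>; LVT syllables map to <LV, T>, which is
    the shape of the table entries quoted above. (Hypothetical helper.)
    """
    s_index = cp - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable: U+%04X" % cp)
    t_index = s_index % T_COUNT
    if t_index == 0:
        # No trailing consonant: split into leading consonant and vowel.
        lead = L_BASE + s_index // N_COUNT
        vowel = V_BASE + (s_index % N_COUNT) // T_COUNT
        return (lead, vowel)
    # Trailing consonant present: split into the LV syllable and T.
    return (cp - t_index, T_BASE + t_index)


def table_lines():
    """Yield 'XXXX; YYYY ZZZZ' lines for every precomposed syllable."""
    for cp in range(S_BASE, S_BASE + S_COUNT):
        first, second = hangul_decomposition_mapping(cp)
        yield "%04X; %04X %04X" % (cp, first, second)
```

Under these assumptions, iterating over table_lines() yields one line per syllable, starting at AC00 and ending at D7A3, so the table and the arithmetic carry the same information.
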
Kent,

are you referring to an existing data file, released as part of Unicode, 
or are you referring to an imaginary data file that could have been 
created by someone who understands how the Hangul encoding works?

The only place I can find anything like this is in NormalizationTest.txt.

As Ken's pointed out, it's important to know which parts of the Unicode 
specification are normative and which are "merely" showing the result of 
applying some function (whose description is normative) to data from 
some other pieces of the specification.

And I'm very far from sure I know which are which, since Ken (if I 
understood him rightly) insists that even "DerivedCoreProperties.txt" is 
normative, while the algorithms used to produce it aren't.

Ken: The "5.0.0/ucd" directory at the Unicode FTP server contains the 
following files:

-r--r--r--   1 ftp      ftp         11232 Jul 14  2006 ArabicShaping.txt
-r--r--r--   1 ftp      ftp         23759 Feb 17  2006 BidiMirroring.txt
-r--r--r--   1 ftp      ftp          5455 Feb 15  2006 Blocks.txt
-r--r--r--   1 ftp      ftp         58956 Mar  7  2006 CaseFolding.txt
-r--r--r--   1 ftp      ftp          8085 May 23  2006 CompositionExclusions.txt
-r--r--r--   1 ftp      ftp         58689 Jul 15  2006 DerivedAge.txt
-r--r--r--   1 ftp      ftp        416587 Mar  7  2006 DerivedCoreProperties.txt
-r--r--r--   1 ftp      ftp        203964 Jun  7  2006 DerivedNormalizationProps.txt
-r--r--r--   1 ftp      ftp           549 Feb 23  2006 DerivedProperties.html
-r--r--r--   1 ftp      ftp        654958 Feb 15  2006 EastAsianWidth.txt
-r--r--r--   1 ftp      ftp         51026 Mar 10  2006 HangulSyllableType.txt
-r--r--r--   1 ftp      ftp        146815 Jul 11  2006 Index.txt
-r--r--r--   1 ftp      ftp          3205 Jul 14  2006 Jamo.txt
-r--r--r--   1 ftp      ftp        708401 May 24  2006 LineBreak.txt
-r--r--r--   1 ftp      ftp          1088 May 25  2006 NameAliases.txt
-r--r--r--   1 ftp      ftp          3911 May 23  2006 NamedSequences.txt
-r--r--r--   1 ftp      ftp          3330 May 23  2006 NamedSequencesProv.txt
-r--r--r--   1 ftp      ftp         17424 Jul 13  2006 NamesList.html
-r--r--r--   1 ftp      ftp        870186 Jul  5  2006 NamesList.txt
-r--r--r--   1 ftp      ftp          2033 Jul 14  2006 NormalizationCorrections.txt
-r--r--r--   1 ftp      ftp       2146124 Jun  7  2006 NormalizationTest.txt
-r--r--r--   1 ftp      ftp           549 Feb 23  2006 PropList.html
-r--r--r--   1 ftp      ftp         73598 Jun  8  2006 PropList.txt
-r--r--r--   1 ftp      ftp          5192 Mar  7  2006 PropertyAliases.txt
-r--r--r--   1 ftp      ftp         17127 Mar  7  2006 PropertyValueAliases.txt
-r--r--r--   1 ftp      ftp           362 Jul 14  2006 ReadMe.txt
-r--r--r--   1 ftp      ftp         95962 Mar 10  2006 Scripts.txt
-r--r--r--   1 ftp      ftp         15590 Mar  7  2006 SpecialCasing.txt
-r--r--r--   1 ftp      ftp         33913 Jul 14  2006 StandardizedVariants.html
-r--r--r--   1 ftp      ftp          7187 Jan 17  2006 StandardizedVariants.txt
-r--r--r--   1 ftp      ftp        123344 Jul 15  2006 UCD.html
-r--r--r--   1 ftp      ftp           549 Feb 23  2006 UnicodeCharacterDatabase.html
-r--r--r--   1 ftp      ftp       1038607 May 23  2006 UnicodeData.txt
-r--r--r--   1 ftp      ftp        112176 Jul 11  2006 Unihan.html
-r--r--r--   1 ftp      ftp      28897863 Jul 10  2006 Unihan.txt
-r--r--r--   1 ftp      ftp       6107283 Jul 10  2006 Unihan.zip

Directory "auxiliary":

-r--r--r--   1 ftp      ftp         64084 Mar 10  2006 GraphemeBreakProperty.txt
-r--r--r--   1 ftp      ftp          8160 Jun 14  2006 GraphemeBreakTest.html
-r--r--r--   1 ftp      ftp         10850 Jun 13  2006 GraphemeBreakTest.txt
-r--r--r--   1 ftp      ftp        120390 Mar 10  2006 SentenceBreakProperty.txt
-r--r--r--   1 ftp      ftp         69762 Jun 14  2006 SentenceBreakTest.html
-r--r--r--   1 ftp      ftp         46342 Jun 13  2006 SentenceBreakTest.txt
-r--r--r--   1 ftp      ftp         38896 Jun  8  2006 WordBreakProperty.txt
-r--r--r--   1 ftp      ftp         32534 Jun 14  2006 WordBreakTest.html
-r--r--r--   1 ftp      ftp         75950 Jun 13  2006 WordBreakTest.txt

Directory "extracted":

-r--r--r--   1 ftp      ftp         93649 Mar 10  2006 DerivedBidiClass.txt
-r--r--r--   1 ftp      ftp         15928 Mar  3  2006 DerivedBinaryProperties.txt
-r--r--r--   1 ftp      ftp         98291 Mar 10  2006 DerivedCombiningClass.txt
-r--r--r--   1 ftp      ftp         68489 Jun  7  2006 DerivedDecompositionType.txt
-r--r--r--   1 ftp      ftp        100529 Mar 10  2006 DerivedEastAsianWidth.txt
-r--r--r--   1 ftp      ftp        167385 Mar  3  2006 DerivedGeneralCategory.txt
-r--r--r--   1 ftp      ftp         12504 Mar 10  2006 DerivedJoiningGroup.txt
-r--r--r--   1 ftp      ftp         15338 Mar 10  2006 DerivedJoiningType.txt
-r--r--r--   1 ftp      ftp        152964 Jun  7  2006 DerivedLineBreak.txt
-r--r--r--   1 ftp      ftp         11260 Mar 10  2006 DerivedNumericType.txt
-r--r--r--   1 ftp      ftp         59594 Mar  3  2006 DerivedNumericValues.txt

Can you please identify which ones of these are normative, and which 
ones (if any) are not?

(It would also be convenient if you could identify a resource for 
getting that answer without asking you....)

                     Harald


