Normative data files (Re: Normalization of Hangul)
Harald Alvestrand
harald at alvestrand.no
Thu Feb 21 11:10:47 CET 2008
Kent Karlsson wrote:
> Yangwoo Ko wrote:
>
>> As described in Section 3.12 of the Unicode Standard, Hangul syllable
>> code points are obtained by indexing through a 3-dimensional table, and
>> decomposition is just the reverse of that operation. I don't know how to
>> describe that process by an algorithm other than the one given in
>> UAX #15.
>>
>
> Hangul syllable canonical decompositions can be handled
> **like all other canonical decompositions** by using
> a table of 11172 entries that begins like this:
>
> AC00; 1100 1161 # HANGUL SYLLABLE GA
> AC01; AC00 11A8 # HANGUL SYLLABLE GAG
> AC02; AC00 11A9 # HANGUL SYLLABLE GAGG
> AC03; AC00 11AA # HANGUL SYLLABLE GAGS
> AC04; AC00 11AB # HANGUL SYLLABLE GAN
> ...
> And the last entry is:
> D7A3; D788 11C2 # HANGUL SYLLABLE HIH
>
> Using arithmetic is just an optimisation of that table.
>
Kent,
are you referring to an existing data file, released as part of Unicode,
or are you referring to an imaginary data file that could have been
created by someone who understands how the Hangul encoding works?
The only place I can find anything like this is in NormalizationTest.txt.
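To make concrete what such a table would contain: the sketch below generates entries in the form Kent quotes, using the constants (SBase, LBase, VBase, TBase and the L/V/T counts) from the Hangul arithmetic in Section 3.12 of the Unicode Standard. The function names are made up for illustration, and this is only a sketch of how such a file could be derived, not a claim that any such file is shipped with the UCD.

```python
# Constants from the Hangul syllable arithmetic in the Unicode Standard.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 precomposed syllables in total


def hangul_decomposition(cp):
    """Return the pairwise canonical decomposition mapping of a syllable.

    Hypothetical helper: an LVT syllable maps to (LV, T) and an
    LV syllable maps to (L, V), as in the quoted table.
    """
    s_index = cp - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError(f"U+{cp:04X} is not a precomposed Hangul syllable")
    t_index = s_index % T_COUNT
    if t_index:
        # LVT syllable: strip the trailing consonant to get the LV syllable.
        return (cp - t_index, T_BASE + t_index)
    # LV syllable: split into leading consonant and vowel.
    l_index = s_index // N_COUNT
    v_index = (s_index % N_COUNT) // T_COUNT
    return (L_BASE + l_index, V_BASE + v_index)


def table_lines():
    """Yield one 'XXXX; YYYY ZZZZ' line per Hangul syllable (names omitted)."""
    for cp in range(S_BASE, S_BASE + S_COUNT):
        first, second = hangul_decomposition(cp)
        yield f"{cp:04X}; {first:04X} {second:04X}"
```

Iterating over table_lines() gives 11172 lines, one per syllable, without the "# HANGUL SYLLABLE ..." name comments shown in the quoted excerpt.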
As Ken's pointed out, it's important to know which parts of the Unicode
specification are normative and which are "merely" showing the result of
applying some function (whose description is normative) to data from
some other pieces of the specification.
And I'm very far from sure I know which are which, since Ken (if I
understood him rightly) insists that even "DerivedCoreProperties.txt" is
normative, while the algorithms used to produce it are not.
Ken: The "5.0.0/ucd" directory at the Unicode FTP server contains the
following files:
-r--r--r-- 1 ftp ftp 11232 Jul 14 2006 ArabicShaping.txt
-r--r--r-- 1 ftp ftp 23759 Feb 17 2006 BidiMirroring.txt
-r--r--r-- 1 ftp ftp 5455 Feb 15 2006 Blocks.txt
-r--r--r-- 1 ftp ftp 58956 Mar 7 2006 CaseFolding.txt
-r--r--r-- 1 ftp ftp 8085 May 23 2006 CompositionExclusions.txt
-r--r--r-- 1 ftp ftp 58689 Jul 15 2006 DerivedAge.txt
-r--r--r-- 1 ftp ftp 416587 Mar 7 2006 DerivedCoreProperties.txt
-r--r--r-- 1 ftp ftp 203964 Jun 7 2006 DerivedNormalizationProps.txt
-r--r--r-- 1 ftp ftp 549 Feb 23 2006 DerivedProperties.html
-r--r--r-- 1 ftp ftp 654958 Feb 15 2006 EastAsianWidth.txt
-r--r--r-- 1 ftp ftp 51026 Mar 10 2006 HangulSyllableType.txt
-r--r--r-- 1 ftp ftp 146815 Jul 11 2006 Index.txt
-r--r--r-- 1 ftp ftp 3205 Jul 14 2006 Jamo.txt
-r--r--r-- 1 ftp ftp 708401 May 24 2006 LineBreak.txt
-r--r--r-- 1 ftp ftp 1088 May 25 2006 NameAliases.txt
-r--r--r-- 1 ftp ftp 3911 May 23 2006 NamedSequences.txt
-r--r--r-- 1 ftp ftp 3330 May 23 2006 NamedSequencesProv.txt
-r--r--r-- 1 ftp ftp 17424 Jul 13 2006 NamesList.html
-r--r--r-- 1 ftp ftp 870186 Jul 5 2006 NamesList.txt
-r--r--r-- 1 ftp ftp 2033 Jul 14 2006 NormalizationCorrections.txt
-r--r--r-- 1 ftp ftp 2146124 Jun 7 2006 NormalizationTest.txt
-r--r--r-- 1 ftp ftp 549 Feb 23 2006 PropList.html
-r--r--r-- 1 ftp ftp 73598 Jun 8 2006 PropList.txt
-r--r--r-- 1 ftp ftp 5192 Mar 7 2006 PropertyAliases.txt
-r--r--r-- 1 ftp ftp 17127 Mar 7 2006 PropertyValueAliases.txt
-r--r--r-- 1 ftp ftp 362 Jul 14 2006 ReadMe.txt
-r--r--r-- 1 ftp ftp 95962 Mar 10 2006 Scripts.txt
-r--r--r-- 1 ftp ftp 15590 Mar 7 2006 SpecialCasing.txt
-r--r--r-- 1 ftp ftp 33913 Jul 14 2006 StandardizedVariants.html
-r--r--r-- 1 ftp ftp 7187 Jan 17 2006 StandardizedVariants.txt
-r--r--r-- 1 ftp ftp 123344 Jul 15 2006 UCD.html
-r--r--r-- 1 ftp ftp 549 Feb 23 2006 UnicodeCharacterDatabase.html
-r--r--r-- 1 ftp ftp 1038607 May 23 2006 UnicodeData.txt
-r--r--r-- 1 ftp ftp 112176 Jul 11 2006 Unihan.html
-r--r--r-- 1 ftp ftp 28897863 Jul 10 2006 Unihan.txt
-r--r--r-- 1 ftp ftp 6107283 Jul 10 2006 Unihan.zip
Directory "auxiliary":
-r--r--r-- 1 ftp ftp 64084 Mar 10 2006 GraphemeBreakProperty.txt
-r--r--r-- 1 ftp ftp 8160 Jun 14 2006 GraphemeBreakTest.html
-r--r--r-- 1 ftp ftp 10850 Jun 13 2006 GraphemeBreakTest.txt
-r--r--r-- 1 ftp ftp 120390 Mar 10 2006 SentenceBreakProperty.txt
-r--r--r-- 1 ftp ftp 69762 Jun 14 2006 SentenceBreakTest.html
-r--r--r-- 1 ftp ftp 46342 Jun 13 2006 SentenceBreakTest.txt
-r--r--r-- 1 ftp ftp 38896 Jun 8 2006 WordBreakProperty.txt
-r--r--r-- 1 ftp ftp 32534 Jun 14 2006 WordBreakTest.html
-r--r--r-- 1 ftp ftp 75950 Jun 13 2006 WordBreakTest.txt
Directory "extracted":
-r--r--r-- 1 ftp ftp 93649 Mar 10 2006 DerivedBidiClass.txt
-r--r--r-- 1 ftp ftp 15928 Mar 3 2006 DerivedBinaryProperties.txt
-r--r--r-- 1 ftp ftp 98291 Mar 10 2006 DerivedCombiningClass.txt
-r--r--r-- 1 ftp ftp 68489 Jun 7 2006 DerivedDecompositionType.txt
-r--r--r-- 1 ftp ftp 100529 Mar 10 2006 DerivedEastAsianWidth.txt
-r--r--r-- 1 ftp ftp 167385 Mar 3 2006 DerivedGeneralCategory.txt
-r--r--r-- 1 ftp ftp 12504 Mar 10 2006 DerivedJoiningGroup.txt
-r--r--r-- 1 ftp ftp 15338 Mar 10 2006 DerivedJoiningType.txt
-r--r--r-- 1 ftp ftp 152964 Jun 7 2006 DerivedLineBreak.txt
-r--r--r-- 1 ftp ftp 11260 Mar 10 2006 DerivedNumericType.txt
-r--r--r-- 1 ftp ftp 59594 Mar 3 2006 DerivedNumericValues.txt
Can you please identify which ones of these are normative, and which
ones (if any) are not?
(It would also be convenient if you could identify a resource for
getting that answer without asking you....)
Harald