Jamo [RE: Consensus Call Tranche 8 (Character Adjustments)]

Tue Oct 21 05:56:52 CEST 2008

Dear Prof. Kim,

Many thanks for your mail. It's very important to have direct
communication, especially given the way the IETF works.

At 09:20 08/10/17, k kim wrote:
>> Mark Davis <mark at macchiato.com> Wed, Oct 15, 2008 at 5:56 AM
> 
>> That is, each of the Hangul precomposed syllables decomposes into one or two
> 
>one or two (wrong)---> two or three (correct) (Am I missing something here?)

No, I think you're completely right; this was a simple oversight by Mark.
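
The count of two or three follows from the arithmetic that the
Unicode Standard gives for decomposing Hangul syllables. What follows
is only a minimal sketch in Python; the constant and function names
are chosen here for illustration and are not from the quoted mails.

```python
# Constants for the Hangul syllable block and the conjoining jamo.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per initial consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 precomposed syllables in total


def decompose_syllable(cp):
    """Return the two or three jamo code points for a precomposed syllable."""
    s_index = cp - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable: U+%04X" % cp)
    jamo = [
        L_BASE + s_index // N_COUNT,              # initial consonant
        V_BASE + (s_index % N_COUNT) // T_COUNT,  # medial vowel
    ]
    t_index = s_index % T_COUNT
    if t_index:                                   # final consonant is optional
        jamo.append(T_BASE + t_index)
    return jamo


print([hex(j) for j in decompose_syllable(0xAC00)])  # two jamo
print([hex(j) for j in decompose_syllable(0xD7A3)])  # three jamo
```

The result always has an initial consonant and a medial vowel, and a
final consonant only when the trailing index is non-zero, so there is
never a one-jamo decomposition.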

>> combining jamo under NFD, and under NFC that sequence of combining jamo composes 
>> back into that syllable. The comparisons *do* work correctly, 
>> since IDNA labels have to be in NFC.
> 
>- Well, I would have to disagree with you. 
>Let me explain why the above claim is not correct.
> 
>- According to UCS (ISO/IEC 10646),

Can you please tell us which version/section,...?
Ideally, pointing to a copy of the actual text,
or putting it in your mail, is best.

Michel Suignard has said he doesn't know about it,
and as he is the editor of ISO/IEC 10646, he might
be right.

[The only copy of ISO/IEC 10646 I have is a Japanese translation
(JIS X 0221) of the 1993 version, which mentions decomposition
of Hangul syllables into initial consonant, medial vowel,
and optional final consonant Jamos in clause 24. The Japanese
language does not distinguish between singular and plural,
so it is not clear from the Japanese translation whether the
English original uses language such as "an initial consonant
Jamo" (which would support Michel's position) or
"some initial consonant Jamos" (which would support Prof.
Kim's position). Anyway, the current version of ISO/IEC
10646 may say more and different things.]

>each of the following three can represent 
>Hangul syllable GGA:
>
>1) UAC01 (GGA)
>
>2) U1101 (GG), U1161 (A)
>
>3) U1100 (G), U1100 (G), U1161 (A)
>
> - By NFC, 2) U1101 (GG), U1161 (A) will be changed to 1) UAC01 (GGA);
> - However, by NFC, 3) U1100 (G), U1100 (G), U1161 (A) will be changed to 
>U1100 (G), UAC00 (GA), which is "different" from 1) UAC01.
> 
> > The comparisons *do* work correctly, 
> 
>- ??? Isn't it considered comparison failure? (Am I missing something here?)
>- As we saw, NFC/NFD does not work correctly even for modern Hangul,
>(not to mention Old Hangul)!
>
>------------
>
>Note. For example, in XML 1.0 (fourth ed), 2) U1101 (GG), U1161 (A) is NOT allowed since #x1101 is not included; in contrast, 3) U1100 (G), U1100 (G), U1161 (A) IS allowed.
>
>(source: http://www.w3.org/TR/2006/REC-xml-20060816/#NT-BaseChar)
>
>.#x1100 | [#x1102-#x1103] | [#x1105-#x1107] | #x1109 | [#x110B-#x110C] | 
>
>[#x110E-#x1112] | #x113C | #x113E | #x1140 | #x114C | #x114E | #x1150 | 
>
>[#x1154-#x1155] | #x1159 | [#x115F-#x1161] | #x1163 | #x1165 | #x1167 | #x1169 |
>
>[#x116D-#x116E] | [#x1172-#x1173] | #x1175 | #x119E | #x11A8 | #x11AB | 
>
>[#x11AE-#x11AF] | [#x11B7-#x11B8] | #x11BA | [#x11BC-#x11C2] | #x11EB | #x11F0 | 
>
>#x11F9 | ..

I was at first quite surprised to see this. Tracing this back,
we find that the explanations in the XML 1.0 REC contain the following:
* Characters which have a font or compatibility decomposition
  (i.e. those with a "compatibility formatting tag" in field 5 of the
   database -- marked by field 5 beginning with a "<") are not allowed.
The base for this is Unicode 2.0. If we look at 
http://www.unicode.org/Public/2.0-Update/UnicodeData-2.0.14.txt,
we indeed find entries such as:
1101;HANGUL CHOSEONG SSANGKIYEOK;Lo;0;L;<compat> 1100 1100;;;;N;;;;;
which explains the holes in the above list.
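
The rule quoted above can be applied mechanically to a UnicodeData
line. The following Python sketch does that for the 2.0.14 entry shown
above. The function names are hypothetical, and the second sample line
is an illustration built by emptying field 5, not a line copied from a
later data file.

```python
def parse_unicode_data_line(line):
    """Split one UnicodeData line into code point, name, tag, and mapping."""
    fields = line.rstrip("\n").split(";")
    code_point = int(fields[0], 16)
    name = fields[1]
    parts = fields[5].split()          # field 5: decomposition
    tag = None
    if parts and parts[0].startswith("<"):
        tag = parts[0]                 # e.g. "<compat>"
        parts = parts[1:]
    mapping = [int(p, 16) for p in parts]
    return code_point, name, tag, mapping


def passes_compat_rule(line):
    """Model only the one XML 1.0 rule quoted above: field 5 must not begin with '<'."""
    return not line.split(";")[5].startswith("<")


# Entry as found in UnicodeData-2.0.14.txt (quoted above).
V2_0_14 = "1101;HANGUL CHOSEONG SSANGKIYEOK;Lo;0;L;<compat> 1100 1100;;;;N;;;;;"
# Illustration only: the same entry with field 5 emptied.
WITHOUT_COMPAT = V2_0_14.replace("<compat> 1100 1100", "")

print(parse_unicode_data_line(V2_0_14))
print(passes_compat_rule(V2_0_14), passes_compat_rule(WITHOUT_COMPAT))
```

This covers only the compatibility-tag rule; the other rules that XML
1.0 used to derive its name character classes are not modelled here.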

The motivation for this, from the XML side, was a strong aversion
to compatibility characters. Unfortunately, in this case,
I think the creators of XML got this wrong, because the Unicode
Standard used and uses compatibility mappings/decompositions for
many different purposes, of which Korean Jamos were a very special
case (see below).

It should be noted that the 5th edition of XML 1.0
(http://www.w3.org/TR/2008/PER-xml-20080205/#NT-NameStartChar),
currently a Proposed Edited Recommendation, simplifies the
restrictions on name characters along the same lines as
XML 1.1 did earlier. So this problem is being taken care of.
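
To see the effect on the Jamo in question, the Python sketch below
compares two partial tables: the Hangul Jamo part of the 4th-edition
BaseChar production (only the ranges quoted in the mail above), and
the single 5th-edition NameStartChar range [#x37F-#x1FFF], which is
the one that covers the Jamo block. Neither table is a complete
production, and the names are illustrative.

```python
# Hangul Jamo portion of BaseChar, XML 1.0 4th edition, as quoted above.
BASECHAR_JAMO_4E = [
    (0x1100, 0x1100), (0x1102, 0x1103), (0x1105, 0x1107), (0x1109, 0x1109),
    (0x110B, 0x110C), (0x110E, 0x1112), (0x113C, 0x113C), (0x113E, 0x113E),
    (0x1140, 0x1140), (0x114C, 0x114C), (0x114E, 0x114E), (0x1150, 0x1150),
    (0x1154, 0x1155), (0x1159, 0x1159), (0x115F, 0x1161), (0x1163, 0x1163),
    (0x1165, 0x1165), (0x1167, 0x1167), (0x1169, 0x1169), (0x116D, 0x116E),
    (0x1172, 0x1173), (0x1175, 0x1175), (0x119E, 0x119E), (0x11A8, 0x11A8),
    (0x11AB, 0x11AB), (0x11AE, 0x11AF), (0x11B7, 0x11B8), (0x11BA, 0x11BA),
    (0x11BC, 0x11C2), (0x11EB, 0x11EB), (0x11F0, 0x11F0), (0x11F9, 0x11F9),
]

# The one NameStartChar range of the 5th edition that covers the Jamo block.
NAMESTARTCHAR_5E_EXCERPT = [(0x37F, 0x1FFF)]


def in_ranges(cp, ranges):
    """True if cp falls inside any inclusive (low, high) range."""
    return any(low <= cp <= high for low, high in ranges)


for cp in (0x1100, 0x1101, 0x1161):
    print("U+%04X" % cp,
          in_ranges(cp, BASECHAR_JAMO_4E),
          in_ranges(cp, NAMESTARTCHAR_5E_EXCERPT))
```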

Let's look at the history of the entries for characters such as
U+1101, HANGUL CHOSEONG SSANGKIYEOK in the Unicode database.
The "<compat> 1100 1100" field is still present in 
http://www.unicode.org/Public/2.1-Update3/UnicodeData-2.1.8.txt
(December, 1998), but is gone in
http://www.unicode.org/Public/2.1-Update4/UnicodeData-2.1.9.txt
(April, 1999).
The modification history at the end of
http://www.unicode.org/Public/2.1-Update4/ReadMe-2.1.9.txt contains
* Removed <compat> decompositions from the conjoining jamo block:
  U+1100..U+11F8.

These Jamo <compat> decompositions were very special. In
general, any kind of compatibility decomposition (whether just
with a <compat> flag, or a more specific flag) maps from some
compatibility-like character to non-compatibility characters.
Those non-compatibility characters might then be decomposed
further using canonical decompositions.
The Jamo <compat> decompositions in Versions 2.0 to 2.1.8
of Unicode, however, sat below the canonical decompositions
of Hangul syllables into (as correctly pointed out) two or
three Jamo: they applied to the Jamo that the canonical
decomposition produces. They seem to have confused many
people at the time and caused problems (of which XML 1.0 is
an example), so they were removed.
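
The three sequences from Prof. Kim's mail can be checked against a
current normalization implementation. The sketch below uses Python's
standard unicodedata module, whose data reflects whatever Unicode
version the interpreter ships with, not 2.0 or 2.1. One thing to flag:
the precomposed syllable used here is U+AE4C (HANGUL SYLLABLE GGA),
which is what U+1101 U+1161 composes to; U+AC01, as written in the
quoted mail, carries the name HANGUL SYLLABLE GAG. The helper names
are illustrative.

```python
import unicodedata


def nfc(s):
    return unicodedata.normalize("NFC", s)


def nfd(s):
    return unicodedata.normalize("NFD", s)


def codepoints(s):
    """Render a string as U+XXXX labels for easy comparison."""
    return ["U+%04X" % ord(ch) for ch in s]


GGA = "\uAE4C"                # 1) precomposed syllable (see note above)
SEQ2 = "\u1101\u1161"         # 2) GG, A
SEQ3 = "\u1100\u1100\u1161"   # 3) G, G, A

print(codepoints(nfd(GGA)))    # ['U+1101', 'U+1161']
print(codepoints(nfc(SEQ2)))   # ['U+AE4C']
print(codepoints(nfc(SEQ3)))   # ['U+1100', 'U+AC00']

# With the <compat> mapping removed, U+1101 has no decomposition at all.
print(repr(unicodedata.decomposition("\u1101")))
```

Sequence 3 therefore stays distinct from the precomposed syllable
under NFC, as Prof. Kim says, and with the <compat> mapping gone the
compatibility forms do not relate U+1101 to U+1100 U+1100 either.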

Regards,    Martin.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst at it.aoyama.ac.jp