Consensus Call Tranche 8 Summary - Addendum

Mon Oct 20 12:25:54 CEST 2008

On 20 okt 2008, at 11.57, Mark Davis wrote:

> I think this subject has been seriously muddled by misinformation,
> thus causing those who are not completely familiar with the way that
> Hangul is encoded in Unicode to be misled.

What confuses me is that the Koreans say there is a problem, and you  
say there is not a problem.

Specifically, I am confused by this first sentence of yours. No, I am  
not completely familiar with Hangul as I am not Korean, so of course I  
might be misled.

But the people from Korea are people I have to trust as they are the  
ones that use the language. And I take for granted you do not imply  
the people from Korea are misled on how Hangul is encoded in Unicode?

They say there is a problem. You say there is not any problem.

How do we move forward?

    Patrik

> The proposed label step:
>
> toNFKC(toCaseFolded(toNFKC(label))) != label
>
>
> is pointless, since case folding doesn't have any effect on Korean
> characters (jamo or syllables) and any label is guaranteed to be in
> NFKC format anyway, due to other provisions in Tables and Protocol. The
> compatibility jamos are also not in question, since they are also not
> in NFKC.
>
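For readers who want to try the label step quoted above, here is a minimal Python sketch. It assumes that the standard library's unicodedata.normalize and str.casefold are acceptable stand-ins for toNFKC and toCaseFolded; the helper names stable_form and is_stable are hypothetical and do not come from any IDNA draft.

```python
import unicodedata


def stable_form(label: str) -> str:
    """Compute toNFKC(toCaseFolded(toNFKC(label))) with stdlib stand-ins."""
    nfkc = unicodedata.normalize("NFKC", label)
    # Assumption: str.casefold() approximates the toCaseFolded operation.
    return unicodedata.normalize("NFKC", nfkc.casefold())


def is_stable(label: str) -> bool:
    """True when the label is unchanged by the three-step mapping."""
    return stable_form(label) == label


print(is_stable("\uD55C\uAE00"))        # precomposed syllables: True
print(is_stable("\u1112\u1161\u11AB"))  # conjoining jamo that compose: False
print(is_stable("\u3131"))              # compatibility jamo: False
```

Under these assumptions, precomposed syllables pass unchanged, while composable conjoining jamo and compatibility jamo are altered by NFKC alone. Case folding plays no part for these characters.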
> The characters in question are all and only the following:
>
> The conjoining Jamos.
>
> U+1100 <http://unicode.org/cldr/utility/character.jsp?a=1100> HANGUL
> CHOSEONG KIYEOK
> …{88}…U+1159 <http://unicode.org/cldr/utility/character.jsp?a=1159>
> HANGUL CHOSEONG YEORINHIEUH
> U+1161 <http://unicode.org/cldr/utility/character.jsp?a=1161> HANGUL
> JUNGSEONG A
> …{64}…U+11A2 <http://unicode.org/cldr/utility/character.jsp?a=11A2>
> HANGUL JUNGSEONG SSANGARAEA
> U+11A8 <http://unicode.org/cldr/utility/character.jsp?a=11A8> HANGUL
> JONGSEONG KIYEOK
> …{80}…U+11F9 <http://unicode.org/cldr/utility/character.jsp?a=11F9>
> HANGUL JONGSEONG YEORINHIEUH
>
>
> The Hangul Syllables.
>
> U+AC00 <http://unicode.org/cldr/utility/character.jsp?a=AC00> ( 가 )
> HANGUL SYLLABLE GA
> …{11170}…U+D7A3 <http://unicode.org/cldr/utility/character.jsp?a=D7A3>
> ( 힣 ) HANGUL SYLLABLE HIH
>
>
> Any sequence of Jamo that could correspond to a Hangul Syllable
> according to the Unicode Standard canonical equivalence is transformed
> into it by the toNFKC function. Thus a sequence of Jamo that could
> correspond to a HS according to the Unicode Standard *cannot* be in an
> IDNA2008 label according to Tables and Protocol.
>
> That is what is meant by saying that there is no comparison problem.
> Anything that is equivalent to a HS according to TUS will already be a
> HS in an IDNA2008 label according to Tables and Protocol.
>
> Now, one could have a contextual rule that forbade Jamo in situations
> where they could not be part of a valid syllable, and if people really
> wanted that we could do it.
>
> But frankly, it is not worth the effort. Unlike the case of the ZW
> joiners, these are not invisible characters; the worst that would
> happen is someone would see nonsense on the screen -- but it is not our
> place to try to forbid nonsensical labels.
>
> This is clearly a case where the Korean NIC is free to narrow the set
> of labels they accept to exclude non-modern characters, just as the
> German NIC is free to exclude archaic German characters, or the British
> NIC free to exclude archaic English characters (like Þ or ð).
>
> Mark
>
>
> On Mon, Oct 20, 2008 at 11:10 AM, Vint Cerf <vint at google.com> wrote:
>
>> Consensus Call Tranche 8 (Character Adjustments) - Addendum
>>
>> I neglected to summarize a number of messages relating to the JAMO
>> discussion (they had subject fields that were specific to the JAMO
>> discussion and did not appear when all the email was sorted with the
>> original subject of the consensus call)
>>
>> As a result, the polling actually produced 9 YES and 8 NO - still
>> clearly no final consensus.
>>
>> (8.c) Disallow conjoining Hangul jamo per recommendation from
>> KRNIC and others, permitting only precomposed syllables.
>>
>> COMMENTS:
>>
>> I agree with the line of thought that we really should not disregard
>> the results of the consensus position established by the most relevant
>> language community after a rather extensive consensus process, so in
>> general, I would side with the experts in Korea.
>>
>>
>> Nevertheless, having been through this discussion many times, I
>> understand that there are opinions otherwise and am hoping to make a
>> suggestion that could reconcile the lines of thought and be consistent
>> with our architecture.  When we last discussed the issue of conjoining
>> Hangul Jamo, I had suggested exploring the possibility of addressing
>> them in the following manner:
>>
>>
>> 1. categorize all Hangul Jamo as CONTEXTO
>> 2. add a stability contextual rule for these codepoints, where the
>> following must be true:
>> toNFKC(toCaseFolded(toNFKC(label))) != label
>>
>>
>>
>>
>> I am not familiar enough with Korean, but this might strike a graceful
>> balance between disallowing conjoining jamo that form a modern Hangul
>> syllable and continuing to allow archaic Jamo, without creating too
>> much confusion?...
>>
>>
>> If I recall correctly, there was a response that it seemed
>> interesting, but it was not further discussed.  Do people think it
>> might be a viable approach to resolve the issue?
>> ================
>>
>> As I understand it, and I agree, it might not solve all the issues (as
>> it stands, still thinking), but it does solve 2 types of issues:
>>
>> 1. combination of modern Jamos that do combine to a Hangul syllable,
>> e.g.:
>>
>> U+1109;U+1161;U+11BC  =>  U+C0C1
>>
>> In this case, the use of <U+1109;U+1161;U+11BC> would effectively be
>> disallowed.
>>
>>
>> 2. combination of modern Jamos with old Jamos which combine to 1
>> Hangul syllable and 1 old Jamo, e.g.:
>>
>> U+1109;U+1161;U+11F0  =>  U+C0AC;U+11F0
>>
>> In this case, also, the use of <U+1109;U+1161;U+11F0> would be
>> effectively disallowed.
>>
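The two example sequences above can be checked with a short Python sketch. It assumes that unicodedata.normalize with the "NFKC" form matches the toNFKC operation under discussion; the helper name nfkc_codepoints is hypothetical.

```python
import unicodedata


def nfkc_codepoints(text: str) -> str:
    """Return the NFKC form of text as a space-separated list of code points."""
    normalized = unicodedata.normalize("NFKC", text)
    return " ".join(f"U+{ord(ch):04X}" for ch in normalized)


# Case 1: three modern jamo that compose into one syllable.
print(nfkc_codepoints("\u1109\u1161\u11BC"))  # U+C0C1

# Case 2: two modern jamo followed by an old final jamo.
print(nfkc_codepoints("\u1109\u1161\u11F0"))  # U+C0AC U+11F0
```

In both cases the NFKC result differs from the input, which is why such jamo sequences would fail a check that requires the label to be unchanged by normalization.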
>> It seems to me, if we are going to not disallow jamos, this would at
>> least be a measure to avoid some of the most obvious problems in the
>> context of IDN.
>>
>> The cases where no combination happens under KC are the cases which
>> would need further investigation.  It may be possible to add additional
>> rules based on the algorithms for displaying Hangul characters....?...
>>
>> =======
>>
>>> Mark Davis <mark at macchiato.com> Wed, Oct 15, 2008 at 5:56 AM
>>
>>> That is, each of the Hangul precomposed syllables decomposes into
>>> one or two
>>
>> one or two (wrong) ---> two or three (correct) (Am I missing something
>> here?)
>>
>>> combining jamo under NFD, and under NFC that sequence of combining
>>> jamo composes back into that syllable. The comparisons *do* work
>>> correctly, since IDNA labels have to be in NFC.
>>
>>
>> - Well, I would have to disagree with you.
>> Let me explain why the above claim is not correct.
>>
>>
>> - According to UCS (ISO/IEC 10646), each of the following three can
>> represent Hangul syllable GGA:
>>
>> 1) UAE4C (GGA)
>>
>> 2) U1101 (GG), U1161 (A)
>>
>> 3) U1100 (G), U1100 (G), U1161 (A)
>>
>> - By NFC, 2) U1101 (GG), U1161 (A) will be changed to 1) UAE4C (GGA);
>> - However, by NFC, 3) U1100 (G), U1100 (G), U1161 (A) will be changed
>> to U1100 (G), UAC00 (GA), which is "different" from 1) UAE4C.
>>
>>
>>> The comparisons *do* work correctly,
>>
>>
>> - ??? Isn't this considered a comparison failure? (Am I missing
>> something here?)
>> - As we saw, NFC/NFD does not work correctly even for modern Hangul,
>> (not to mention Old Hangul)!
>>
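The three spellings of GGA can be compared under NFC with the following sketch, which assumes Python's unicodedata module as the normalizer. Note that the precomposed syllable GGA (까) is U+AE4C. The names FORMS and nfc_codepoints are hypothetical.

```python
import unicodedata

# Three candidate spellings of the syllable GGA.
FORMS = {
    "precomposed": "\uAE4C",          # HANGUL SYLLABLE GGA
    "GG + A": "\u1101\u1161",         # SSANGKIYEOK, A
    "G + G + A": "\u1100\u1100\u1161",  # KIYEOK, KIYEOK, A
}


def nfc_codepoints(text: str) -> str:
    """Return the NFC form of text as a space-separated list of code points."""
    normalized = unicodedata.normalize("NFC", text)
    return " ".join(f"U+{ord(ch):04X}" for ch in normalized)


for name, text in FORMS.items():
    print(f"{name:12} -> {nfc_codepoints(text)}")
# precomposed  -> U+AE4C
# GG + A       -> U+AE4C
# G + G + A    -> U+1100 U+AC00
```

The first two spellings compare equal after NFC; the third does not, because canonical composition only joins one leading consonant with the vowel that follows it.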
>> [comment by another WG member:
>>
>> This is indeed the correct analysis. I find it very unfortunate
>> that U1101 (GG) does not have a *canonical* decomposition mapping
>> to <U1100 (G), U1100 (G)> (etc. for all the other multi-letter
>> Hangul Jamos). The Hangul script does NOT have a primitive Jamo
>> GG. The Hangul GG is, by design, composed of two G Jamos, just
>> like Latin GG is composed of two G letters.]
>>
>> [comment by another WG member:
>> Reading through section 3.12 of Unicode 5.0 is somewhat confusing,
>> because it tries to be very, very general for determining syllable
>> boundaries (virtually everything goes, as long as you can somehow
>> imagine that you might make a Korean syllable block out of it,
>> even if no such block ever has been made), whereas the descriptions
>> for canonical composition and decomposition are quite limited
>> (one block <=> two or three Jamo, depending on whether there is
>> a final consonant (group) or not). As an example, the sequence
>>
>> U+1101 (GG) U+1100 (G) U+1100 (G) U+1100 (G) U+1161 (A),
>> summarily written GGGGGA, would be a "Standard Korean syllable
>> block", too, the same way we would probably expect GGGGGA not to
>> be broken up by a hyphenation algorithm, whether it looks totally
>> silly (and in the Korean case, there's no way to display it as
>> a reasonably-looking syllable block) or not.]
>>
>> [comment by another WG member:
>> Well, it's very easy to take this position indeed. It's also
>> possible to take the position that U+110F is the result of
>> adding a stroke to U+1100 (the equivalent, although in this day
>> and age much less clear, example would be that G is just a
>> Latin (in the true sense of the old Romans) C with a stroke or
>> hook added). The Korean script is so well designed that it's
>> difficult to know where to stop these decompositions.]
>>
>> ------------
>>
>> Note. For example, in XML 1.0 (fourth ed), 2) U1101 (GG), U1161 (A) is
>> NOT allowed since #x1101 is not included; in contrast, 3) U1100 (G),
>> U1100 (G), U1161 (A) IS allowed.
>>
>> (source: http://www.w3.org/TR/2006/REC-xml-20060816/#NT-BaseChar)
>>
>> .. | #x1100 | [#x1102-#x1103] | [#x1105-#x1107] | #x1109 |
>> [#x110B-#x110C] | [#x110E-#x1112] | #x113C | #x113E | #x1140 | #x114C |
>> #x114E | #x1150 | [#x1154-#x1155] | #x1159 | [#x115F-#x1161] | #x1163 |
>> #x1165 | #x1167 | #x1169 | [#x116D-#x116E] | [#x1172-#x1173] | #x1175 |
>> #x119E | #x11A8 | #x11AB | [#x11AE-#x11AF] | [#x11B7-#x11B8] | #x11BA |
>> [#x11BC-#x11C2] | #x11EB | #x11F0 | #x11F9 | ..
>>
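The claim about the XML 1.0 BaseChar excerpt can be checked mechanically. The sketch below parses only the jamo portion quoted above, not the full production, so it says nothing about other characters such as the precomposed syllables. The names BASECHAR_EXCERPT, parse_ranges and allowed are hypothetical.

```python
import re

# Jamo portion of the BaseChar production, as quoted in this message.
BASECHAR_EXCERPT = """
#x1100 | [#x1102-#x1103] | [#x1105-#x1107] | #x1109 | [#x110B-#x110C] |
[#x110E-#x1112] | #x113C | #x113E | #x1140 | #x114C | #x114E | #x1150 |
[#x1154-#x1155] | #x1159 | [#x115F-#x1161] | #x1163 | #x1165 | #x1167 |
#x1169 | [#x116D-#x116E] | [#x1172-#x1173] | #x1175 | #x119E | #x11A8 |
#x11AB | [#x11AE-#x11AF] | [#x11B7-#x11B8] | #x11BA | [#x11BC-#x11C2] |
#x11EB | #x11F0 | #x11F9
"""

ITEM = re.compile(r"\[#x([0-9A-F]+)-#x([0-9A-F]+)\]|#x([0-9A-F]+)")


def parse_ranges(text: str) -> list[tuple[int, int]]:
    """Turn '#xNNNN' and '[#xNNNN-#xNNNN]' items into (low, high) pairs."""
    ranges = []
    for match in ITEM.finditer(text):
        if match.group(3):  # single code point
            cp = int(match.group(3), 16)
            ranges.append((cp, cp))
        else:  # bracketed range
            ranges.append((int(match.group(1), 16), int(match.group(2), 16)))
    return ranges


RANGES = parse_ranges(BASECHAR_EXCERPT)


def allowed(sequence: list[int]) -> bool:
    """True when every code point falls inside the quoted excerpt."""
    return all(any(lo <= cp <= hi for lo, hi in RANGES) for cp in sequence)


print(allowed([0x1101, 0x1161]))          # False: #x1101 is not listed
print(allowed([0x1100, 0x1100, 0x1161]))  # True
```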
>> KIM, Kyongsok
>> * I have been a chair of Korea JTC1/SC2 (a committee on Coded
>> Character Set) since 1993.
>> This committee represents Korea in ISO/IEC JTC1/SC2, which is in
>> charge of UCS (ISO/IEC 10646).
>>
>> ====
>> I would like to know where in ISO/IEC 10646 the type of sequence
>> described in 3 is 'allowed' to represent such Hangul syllables,
>> because to the best of my knowledge it is not.
>> If it is not, the whole argument falls flat.
>> In XML 1.1, the syllable itself GGA is already allowed in the same
>> BaseChar production list: "[#xAC00-#xD7A3]", so the Hangul syllable
>> repertoire is already covered w/o adding that sequence explicitly, and
>> the syllable is the NFC representation of the GGA syllable.
>>
>> Michel Suignard
>> (project editor for 10646)
>>
>> [Note by another WG member:
>> 10646 is rather silent on that matter. But see the Unicode
>> standard. In version 5.0 this is discussed in section 3.12,
>> "Conjoining jamo behaviour". The key sentence there states:
>>
>> Unicode> Standard Korean syllable block: A sequence of one or more L
>> Unicode> followed by a sequence of one or more V and a sequence of zero
>> Unicode> or more T, or any other sequence that is canonically equivalent.]
>>
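The quoted definition is close to a regular expression, and can be sketched as one. This is a rough approximation under stated assumptions: the L, V and T classes use the conjoining jamo ranges listed earlier in this thread (U+1100..U+1159, U+1161..U+11A2, U+11A8..U+11F9), filler characters are left out, and the "canonically equivalent" clause (which brings in the precomposed syllables) is not modelled. The names SYLLABLE_BLOCK and is_syllable_block are hypothetical.

```python
import re

# One or more L, then one or more V, then zero or more T.
SYLLABLE_BLOCK = re.compile(
    "[\u1100-\u1159]+"   # L: choseong (leading consonants)
    "[\u1161-\u11A2]+"   # V: jungseong (vowels)
    "[\u11A8-\u11F9]*"   # T: jongseong (trailing consonants)
)


def is_syllable_block(text: str) -> bool:
    """True when the whole string is a single L+ V+ T* sequence."""
    return SYLLABLE_BLOCK.fullmatch(text) is not None


print(is_syllable_block("\u1100\u1161\u11A8"))              # True
print(is_syllable_block("\u1101\u1100\u1100\u1100\u1161"))  # True (GGGGGA)
print(is_syllable_block("\u1100\u11A8"))                    # False: no vowel
```

The second call shows how permissive the definition is: a run of four leading consonants followed by a vowel still counts as one block, even though no canonical composition produces a single syllable from it.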
>>
>> =====
>>
>> Unicode> Standard Korean syllable block: A sequence of one or more L
>> Unicode> followed by a sequence of one or more V and a sequence of zero
>> Unicode> or more T, or any other sequence that is canonically equivalent.
>>
>> Reading through section 3.12 of Unicode 5.0 is somewhat confusing,
>> because it tries to be very, very general for determining syllable
>> boundaries (virtually everything goes, as long as you can somehow
>> imagine that you might make a Korean syllable block out of it,
>> even if no such block ever has been made),
>>
>> "no such block ever has been made" is not a consideration for an
>> alphabetic script, like Hangul. But there are practical limitations
>> of size in this case, since one tries (in display/print) to fit all
>> letters of a syllable into a graphical block the size of an ideograph.
>> Some "syllables", like GGGGGA in Hangul, would simply be too crammed
>> (unless the block size was gigantic). On the other hand, GGGGGA is
>> not a very reasonable "syllable" in text representing real (and
>> reasonably spelled) words.
>>
>> whereas the descriptions
>> for canonical composition and decomposition are quite limited
>> (one block <=> two or three Jamo, depending on whether there is
>> a final consonant (group) or not).
>>
>> Yes, but that is only a subset of the possible (and reasonable)
>> syllables that can be written in Hangul. It only covers (a superset
>> of) what occurs in "modern Hangul" (modulo the multiletter issue),
>> but has really nothing to do with how the Hangul script is constructed.
>>
>> As an example, the sequence
>>
>> U+1101 (GG) U+1100 (G) U+1100 (G) U+1100 (G) U+1161 (A),
>> summarily written GGGGGA, would be a "Standard Korean syllable
>> block", too, the same way we would probably expect GGGGGA not to
>> be broken up by a hyphenation algorithm, whether it looks totally
>> silly (and in the Korean case, there's no way to display it as
>> a reasonably-looking syllable block) or not.
>>
>>
>> KIM, Kyongsok wrote:
>> ... each of the following three can represent Hangul syllable GGA:
>> 1) UAE4C (GGA)
>> 2) U1101 (GG), U1161 (A)
>> 3) U1100 (G), U1100 (G), U1161 (A)
>> - By NFC, 2) U1101 (GG), U1161 (A) will be changed to 1) UAE4C (GGA);
>> - However, by NFC, 3) U1100 (G), U1100 (G), U1161 (A) will
>> be changed to
>> U1100 (G), UAC00 (GA), which is "different" from 1) UAE4C.
>>
>> This is indeed the correct analysis. I find it very unfortunate
>> that U1101 (GG) does not have a *canonical* decomposition mapping
>> to <U1100 (G), U1100 (G)> (etc. for all the other multi-letter
>> Hangul Jamos). The Hangul script does NOT have a primitive Jamo
>> GG. The Hangul GG is, by design, composed of two G Jamos, just
>> like Latin GG is composed of two G letters.
>>
>> Well, it's very easy to take this position indeed.
>>
>> And is also how the Hangul script was actually designed.
>>
>> Also, it's possible to take the position that U+110F is the result of
>> adding a stroke to U+1100
>>
>> That is not how the Hangul script was designed, and is thus a
>> misinterpretation.
>>
>> (the equivalent, although in this day
>> and age much less clear, example would be that G is just a
>> Latin (in the true sense of the old Romans) C with a stroke or
>> hook added). The Korean script is so well designed that it's
>> difficult to know where to stop these decompositions.
>>
>> While that parallel holds for the strokes (dots originally, which
>> was a bit unfortunate graphically) for the vowels (for instance
>> Hangul O is NOT a composition of EU and ARAEA), and certain of the
>> consonants (e.g. THIEUTH is a primitive letter that happens to have
>> one more stroke than TIKEUT), it does not hold for the doubled
>> consonants (like SSANGKIYEOK) nor for any of the other multiletter
>> Jamos (for example, Hangul E *is* a composition of Hangul EO and
>> Hangul I).
>>
>> The design documents, both editions, are quite clear on these matters.
>> So there is no reason to guess how the Hangul letters, and letter
>> combinations, are constructed. While one may find the philosophy
>> for the graphical design of the individual letters to sometimes be
>> a bit doubtful, esp. for the vowels, it is clear what are individual
>> letters, and what are compositions of letters.
>>
>> See the original design document, translated to English in
>>
>> The Korean Language, Ho-Min Sohn, Cambridge University Press, 1999,
>> ISBN 0-521-36123-0 or 0-521-36943-6. (Section 6.3 gives a translation
>> to English of the 1444 design document for the Hangul alphabet.)
>>
>> Also (facsimile only, no translation), in
>>
>> A history of Korean Alphabet and Movable Types, Ministry of Culture
>> and Information, Republic of Korea, 1970. (Part 1 reproduces the 1444
>> official design document for the Hangul alphabet.)
>>
>> The revised and extended Hangul design document, reproduced,
>> translated to English (and analysed) in:
>>
>> The Korean alphabet of 1446 – Expositions, OPA, The visible speech
>> sounds, Annotated translation, Future applicability; Hwun Min Ceng
>> Um, Sek Yen Kim-Cho, Humanity Books and AC Press, New York, 2002,
>> ISBN 89-428-1587-1. (Reproduces, translates and analyses (in English)
>> the 1446 official design document for the Hangul alphabet.)
>>
>> The extended document from 1446 introduces the kapyeoun-combinations
>> as compositions with IEUNG at the end. It is clear that the little
>> circle below is really an IEUNG, not something else.
>>
>> This book from 2002 also introduces an interesting possible extension
>> to Hangul, putting "annotations" on the (primitive) Hangul letters
>> within syllable blocks, for use as a phonetic notation.
>>
>> Also of relevance:
>>
>> The Korean Alphabet, its history and structure, ed. Young-Key
>> Kim-Renaud, University of Hawai'i Press, 1997, ISBN 0-824-81989-6.
>>
>>
>> =====
>>
>>
>> NOTE NEW BUSINESS ADDRESS AND PHONE
>> Vint Cerf
>> Google
>> 1818 Library Street, Suite 400
>> Reston, VA 20190
>> 202-370-5637
>> vint at google.com
>>
>>
>>
>>
>>
>> _______________________________________________
>> Idna-update mailing list
>> Idna-update at alvestrand.no
>> http://www.alvestrand.no/mailman/listinfo/idna-update
>>
>>