Consensus Call Tranche 8 Summary - Addendum

Mark Davis mark at macchiato.com
Mon Oct 20 12:31:25 CEST 2008


We move forward just as we do for the other scripts; we don't exclude
non-modern characters, but registries are free to do that themselves.
Mark


On Mon, Oct 20, 2008 at 12:25 PM, Patrik Fältström <patrik at frobbit.se> wrote:

> On 20 Oct 2008, at 11.57, Mark Davis wrote:
>
>  I think this subject has been seriously muddled by misinformation, thus
>> causing those who are not completely familiar with the way that Hangul is
>> encoded in Unicode to be misled.
>>
>
> What confuses me is that the Koreans say there is a problem, and you say
> there is not a problem.
>
> Specifically, I am confused by this first sentence of yours. No, I am not
> completely familiar with Hangul as I am not Korean, so of course I might be
> misled.
>
> But the people from Korea are people I have to trust as they are the ones
> that use the language. And I take for granted you do not imply the people
> from Korea are misled on how Hangul is encoded in Unicode?
>
> They say there is a problem. You say there is not any problem.
>
> How do we move forward?
>
>   Patrik
>
>  The proposed label step:
>>
>> toNFKC(toCaseFolded(toNFKC(label))) != label
>>
>>
>> is pointless, since case folding doesn't have any effect on Korean
>> characters (jamo or syllables) and any label is guaranteed to be in NFKC
>> format anyway, due to other provisions in Tables and Protocol. The
>> compatibility jamos are also not in question, since they are also not in NFKC.
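A minimal sketch of the proposed check, assuming Python's standard `unicodedata.normalize` and `str.casefold` as stand-ins for toNFKC and toCaseFolded. The helper name `is_stable` is illustrative and does not come from the drafts. It suggests why the check cannot fire for Hangul that is already in NFKC, and why compatibility jamo fall outside the question.

```python
import unicodedata

def is_stable(label):
    """Return True when toNFKC(toCaseFolded(toNFKC(label))) == label."""
    nfkc = unicodedata.normalize("NFKC", label)
    folded = nfkc.casefold()  # assumption: casefold() approximates toCaseFolded
    return unicodedata.normalize("NFKC", folded) == label

# Conjoining jamo block (U+1100..U+11FF) and precomposed syllables
# (U+AC00..U+D7A3), taken one code point at a time.
jamo_block = [chr(cp) for cp in range(0x1100, 0x1200)]
syllables = [chr(cp) for cp in range(0xAC00, 0xD7A4)]
all_stable = all(is_stable(ch) for ch in jamo_block + syllables)

# A compatibility jamo is not stable: NFKC maps U+3131 to conjoining U+1100.
compat_kiyeok_nfkc = unicodedata.normalize("NFKC", "\u3131")
```

Here `all_stable` covers single code points only; sequences of jamo are a separate matter, since a composable sequence is by definition not in NFKC.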
>>
>> The characters in question are all and only the following:
>>
>> The conjoining Jamos.
>>
>> U+1100 <http://unicode.org/cldr/utility/character.jsp?a=1100> HANGUL CHOSEONG KIYEOK
>> …{88}…
>> U+1159 <http://unicode.org/cldr/utility/character.jsp?a=1159> HANGUL CHOSEONG YEORINHIEUH
>> U+1161 <http://unicode.org/cldr/utility/character.jsp?a=1161> HANGUL JUNGSEONG A
>> …{64}…
>> U+11A2 <http://unicode.org/cldr/utility/character.jsp?a=11A2> HANGUL JUNGSEONG SSANGARAEA
>> U+11A8 <http://unicode.org/cldr/utility/character.jsp?a=11A8> HANGUL JONGSEONG KIYEOK
>> …{80}…
>> U+11F9 <http://unicode.org/cldr/utility/character.jsp?a=11F9> HANGUL JONGSEONG YEORINHIEUH
>>
>>
>> The Hangul Syllables.
>>
>> U+AC00 <http://unicode.org/cldr/utility/character.jsp?a=AC00> ( 가 ) HANGUL SYLLABLE GA
>> …{11170}…
>> U+D7A3 <http://unicode.org/cldr/utility/character.jsp?a=D7A3> ( 힣 ) HANGUL SYLLABLE HIH
>>
>>
>> Any sequence of conjoining jamo that could correspond to a Hangul Syllable
>> according to the Unicode Standard canonical equivalence is transformed
>> into it by the toNFKC function. Thus a sequence of jamo that could
>> correspond to a HS according to the Unicode Standard *cannot* be in an
>> IDNA2008 label according to Tables and Protocol.
>>
>> That is what is meant by saying that there is no comparison problem.
>> Anything that is equivalent to a HS according to TUS will already be a HS
>> in an IDNA2008 label according to Tables and Protocol.
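The round trip described above can be checked mechanically. The following is a sketch only, assuming Python's `unicodedata` module as the normalization implementation; the helper name `jamo_sequence` is hypothetical. It decomposes every precomposed syllable and confirms that NFKC folds the resulting jamo sequence back into the same syllable.

```python
import unicodedata

S_BASE, S_COUNT = 0xAC00, 11172  # precomposed syllables U+AC00..U+D7A3

def jamo_sequence(syllable):
    """Canonical decomposition (NFD) of one precomposed syllable."""
    return unicodedata.normalize("NFD", syllable)

lengths = set()
round_trips = True
for cp in range(S_BASE, S_BASE + S_COUNT):
    syllable = chr(cp)
    jamo = jamo_sequence(syllable)
    lengths.add(len(jamo))
    # NFKC should fold the jamo sequence back into the single syllable.
    if unicodedata.normalize("NFKC", jamo) != syllable:
        round_trips = False
```

After the loop, `lengths` records how many conjoining jamo each syllable decomposes into, and `round_trips` stays true only if no canonically equivalent jamo sequence survives NFKC.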
>>
>> Now, one could have a contextual rule that forbade Jamo in situations
>> where they could not be part of a valid syllable, and if people really
>> wanted that we could do it.
>>
>> But frankly, it is not worth the effort. Unlike the case of the ZW
>> joiners, these are not invisible characters; the worst that would happen
>> is someone would see nonsense on the screen -- but it is not our place to
>> try to forbid nonsensical labels.
>>
>> This is clearly a case where the Korean NIC is free to narrow the set of
>> labels they accept to exclude non-modern characters, just as the German
>> NIC is free to exclude archaic German characters, or the British NIC is
>> free to exclude archaic English characters (like Þ or ð).
>>
>> Mark
>>
>>
>> On Mon, Oct 20, 2008 at 11:10 AM, Vint Cerf <vint at google.com> wrote:
>>
>>  Consensus Call Tranche 8 (Character Adjustments) - Addendum
>>>
>>> I neglected to summarize a number of messages relating to the JAMO
>>> discussion (they had subject fields that were specific to the JAMO
>>> discussion and did not appear when all the email was sorted with the
>>> original subject of the consensus call)
>>>
>>> As a result, the polling actually produced 9 YES and 8 NO - still clearly
>>> no final consensus.
>>>
>>> (8.c) Disallow conjoining Hangul jamo per recommendation from
>>> KRNIC and others, permitting only precomposed syllables.
>>>
>>> COMMENTS:
>>>
>>> I agree with the line of thought that we really should not disregard the
>>> results of the consensus position established by the most relevant
>>> language community after a rather extensive consensus process, so in
>>> general, I would side with the experts in Korea.
>>>
>>>
>>> Nevertheless, having been through this discussion many times, I
>>> understand that there are opinions otherwise and am hoping to make a
>>> suggestion that could reconcile the lines of thought and be consistent
>>> with our architecture.  When we last discussed the issue of conjoining
>>> Hangul Jamo, I had suggested exploring the possibility of addressing them
>>> in the following manner:
>>>
>>>
>>> 1. categorize all Hangul Jamo as CONTEXTO
>>> 2. add stability contextual rule for these codepoints where the following
>>> must be true:
>>> toNFKC(toCaseFolded(toNFKC(label))) != label
>>>
>>>
>>>
>>>
>>> I am not familiar enough with Korean, but this might strike a graceful
>>> balance between disallowing conjoining jamo that form a modern Hangul
>>> syllable and continuing to allow archaic jamo without creating too much
>>> confusion?...
>>>
>>>
>>> If I recall correctly, there was a response that it seemed interesting,
>>> but it was not further discussed.  Do people think it might be a viable
>>> approach to resolve the issue?
>>> ================
>>>
>>> As I understand it, and I agree, it might not solve all the issues (as it
>>> stands, still thinking), but it does solve 2 types of issues:
>>>
>>> 1. combination of modern Jamos that do combine to a Hangul syllable,
>>> e.g.:
>>>
>>> U+1109;U+1161;U+11BC  =>  U+C0C1
>>>
>>> In this case, the use of <U+1109;U+1161;U+11BC> would effectively be
>>> disallowed.
>>>
>>>
>>> 2. combination of modern Jamos with old Jamos which combine to 1 Hangul
>>> syllable and 1 old Jamo, e.g.:
>>>
>>> U+1109;U+1161;U+11F0  =>  U+C0AC;U+11F0
>>>
>>> In this case, also, the use of <U+1109;U+1161;U+11F0> would be
>>> effectively disallowed.
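Both mappings above can be reproduced with a short sketch, assuming Python's `unicodedata` as the NFKC implementation. The helper `nfkc_codepoints` is illustrative only.

```python
import unicodedata

def nfkc_codepoints(*codepoints):
    """Normalize a code point sequence to NFKC and list the result as U+XXXX."""
    text = "".join(chr(cp) for cp in codepoints)
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFKC", text)]

# Case 1: three modern jamo that compose into one syllable.
modern = nfkc_codepoints(0x1109, 0x1161, 0x11BC)

# Case 2: two modern jamo plus an old trailing jamo; only the first two compose.
mixed = nfkc_codepoints(0x1109, 0x1161, 0x11F0)
```

In both cases the normalized form differs from the input, so a label containing either input sequence would not be in NFKC.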
>>>
>>> It seems to me, if we are going to not disallow jamos, this would at
>>> least be a measure to avoid some of the most obvious problems in the
>>> context of IDN.
>>>
>>> The cases where no combination happens under KC are the cases which would
>>> need further investigation.  It may be possible to add additional rules
>>> based on the algorithms for displaying Hangul characters....?...
>>>
>>> =======
>>>
>>> Mark Davis <mark at macchiato.com> Wed, Oct 15, 2008 at 5:56 AM
>>>
>>>> That is, each of the Hangul precomposed syllables decomposes into one or
>>>> two
>>>
>>> one or two (wrong)---> two or three (correct) (Am I missing something
>>> here?)
>>>
>>>> combining jamo under NFD, and under NFC that sequence of combining jamo
>>>> composes back into that syllable. The comparisons *do* work correctly,
>>>> since IDNA labels have to be in NFC.
>>>>
>>>
>>>
>>> - Well, I would have to disagree with you.
>>> Let me explain why the above claim is not correct.
>>>
>>>
>>> - According to UCS (ISO/IEC 10646), each of the following three can
>>> represent Hangul syllable GGA:
>>>
>>> 1) UAE4C (GGA)
>>>
>>> 2) U1101 (GG), U1161 (A)
>>>
>>> 3) U1100 (G), U1100 (G), U1161 (A)
>>>
>>> - By NFC, 2) U1101 (GG), U1161 (A) will be changed to 1) UAE4C (GGA);
>>> - However, by NFC, 3) U1100 (G), U1100 (G), U1161 (A) will be changed to
>>> U1100 (G), UAC00 (GA), which is "different" from 1) UAC00-based UAE4C.
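The two NFC results in this comparison can be reproduced with a brief sketch, assuming Python's `unicodedata` module. The precomposed GGA is taken here to be whatever NFC produces for <U+1101, U+1161>, which Python reports as U+AE4C, HANGUL SYLLABLE GGA.

```python
import unicodedata

def nfc(text):
    """Normalize text to NFC."""
    return unicodedata.normalize("NFC", text)

two_jamo = nfc("\u1101\u1161")          # GG + A
three_jamo = nfc("\u1100\u1100\u1161")  # G + G + A

# Character name of the composed result, for cross-checking the code point.
two_jamo_name = unicodedata.name(two_jamo)

# The two spellings stay distinct after normalization.
same_after_nfc = two_jamo == three_jamo
```

The second sequence keeps its leading G because composition pairs only the G directly before the vowel with it, leaving a G jamo followed by the syllable GA.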
>>>
>>>
>>>  The comparisons *do* work correctly,
>>>>
>>>
>>>
>>> - ??? Isn't it considered comparison failure? (Am I missing something
>>> here?)
>>> - As we saw, NFC/NFD does not work correctly even for modern Hangul,
>>> (not to mention Old Hangul)!
>>>
>>> [comment by another WG member:
>>>
>>> This is indeed the correct analysis. I find it very unfortunate
>>> that U1101 (GG) does not have a *canonical* decomposition mapping
>>> to <U1100 (G), U1100 (G)> (etc. for all the other multi-letter
>>> Hangul Jamos). The Hangul script does NOT have a primitive Jamo
>>> GG. The Hangul GG is, by design, composed of two G Jamos, just
>>> like Latin GG is composed of two G letters.]
>>>
>>> [comment by another WG member:
>>> Reading through section 3.12 of Unicode 5.0 is somewhat confusing,
>>> because it tries to be very, very general for determining syllable
>>> boundaries (virtually everything goes, as long as you can somehow
>>> imagine that you might make a Korean syllable block out of it,
>>> even if no such block ever has been made), whereas the descriptions
>>> for canonical composition and decomposition are quite limited
>>> (one block <=> two or three Jamo, depending on whether there is
>>> a final consonant (group) or not). As an example, the sequence
>>>
>>> U+1101 (GG) U+1100 (G) U+1100 (G) U+1100 (G) U+1161 (A),
>>> summarily written GGGGGA, would be a "Standard Korean syllable
>>> block", too, the same way we would probably expect GGGGGA not to
>>> be broken up by a hyphenation algorithm, whether it looks totally
>>> silly (and in the Korean case, there's no way to display it as
>>> a reasonably-looking syllable block) or not.]
>>>
>>> [comment by another WG member:
>>> Well, it's very easy to take this position indeed. Also, it's
>>> also possible to take the position that U+110F is the result of
>>> adding a stroke to U+1100 (the equivalent, although in this day
>>> and age much less clear, example would be that G is just a
>>> Latin (in the true sense of the old Romans) C with a stroke or
>>> hook added). The Korean script is so well designed that it's
>>> difficult to know where to stop these decompositions.]
>>>
>>> ------------
>>>
>>> Note. For example, in XML 1.0 (fourth ed), 2) U1101 (GG), U1161 (A) is
>>> NOT allowed since #x1101 is not included; in contrast, 3) U1100 (G),
>>> U1100 (G), U1161 (A) IS allowed.
>>>
>>> (source: http://www.w3.org/TR/2006/REC-xml-20060816/#NT-BaseChar)
>>>
>>> .#x1100 | [#x1102-#x1103] | [#x1105-#x1107] | #x1109 | [#x110B-#x110C] |
>>> [#x110E-#x1112] | #x113C | #x113E | #x1140 | #x114C | #x114E | #x1150 |
>>> [#x1154-#x1155] | #x1159 | [#x115F-#x1161] | #x1163 | #x1165 | #x1167 |
>>> #x1169 | [#x116D-#x116E] | [#x1172-#x1173] | #x1175 | #x119E | #x11A8 |
>>> #x11AB | [#x11AE-#x11AF] | [#x11B7-#x11B8] | #x11BA | [#x11BC-#x11C2] |
>>> #x11EB | #x11F0 | #x11F9 | ..
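The membership claim in the note can be checked against the excerpt itself. This is a sketch that encodes only the jamo ranges quoted above, not the full BaseChar production; the names `JAMO_BASECHAR` and `in_basechar_excerpt` are made up for illustration.

```python
# Jamo portion of the BaseChar production as quoted in the excerpt above,
# written as inclusive (low, high) ranges.
JAMO_BASECHAR = [
    (0x1100, 0x1100), (0x1102, 0x1103), (0x1105, 0x1107), (0x1109, 0x1109),
    (0x110B, 0x110C), (0x110E, 0x1112), (0x113C, 0x113C), (0x113E, 0x113E),
    (0x1140, 0x1140), (0x114C, 0x114C), (0x114E, 0x114E), (0x1150, 0x1150),
    (0x1154, 0x1155), (0x1159, 0x1159), (0x115F, 0x1161), (0x1163, 0x1163),
    (0x1165, 0x1165), (0x1167, 0x1167), (0x1169, 0x1169), (0x116D, 0x116E),
    (0x1172, 0x1173), (0x1175, 0x1175), (0x119E, 0x119E), (0x11A8, 0x11A8),
    (0x11AB, 0x11AB), (0x11AE, 0x11AF), (0x11B7, 0x11B8), (0x11BA, 0x11BA),
    (0x11BC, 0x11C2), (0x11EB, 0x11EB), (0x11F0, 0x11F0), (0x11F9, 0x11F9),
]

def in_basechar_excerpt(cp):
    """True when the code point falls in one of the quoted jamo ranges."""
    return any(low <= cp <= high for low, high in JAMO_BASECHAR)

# Sequence 2) U1101, U1161 versus sequence 3) U1100, U1100, U1161.
seq2_allowed = all(in_basechar_excerpt(cp) for cp in (0x1101, 0x1161))
seq3_allowed = all(in_basechar_excerpt(cp) for cp in (0x1100, 0x1100, 0x1161))
```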
>>>
>>> KIM, Kyongsok
>>> * I have been a chair of Korea JTC1/SC2 (a committee on Coded Character
>>> Set) since 1993.
>>> This committee represents Korea in ISO/IEC JTC1/SC2 which is in charge of
>>> UCS (ISO/IEC 10646).
>>>
>>> ====
>>> I would like to know where in ISO/IEC 10646 the type of sequence
>>> described in 3 is 'allowed' to represent such Hangul syllables. Because
>>> to the best of my knowledge it is not.
>>> If it is not, the whole argument falls flat.
>>> In XML 1.1, the syllable itself GGA is already allowed in the same
>>> BaseChar production list: "[#xAC00-#xD7A3]", so the Hangul syllable
>>> repertoire is already covered w/o adding that sequence explicitly, and
>>> the syllable is the NFC representation of the GGA syllable.
>>>
>>> Michel Suignard
>>> (project editor for 10646)
>>>
>>> [Note by another WG member:
>>> 10646 is rather silent on that matter. But see the Unicode
>>> standard. In version 5.0 this is discussed in section 3.12,
>>> "Conjoining jamo behaviour". The key sentence there states:
>>>
>>> Unicode> Standard Korean syllable block: A sequence of one or more L
>>> Unicode> followed by a sequence of one or more V and a sequence of zero
>>> Unicode> or more T, or any other sequence that is canonically
>>> Unicode> equivalent.]
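The L+ V+ T* part of that definition maps directly onto a regular expression. The sketch below is a rough illustration, assuming the conjoining jamo ranges listed earlier in this message (U+1100..U+1159, U+1161..U+11A2, U+11A8..U+11F9). It deliberately ignores fillers and the "canonically equivalent" precomposed syllables, and the names `SYLLABLE_BLOCK` and `split_blocks` are hypothetical.

```python
import re

# Character-class bodies for leading consonants, vowels and trailing consonants.
# Assumption: ranges follow the jamo listed in this thread, without fillers.
L = "\u1100-\u1159"
V = "\u1161-\u11A2"
T = "\u11A8-\u11F9"

# One or more L, then one or more V, then zero or more T.
SYLLABLE_BLOCK = re.compile(f"[{L}]+[{V}]+[{T}]*")

def split_blocks(text):
    """Return the jamo runs in text that match the L+ V+ T* pattern."""
    return SYLLABLE_BLOCK.findall(text)

# GG G G G A: a single block under this definition, however odd it looks.
ggggga = "\u1101\u1100\u1100\u1100\u1161"
one_block = split_blocks(ggggga)

# G A G A: two separate blocks.
two_blocks = split_blocks("\u1100\u1161\u1100\u1161")
```

This covers only boundary determination; it says nothing about whether a block composes under NFC or can be rendered.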
>>>
>>>
>>> =====
>>>
>>> Unicode> Standard Korean syllable block: A sequence of one or more L
>>> Unicode> followed by a sequence of one or more V and a sequence of zero
>>> Unicode> or more T, or any other sequence that is canonically
>>> Unicode> equivalent.
>>>
>>> Reading through section 3.12 of Unicode 5.0 is somewhat confusing,
>>> because it tries to be very, very general for determining syllable
>>> boundaries (virtually everything goes, as long as you can somehow
>>> imagine that you might make a Korean syllable block out of it,
>>> even if no such block ever has been made),
>>>
>>> "no such block ever has been made" is not a consideration for an
>>> alphabetic script, like Hangul. But there are practical limitations
>>> of size in this case, since one tries (in display/print) to fit all
>>> letters of a syllable into a graphical block the size of an ideograph.
>>> Some "syllables", like GGGGGA in Hangul, would simply be too crammed
>>> (unless the block size was gigantic). On the other hand, GGGGGA is
>>> not a very reasonable "syllable" in text representing real (and
>>> reasonably spelled) words.
>>>
>>> whereas the descriptions
>>> for canonical composition and decomposition are quite limited
>>> (one block <=> two or three Jamo, depending on whether there is
>>> a final consonant (group) or not).
>>>
>>> Yes, but that is only a subset of the possible (and reasonable)
>>> syllables that can be written in Hangul. It only covers (a superset
>>> of) what occurs in "modern Hangul" (modulo the multiletter issue),
>>> but has really nothing to do with how the Hangul script is constructed.
>>>
>>> As an example, the sequence
>>>
>>> U+1101 (GG) U+1100 (G) U+1100 (G) U+1100 (G) U+1161 (A),
>>> summarily written GGGGGA, would be a "Standard Korean syllable
>>> block", too, the same way we would probably expect GGGGGA not to
>>> be broken up by a hyphenation algorithm, whether it looks totally
>>> silly (and in the Korean case, there's no way to display it as
>>> a reasonably-looking syllable block) or not.
>>>
>>>
>>> KIM, Kyongsok wrote:
>>> ... each of the following three can represent Hangul syllable GGA:
>>> 1) UAE4C (GGA)
>>> 2) U1101 (GG), U1161 (A)
>>> 3) U1100 (G), U1100 (G), U1161 (A)
>>> - By NFC, 2) U1101 (GG), U1161 (A) will be changed to 1) UAE4C (GGA);
>>> - However, by NFC, 3) U1100 (G), U1100 (G), U1161 (A) will be changed to
>>> U1100 (G), UAC00 (GA), which is "different" from 1) UAE4C.
>>>
>>> This is indeed the correct analysis. I find it very unfortunate
>>> that U1101 (GG) does not have a *canonical* decomposition mapping
>>> to <U1100 (G), U1100 (G)> (etc. for all the other multi-letter
>>> Hangul Jamos). The Hangul script does NOT have a primitive Jamo
>>> GG. The Hangul GG is, by design, composed of two G Jamos, just
>>> like Latin GG is composed of two G letters.
>>>
>>> Well, it's very easy to take this position indeed.
>>>
>>> And is also how the Hangul script was actually designed.
>>>
>>> Also, it's
>>> also possible to take the position that U+110F is the result of
>>> adding a stroke to U+1100
>>>
>>> That is not how the Hangul script was designed, and is thus a
>>> misinterpretation.
>>>
>>> (the equivalent, although in this day
>>> and age much less clear, example would be that G is just a
>>> Latin (in the true sense of the old Romans) C with a stroke or
>>> hook added). The Korean script is so well designed that it's
>>> difficult to know where to stop these decompositions.
>>>
>>> While that parallel holds for the strokes (dots originally, which
>>> was a bit unfortunate graphically) for the vowels (for instance
>>> Hangul O is NOT a composition of EU and ARAEA), and certain of the
>>> consonants (e.g. THIEUTH is a primitive letter that happens to have
>>> one more stroke than TIKEUT) it does not hold for the doubled consonants
>>> (like SSANGKIYEOK) nor for any of the other multiletter Jamos (like
>>> Hangul E *is* a composition of Hangul EO and Hangul I).
>>>
>>> The design documents, both editions, are quite clear on these matters.
>>> So there is no reason to guess how the Hangul letters, and letter
>>> combinations, are constructed. While one may find the philosophy
>>> for the graphical design of the individual letters to sometimes be
>>> a bit doubtful, esp. for the vowels, it is clear what are individual
>>> letters, and what are compositions of letters.
>>>
>>> See the original design document, translated to English in
>>>
>>> The Korean Language, Ho-Min Sohn, Cambridge University Press, 1999,
>>> ISBN 0-521-36123-0 or 0-521-36943-6. (Section 6.3 gives a translation
>>> to English of the 1444 design document for the Hangul alphabet.)
>>>
>>> Also (facsimile only, no translation), in
>>>
>>> A history of Korean Alphabet and Movable Types, Ministry of Culture
>>> and Information, Republic of Korea, 1970. (Part 1 reproduces the 1444
>>> official design document for the Hangul alphabet.)
>>>
>>> The revised and extended Hangul design document, reproduced,
>>> translated to English (and analysed) in:
>>>
>>> The Korean alphabet of 1446 – Expositions, OPA, The visible speech
>>> sounds, Annotated translation, Future applicability; Hwun Min Ceng
>>> Um, Sek Yen Kim-Cho, Humanity Books and AC Press, New York, 2002,
>>> ISBN 89-428-1587-1. (Reproduces, translates and analyses (in English)
>>> the 1446 official design document for the Hangul alphabet.)
>>>
>>> The extended document from 1446 introduces the kapyeoun- combinations as
>>> compositions with IEUNG at the end. It is clear that the little circle
>>> below is really an IEUNG, not something else.
>>>
>>> This book from 2002 also introduces an interesting possible extension
>>> to Hangul, putting "annotations" on the (primitive) Hangul letters
>>> within syllable blocks, for use as a phonetic notation.
>>>
>>> Also of relevance:
>>>
>>> The Korean Alphabet, its history and structure, ed. Young-Key
>>> Kim-Renaud, University of Hawai'i Press, 1997, ISBN 0-824-81989-6.
>>>
>>>
>>> =====
>>>
>>>
>>> NOTE NEW BUSINESS ADDRESS AND PHONE
>>> Vint Cerf
>>> Google
>>> 1818 Library Street, Suite 400
>>> Reston, VA 20190
>>> 202-370-5637
>>> vint at google.com
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Idna-update mailing list
>>> Idna-update at alvestrand.no
>>> http://www.alvestrand.no/mailman/listinfo/idna-update
>>>
>>>
>>
>
>