Consensus Call Tranche 8 Summary - Addendum

Martin Duerst duerst at it.aoyama.ac.jp
Tue Oct 21 10:21:13 CEST 2008


[I'm replying to this mail because of the formatting, but
mostly to Patrick.]

At 19:31 08/10/20, Mark Davis wrote:
>We move forward just as we do for the other scripts; we don't exclude non-modern characters, but registries are free to do that themselves.

Agreed.


>On Mon, Oct 20, 2008 at 12:25 PM, Patrik Fältström <patrik at frobbit.se> wrote:
>On 20 okt 2008, at 11.57, Mark Davis wrote:
>I think this subject has been seriously muddled by misinformation, thus 
>causing those who are not completely familiar with the way that Hangul is 
>encoded in Unicode to be misled.
>
>
>What confuses me is that the Koreans say there is a problem, and you say there is not a problem.
>
>Specifically, I am confused by this first sentence of yours. No, I am not completely familiar with Hangul as I am not Korean, so of course I might be misled.
>
>But the people from Korea are people I have to trust, as they are the ones who use the language. And I take for granted you do not imply the people from Korea are misled on how Hangul is encoded in Unicode?
>
>They say there is a problem. You say there is not any problem.

The Hangul script, in contrast to most if not all scripts,
is very carefully designed at many levels (featural, 'letter',
syllable,...). That makes it very difficult to encode, because
there are so many choices. This is clearly visible in Unicode,
where we have Hangul Jamo, Hangul Syllables, Hangul Compatibility
Jamo, and Halfwidth Hangul Variants, and many of us remember the
moving around and expansion of Hangul Syllables in what's usually
termed the "Korean mess".

The fact that the most carefully and beautifully designed script
has met such a fate when being encoded may be seen as ironic or
tragic, but my guess is that the same would have happened to any
script with the same amount of structure. The more structure, the
more possibilities there are for encoding, and the more possibilities
there are for disagreement.

However, since the moving around and expansion of the Hangul Syllables,
the encoding model of Unicode has remained stable for years, and
the additions currently planned/scheduled with Amendment 5 to
ISO/IEC 10646 only confirm this. The encoding model has two
levels:

- Hangul Syllables: There is a full set of 11,172 precomposed
  Hangul Syllables, containing the combinations of all
  leading consonant groups, middle vowel groups, and (optional)
  final consonant groups in modern use.

- Hangul Jamo: Three lists of leading consonant groups, middle vowel
  groups, and final consonant groups, including both those groups
  in modern use as well as those in historic use only. 
  Each Hangul Syllable (modern or historic) can be written as a
  combination of two (leading/vowel) or three (leading/vowel/final)
  Hangul Jamo.
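
The relation between the two levels can be checked with any normalization
library. The following is only a rough sketch using Python's standard
unicodedata module; the helper name jamo_names is made up for this
illustration and does not come from the IDNA drafts or the Unicode Standard.

```python
import unicodedata

def jamo_names(syllable):
    """Names of the conjoining Jamo that `syllable` decomposes to under NFD."""
    return [unicodedata.name(ch) for ch in unicodedata.normalize("NFD", syllable)]

ga = "\uAC00"   # HANGUL SYLLABLE GA: leading + vowel
han = "\uD55C"  # HANGUL SYLLABLE HAN: leading + vowel + final

print(jamo_names(ga))   # two Jamo
print(jamo_names(han))  # three Jamo

# NFC composes the Jamo sequence back into the original syllable.
roundtrip = unicodedata.normalize("NFC", unicodedata.normalize("NFD", han))
print(roundtrip == han)
```

A modern syllable without a final yields two Jamo, one with a final
yields three, and NFC restores the precomposed form in both cases.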


As far as I understand, there are two points of disagreement:

- NFC combines a leading and a vowel group into a Hangul Syllable
  even if followed by a historic consonant group. For some kinds
  of processing, it is more convenient to have things all-decomposed
  once you start to decompose (mind you, for certain kinds of processing,
  it is more convenient to decompose everything even if your
  data is Hangul only). There is no problem here for IDNs because
  NFC is unambiguous, well defined and widely implemented.
  (NFC is also shorter than a pure Jamo representation.)

- Each Hangul Jamo is not necessarily one letter, but can be
  a combination of letters. For final consonant groups, there are
  e.g. ks, nc, nh, rk, rm, rp, rh, and ps combinations even just
  for modern usage (although, as far as I understand, these are
  no longer spoken the way they are written). Both in leading
  and final position, there are also consonant doubles.
  There is some disagreement about what to do with sequences
  of Jamo that can be seen as corresponding to another Jamo.
  Unicode, since 2.1.9, treats these all as separate. However,
  there is no disagreement on the fact that these should be
  treated as combining on input, because keyboards will not
  contain all combinations. Also, it is unclear that applications
  will automatically display piecemeal Jamos and syllables
  containing only two or three Jamos (one each for leading,
  vowel, and potentially final) in the same way. For current,
  widely used applications such as MS Word or browsers, this
  isn't the case, and they also don't support display of
  historical Korean yet. (The historical perspective that
  Kent has given is one important aspect, but in general and
  in this case, historic considerations are only one part of
  character encoding decisions.)
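
Both points can be reproduced with any NFC implementation. What follows is
a minimal sketch with Python's unicodedata module; the helper and variable
names are chosen only for this illustration.

```python
import unicodedata

def nfc(s):
    return unicodedata.normalize("NFC", s)

def codepoints(s):
    return " ".join(f"U+{ord(ch):04X}" for ch in s)

# Point 1: leading + vowel + historic final. NFC composes the first two
# into a precomposed syllable and leaves the historic final as a Jamo.
historic = "\u1109\u1161\u11F0"
print(codepoints(nfc(historic)))   # U+C0AC U+11F0

# Point 2: a doubled consonant written as one Jamo versus two single Jamo.
# SSANGKIYEOK + A composes to one precomposed syllable; KIYEOK KIYEOK A
# does not, because Unicode treats the two sequences as distinct.
one_jamo = "\u1101\u1161"
two_jamo = "\u1100\u1100\u1161"
print(len(nfc(one_jamo)))              # 1
print(codepoints(nfc(two_jamo)))       # U+1100 U+AC00
print(nfc(one_jamo) == nfc(two_jamo))  # False
```

Either way the result of NFC is deterministic, which is the property that
matters for label comparison.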

Both of the above disagreements are independent of whether Jamo
are protocol-valid or disallowed. Everybody agrees that for
historic stuff, you need Jamo, and for modern stuff, you don't,
and everybody agrees that KRNIC does, can and should exclude
Jamo unless and until they want to offer historic Hangul domain
names.
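
A registry-side exclusion of this kind is simple to state. As a purely
hypothetical sketch in Python, assume a registry rejects any label that
contains a code point from the Hangul Jamo block (U+1100..U+11FF); the
function name and this exact policy are illustrative assumptions, not
KRNIC's actual rules.

```python
def contains_conjoining_jamo(label):
    """True if `label` has any code point in the Hangul Jamo block (U+1100..U+11FF)."""
    return any(0x1100 <= ord(ch) <= 0x11FF for ch in label)

modern = "\uD55C\uAE00"    # two precomposed Hangul Syllables
mixed = "\uC0AC\u11F0"     # precomposed syllable followed by a historic final Jamo

print(contains_conjoining_jamo(modern))  # False
print(contains_conjoining_jamo(mixed))   # True
```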

So in conclusion, yes, there may be problems (or there may
not be), but they are not relevant to our decision at hand.


I think there is some validity in the comment by Vaggelis
that .gr (for Greek, and by extension, .kr for Hangul) may
easily get things correct, but that this may not apply to
other registries.


One idea I just had was to create a category HISTORIC.
While this category would be equivalent to PROTOCOL-VALID
for the protocol, it would clearly give some information
to registries out there. Because it would not mean any
decision with regard to the protocol, it might be easier
for us to come forward with some guidelines on what
to put into HISTORIC, easier than it was with MAYBE
and friends. Just an idea. And it wouldn't help for
the Greek final Sigma, but it would send a signal for
Hangul and other scripts.

Regards,    Martin.



>How do we move forward?
>
>  Patrik
>
>The proposed label step:
>toNFKC(toCaseFolded(toNFKC(label))) != label
>
>is pointless, since case folding doesn't have any effect on Korean 
>characters (jamo or syllables) and any label is guaranteed to be in NFKC 
>format anyway, due to other provisions in Tables and Protocol. The 
>compatibility jamos are also not in question, since they are also not in NFKC.
>The characters in question are all and only the following:
>The conjoining Jamos.
>U+1100 <http://unicode.org/cldr/utility/character.jsp?a=1100> HANGUL CHOSEONG KIYEOK
>... {88} ...
>U+1159 <http://unicode.org/cldr/utility/character.jsp?a=1159> HANGUL CHOSEONG YEORINHIEUH
>U+1161 <http://unicode.org/cldr/utility/character.jsp?a=1161> HANGUL JUNGSEONG A
>... {64} ...
>U+11A2 <http://unicode.org/cldr/utility/character.jsp?a=11A2> HANGUL JUNGSEONG SSANGARAEA
>U+11A8 <http://unicode.org/cldr/utility/character.jsp?a=11A8> HANGUL JONGSEONG KIYEOK
>... {80} ...
>U+11F9 <http://unicode.org/cldr/utility/character.jsp?a=11F9> HANGUL JONGSEONG YEORINHIEUH
>
>The Hangul Syllables.
>U+AC00 <http://unicode.org/cldr/utility/character.jsp?a=AC00> ( 가 ) HANGUL SYLLABLE GA
>... {11170} ...
>U+D7A3 <http://unicode.org/cldr/utility/character.jsp?a=D7A3> ( 힣 ) HANGUL SYLLABLE HIH
>
>Any sequence of Jamo syllables that could correspond to a Hangul Syllable 
>according to the Unicode Standard canonical equivalence is transformed into 
>it by the toNFKC function. Thus a sequence of Jamo syllables that could 
>correspond to a HS according to the Unicode Standard *cannot* be in an 
>IDNA2008 label according to Tables and Protocol.
>That is what is meant by saying that there is no comparison problem. Anything that 
>is equivalent to a HS according to TUS will already be a HS in an IDNA2008 
>label according to Tables and Protocol.
>Now, one could have a contextual rule that forbade Jamo in situations where 
>they could not be part of a valid syllable, and if people really wanted that 
>we could do it.
>But frankly, it is not worth the effort. Unlike the case of the ZW joiners, 
>these are not invisible characters; the worst that would happen is someone 
>would see nonsense on the screen -- but it is not our place to try to forbid 
>nonsensical labels.
>This is clearly a case where the Korean NIC is free to narrow the set of 
>labels they accept to exclude non-modern characters, just as the German NIC 
>is free to exclude archaic German characters, or the British NIC free to 
>exclude archaic English characters (like Þ or ð).
>Mark
>
>On Mon, Oct 20, 2008 at 11:10 AM, Vint Cerf <vint at google.com> wrote:
>Consensus Call Tranche 8 (Character Adjustments) - Addendum
>I neglected to summarize a number of messages relating to the JAMO 
>discussion (they had subject fields that were specific to the JAMO 
>discussion and did not appear when all the email was sorted with the 
>original subject of the consensus call)
>As a result, the polling actually produced 9 YES and 8 NO - still clearly 
>no final consensus.
>(8.c) Disallow conjoining Hangul jamo per recommendation from 
>KRNIC and others, permitting only precomposed syllables.
>COMMENTS:
>I agree with the line of thought that we really should not disregard the 
>results of the consensus position established by the most relevant language 
>community after a rather extensive consensus process, so in general, I would 
>side with the experts in Korea.
>
>Nevertheless, having been through this discussion many times, I 
>understand that there are opinions otherwise and am hoping to make a 
>suggestion that could reconcile the lines of thought and be consistent with 
>our architecture.  When we last discussed the issue of conjoining Hangul 
>Jamo, I had suggested exploring the possibility of addressing them in the 
>following manner:
>
>1. categorize all Hangul Jamo as CONTEXTO 
>2. add stability contextual rule for these codepoints where the following 
>must be true: 
>toNFKC(toCaseFolded(toNFKC(label))) != label
>
>
>
>I am not familiar enough with Korean, but this might strike a graceful 
>balance between disallowing conjoining jamo that form a modern Hangul syllable and 
>continuing to allow archaic Jamo without creating too much confusion?...
>
>If I recall correctly, there was response that it seemed interesting, but 
>was not further discussed.  Do people think it might be a viable approach to 
>resolve the issue? 
>================
>As I understand it, and I agree, it might not solve all the issues (as it 
>stands, still thinking), but it does solve 2 types of issues:
>1. combination of modern Jamos that do combine to a Hangul syllable, e.g.:
>U+1109;U+1161;U+11BC  =>  U+C0C1
>In this case, the use of <U+1109;U+1161;U+11BC> would effectively be 
>disallowed.
>
>2. combination of modern Jamos with old Jamos which combine to 1 Hangul 
>syllable and 1 old Jamo, e.g.:
>U+1109;U+1161;U+11F0  =>  U+C0AC;U+11F0
>In this case, also, the use of <U+1109;U+1161;U+11F0> would be effectively 
>disallowed.
>It seems to me, if we are going to not disallow jamos, this would at least 
>be a measure to avoid some of the most obvious problems in the context of 
>IDN.
>The cases where no combination happens under KC are the cases which would 
>need further investigation.  It may be possible to add additional rules 
>based on the algorithms for displaying Hangul characters....?...
>=======
>Mark Davis <mark at macchiato.com> Wed, Oct 15, 2008 at 5:56 AM
>
>That is, each of the Hangul precomposed syllables decomposes into one or two
>
>one or two (wrong)---> two or three (correct) (Am I missing something here?)
>
>combining jamo under NFD, and under NFC that sequence of combining jamo composes
>back into that syllable. The comparisons *do* work correctly,
>since IDNA labels have to be in NFC.
>
>
>
>- Well, I would have to disagree with you.
>Let me explain why the above claim is not correct.
>
>
>- According to UCS (ISO/IEC 10646), each of the following three can
>represent Hangul syllable GGA:
>
>1) UAC01 (GGA)
>
>2) U1101 (GG), U1161 (A)
>
>3) U1100 (G), U1100 (G), U1161 (A)
>
>- By NFC, 2) U1101 (GG), U1161 (A) will be changed to 1) UAC01 (GGA);
>- However, by NFC, 3) U1100 (G), U1100 (G), U1161 (A) will be changed to
>U1100 (G), UAC00 (GA), which is "different" from 1) UAC01.
>
>The comparisons *do* work correctly,
>
>
>
>- ??? Isn't it considered comparison failure? (Am I missing something
>here?)
>- As we saw, NFC/NFD does not work correctly even for modern Hangul,
>(not to mention Old Hangul)!
>
>[comment by another WG member:
>
>This is indeed the correct analysis. I find it very unfortunate
>that U1101 (GG) does not have a *canonical* decomposition mapping
>to <U1100 (G), U1100 (G)> (etc. for all the other multi-letter
>Hangul Jamos). The Hangul script does NOT have a primitive Jamo
>GG. The Hangul GG is, by design, composed of two G Jamos, just
>like Latin GG is composed of two G letters.]
>
>[comment by another WG member:
>Reading through section 3.12 of Unicode 5.0 is somewhat confusing,
>because it tries to be very, very general for determining syllable
>boundaries (virtually everything goes, as long as you can somehow
>imagine that you might make a Korean syllable block out of it,
>even if no such block ever has been made), whereas the descriptions
>for canonical composition and decomposition are quite limited
>(one block <=> two or three Jamo, depending on whether there is
>a final consonant (group) or not). As an example, the sequence
>
>U+1101 (GG) U+1100 (G) U+1100 (G) U+1100 (G) U+1161 (A),
>summarily written GGGGGA, would be a "Standard Korean syllable
>block", too, the same way we would probably expect GGGGGA not to
>be broken up by a hyphenation algorithm, whether it looks totally
>silly (and in the Korean case, there's no way to display it as
>a reasonably-looking syllable block) or not.]
>
>[comment by another WG member:
>Well, it's very easy to take this position indeed. Also, it's
>also possible to take the position that U+110F is the result of
>adding a stroke to U+1100 (the equivalent, although in this day
>and age much less clear, example would be that G is just a
>Latin (in the true sense of the old Romans) C with a stroke or
>hook added). The Korean script is so well designed that it's
>difficult to know where to stop these decompositions.]
>
>------------
>
>Note. For example, in XML 1.0 (fourth ed), 2) U1101 (GG), U1161 (A) is NOT
>allowed since #x1101 is not included; in contrast, 3) U1100 (G), U1100 (G),
>U1161 (A) IS allowed.
>
>(source: <http://www.w3.org/TR/2006/REC-xml-20060816/#NT-BaseChar>)
>
>.#x1100 | [#x1102-#x1103] | [#x1105-#x1107] | #x1109 | [#x110B-#x110C] |
>[#x110E-#x1112] | #x113C | #x113E | #x1140 | #x114C | #x114E | #x1150 |
>[#x1154-#x1155] | #x1159 | [#x115F-#x1161] | #x1163 | #x1165 | #x1167 | #x1169 |
>[#x116D-#x116E] | [#x1172-#x1173] | #x1175 | #x119E | #x11A8 | #x11AB |
>[#x11AE-#x11AF] | [#x11B7-#x11B8] | #x11BA | [#x11BC-#x11C2] | #x11EB | #x11F0 |
>#x11F9 | ..
>
>KIM, Kyongsok
>* I have been a chair of Korea JTC1/SC2 (a committee on Coded Character
>Set) since 1993.
>This committee represents Korea in ISO/IEC JTC1/SC2 which is in charge of
>UCS (ISO/IEC 10646).
>
>====
>I would like to know where in ISO/IEC 10646 the type of sequence described
>in 3 is 'allowed' to represent such Hangul syllables. Because to the best of
>my knowledge it is not.
>If it is not, the whole argument falls flat.
>In XML 1.1, the syllable itself GGA is already allowed in the same BaseChar
>production list: "[#xAC00-#xD7A3]", so the Hangul syllable repertoire is
>already covered w/o adding that sequence explicitly, and the syllable is the
>NFC representation of the GGA syllable.
>
>Michel Suignard
>(project editor for 10646)
>
>[Note by another WG member:
>10646 is rather silent on that matter. But see the Unicode
>standard. In version 5.0 this is discussed in section 3.12,
>"Conjoining jamo behaviour". The key sentence there states:
>
>Unicode> Standard Korean syllable block: A sequence of one or more L
>Unicode> followed by a sequence of one or more V and a sequence of zero
>Unicode> or more T, or any other sequence that is canonically equivalent.]
>
>
>=====
>
>Unicode> Standard Korean syllable block: A sequence of one or more L
>Unicode> followed by a sequence of one or more V and a sequence of zero
>Unicode> or more T, or any other sequence that is canonically equivalent.
>
>Reading through section 3.12 of Unicode 5.0 is somewhat confusing,
>because it tries to be very, very general for determining syllable
>boundaries (virtually everything goes, as long as you can somehow
>imagine that you might make a Korean syllable block out of it,
>even if no such block ever has been made),
>
>"no such block ever has been made" is not a consideration for an
>alphabetic script, like Hangul. But there are practical limitations
>of size in this case, since one tries (in display/print) to fit all
>letters of a syllable into a graphical block the size of an ideograph.
>Some "syllables", like GGGGGA in Hangul, would simply be too crammed
>(unless the block size was gigantic). On the other hand, GGGGGA is
>not a very reasonable "syllable" in text representing real (and
>reasonably spelled) words.
>
>whereas the descriptions
>for canonical composition and decomposition are quite limited
>(one block <=> two or three Jamo, depending on whether there is
>a final consonant (group) or not).
>
>Yes, but that is only a subset of the possible (and reasonable)
>syllables that can be written in Hangul. It only covers (a superset
>of) what occurs in "modern Hangul" (modulo the multiletter issue),
>but has really nothing to do with how the Hangul script is constructed.
>
>As an example, the sequence
>
>U+1101 (GG) U+1100 (G) U+1100 (G) U+1100 (G) U+1161 (A),
>summarily written GGGGGA, would be a "Standard Korean syllable
>block", too, the same way we would probably expect GGGGGA not to
>be broken up by a hyphenation algorithm, whether it looks totally
>silly (and in the Korean case, there's no way to display it as
>a reasonably-looking syllable block) or not.
>
>
>KIM, Kyongsok wrote:
>... each of the following three can represent Hangul syllable GGA:
>1) UAC01 (GGA)
>2) U1101 (GG), U1161 (A)
>3) U1100 (G), U1100 (G), U1161 (A)
>- By NFC, 2) U1101 (GG), U1161 (A) will be changed to 1) UAC01 (GGA);
>- However, by NFC, 3) U1100 (G), U1100 (G), U1161 (A) will be changed to
>U1100 (G), UAC00 (GA), which is "different" from 1) UAC01.
>
>This is indeed the correct analysis. I find it very unfortunate
>that U1101 (GG) does not have a *canonical* decomposition mapping
>to <U1100 (G), U1100 (G)> (etc. for all the other multi-letter
>Hangul Jamos). The Hangul script does NOT have a primitive Jamo
>GG. The Hangul GG is, by design, composed of two G Jamos, just
>like Latin GG is composed of two G letters.
>
>Well, it's very easy to take this position indeed.
>
>And is also how the Hangul script was actually designed.
>
>Also, it's
>also possible to take the position that U+110F is the result of
>adding a stroke to U+1100
>
>That is not how the Hangul script was designed, and is thus a
>misinterpretation.
>
>(the equivalent, although in this day
>and age much less clear, example would be that G is just a
>Latin (in the true sense of the old Romans) C with a stroke or
>hook added). The Korean script is so well designed that it's
>difficult to know where to stop these decompositions.
>
>While that parallel holds for the strokes (dots originally, which
>was a bit unfortunate graphically) for the vowels (for instance
>Hangul O is NOT a composition of EU and ARAEA), and certain of the
>consonants (e.g. THIEUTH is a primitive letter that happens to have
>one more stroke than TIKEUT) it does not hold for the doubled consonants
>(like SSANGKIYEOK) nor for any of the other multiletter Jamos (like
>Hangul E *is* a composition of Hangul EO and Hangul I).
>
>The design documents, both editions, are quite clear on these matters.
>So there is no reason to guess how the Hangul letters, and letter
>combinations, are constructed. While one may find the philosophy
>for the graphical design of the individual letters to sometimes be
>a bit doubtful, esp. for the vowels, it is clear what are individual
>letters, and what are compositions of letters.
>
>See the original design document, translated to English in
>
>The Korean Language, Ho-Min Sohn, Cambridge University Press, 1999,
>ISBN 0-521-36123-0 or 0-521-36943-6. (Section 6.3 gives a translation
>to English of the 1444 design document for the Hangul alphabet.)
>
>Also (facsimile only, no translation), in
>
>A history of Korean Alphabet and Movable Types, Ministry of Culture
>and Information, Republic of Korea, 1970. (Part 1 reproduces the 1444
>official design document for the Hangul alphabet.)
>
>The revised and extended Hangul design document, reproduced,
>translated to English (and analysed) in:
>
>The Korean alphabet of 1446: Expositions, OPA, The visible speech
>sounds, Annotated translation, Future applicability; Hwun Min Ceng
>Um, Sek Yen Kim-Cho, Humanity Books and AC Press, New York, 2002,
>ISBN 89-428-1587-1. (Reproduces, translates and analyses (in English)
>the 1446 official design document for the Hangul alphabet.)
>
>The extended document from 1446 introduces the kapyeoun- combinations as
>compositions with IEUNG at the end. It is clear that the little circle
>below is really a IEUNG, not something else.
>
>This book from 2002 also introduces an interesting possible extension
>to Hangul, putting "annotations" on the (primitive) Hangul letters
>within syllable blocks, for use as a phonetic notation.
>
>Also of relevance:
>
>The Korean Alphabet, its history and structure, ed. Young-Key
>Kim-Renaud, University of Hawai'i Press, 1997, ISBN 0-824-81989-6.
>
>
>=====
>
>
>NOTE NEW BUSINESS ADDRESS AND PHONE
>Vint Cerf
>Google
>1818 Library Street, Suite 400
>Reston, VA 20190
>202-370-5637
>vint at google.com
>
>_______________________________________________
>Idna-update mailing list
>Idna-update at alvestrand.no
>http://www.alvestrand.no/mailman/listinfo/idna-update


#-#-#  Martin J. Dürst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst at it.aoyama.ac.jp     


