Consensus Call Tranche 8 Summary - Addendum

Mark Davis mark at macchiato.com
Mon Oct 20 11:57:50 CEST 2008


I think this subject has been seriously muddled by misinformation, thus
causing those who are not completely familiar with the way that Hangul is
encoded in Unicode to be misled.
The proposed label step:

toNFKC(toCaseFolded(toNFKC(label))) != label


is pointless, since case folding doesn't have any effect on Korean
characters (jamo or syllables) and any label is guaranteed to be in NFKC
format anyway, due to other provisions in Tables and Protocol. The
compatibility jamos are also not in question, since they are also not in NFKC.
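
As a rough illustration of that point, here is a small Python sketch. It
assumes that unicodedata.normalize("NFKC", ...) and str.casefold() are
acceptable stand-ins for toNFKC and toCaseFolded (Python's bundled Unicode
data is not necessarily the version Tables refers to), and the helper name
round_trip is made up for this example. For Hangul input that is already in
NFKC, the proposed step gives back the label unchanged:

```python
import unicodedata


def round_trip(label: str) -> str:
    """Approximate toNFKC(toCaseFolded(toNFKC(label))) with stdlib calls."""
    nfkc = unicodedata.normalize("NFKC", label)
    return unicodedata.normalize("NFKC", nfkc.casefold())


# Two labels that are already in NFKC: a precomposed syllable, and an
# archaic conjoining jamo sequence that has no precomposed form.
syllable = "\uC0C1"        # HANGUL SYLLABLE SANG
archaic = "\u1140\u1161"   # CHOSEONG PANSIOS + JUNGSEONG A

for label in (syllable, archaic):
    assert unicodedata.normalize("NFKC", label) == label  # already NFKC
    assert round_trip(label) == label                      # nothing changes

# A compatibility jamo is not in NFKC: it maps to a conjoining jamo.
compat = "\u3131"          # HANGUL LETTER KIYEOK
assert unicodedata.normalize("NFKC", compat) == "\u1100"
```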

The characters in question are all and only the following:

The conjoining Jamos.

U+1100 <http://unicode.org/cldr/utility/character.jsp?a=1100> HANGUL
CHOSEONG KIYEOK
…{88}…U+1159 <http://unicode.org/cldr/utility/character.jsp?a=1159> HANGUL
CHOSEONG YEORINHIEUH
U+1161 <http://unicode.org/cldr/utility/character.jsp?a=1161> HANGUL
JUNGSEONG A
…{64}…U+11A2 <http://unicode.org/cldr/utility/character.jsp?a=11A2> HANGUL
JUNGSEONG SSANGARAEA
U+11A8 <http://unicode.org/cldr/utility/character.jsp?a=11A8> HANGUL
JONGSEONG KIYEOK
…{80}…U+11F9 <http://unicode.org/cldr/utility/character.jsp?a=11F9> HANGUL
JONGSEONG YEORINHIEUH


The Hangul Syllables.

U+AC00 <http://unicode.org/cldr/utility/character.jsp?a=AC00> ( 가 ) HANGUL
SYLLABLE GA
…{11170}…U+D7A3 <http://unicode.org/cldr/utility/character.jsp?a=D7A3> ( 힣 )
HANGUL SYLLABLE HIH
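
For anyone who wants to test individual code points against these ranges, a
minimal Python sketch follows. The range endpoints are the ones listed above;
the names HANGUL_RANGES, hangul_class and range_size, and the short labels
L/V/T/S, are only illustrative.

```python
from typing import Optional

# Inclusive ranges exactly as listed above. Keys are illustrative labels:
# L = choseong, V = jungseong, T = jongseong, S = precomposed syllable.
HANGUL_RANGES = {
    "L": (0x1100, 0x1159),
    "V": (0x1161, 0x11A2),
    "T": (0x11A8, 0x11F9),
    "S": (0xAC00, 0xD7A3),
}


def hangul_class(ch: str) -> Optional[str]:
    """Return the label of the listed range containing ch, or None."""
    cp = ord(ch)
    for name, (lo, hi) in HANGUL_RANGES.items():
        if lo <= cp <= hi:
            return name
    return None


def range_size(name: str) -> int:
    """Number of code points in the named range, endpoints included."""
    lo, hi = HANGUL_RANGES[name]
    return hi - lo + 1


print(hangul_class("\u1100"), hangul_class("\uAC00"), hangul_class("\u3131"))
print({name: range_size(name) for name in HANGUL_RANGES})
```

The sizes are each two more than the elided counts shown in the lists, since
the counts leave out the two endpoints.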


Any sequence of Jamo that corresponds to a Hangul Syllable (HS) according
to the Unicode Standard canonical equivalence is transformed into that
syllable by the toNFKC function. Thus a sequence of Jamo that corresponds
to a HS according to the Unicode Standard *cannot* be in an
IDNA2008 label according to Tables and Protocol.

That is what is meant by saying that there is no comparison problem. Anything
that is equivalent to a HS according to TUS will already be a HS in an
IDNA2008 label according to Tables and Protocol.
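
To make this concrete, here is a small Python sketch (not part of Tables or
Protocol). It assumes unicodedata.normalize("NFKC", ...) as a stand-in for
toNFKC, and the helper name compose_syllable is mine; the arithmetic follows
the Hangul composition described in section 3.12 of the Unicode Standard. It
checks that a jamo sequence canonically equivalent to a syllable becomes that
syllable, while a sequence that is not canonically equivalent, such as the
<G, G, A> case quoted below, is only partly composed:

```python
import unicodedata

# Constants of the Hangul composition arithmetic (Unicode Standard, 3.12).
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def compose_syllable(l: str, v: str, t: str = "") -> str:
    """Compose one modern L, one modern V and an optional modern T."""
    l_index = ord(l) - L_BASE
    v_index = ord(v) - V_BASE
    t_index = ord(t) - T_BASE if t else 0
    t_ok = (t == "") or (1 <= t_index < T_COUNT)
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT and t_ok):
        raise ValueError("not a modern L V (T) jamo sequence")
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


def nfkc(s: str) -> str:
    return unicodedata.normalize("NFKC", s)


# Canonically equivalent to a syllable: composed by NFKC.
assert nfkc("\u1109\u1161\u11BC") == compose_syllable("\u1109", "\u1161", "\u11BC")
assert nfkc("\u1101\u1161") == compose_syllable("\u1101", "\u1161")

# <G, G, A> is not canonically equivalent to one syllable: NFKC leaves a
# lone U+1100 followed by the syllable GA (U+AC00).
assert nfkc("\u1100\u1100\u1161") == "\u1100\uAC00"

# A modern L V followed by an archaic T composes only the L V part.
assert nfkc("\u1109\u1161\u11F0") == "\uC0AC\u11F0"
```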

Now, one could have a contextual rule that forbade Jamo in situations where
they could not be part of a valid syllable, and if people really wanted that
we could do it.
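
For what it is worth, such a rule could look roughly like the Python sketch
below. This is only an illustration, not a proposal for Tables: the function
name is hypothetical, the jamo ranges are the ones listed above, and "valid
syllable" is read as the Unicode "standard Korean syllable block" shape (one
or more L, one or more V, zero or more T) after canonical decomposition.

```python
import re
import unicodedata

# Conjoining jamo ranges as listed earlier in this message.
L = "\u1100-\u1159"
V = "\u1161-\u11A2"
T = "\u11A8-\u11F9"

SYLLABLE_BLOCK = re.compile(f"[{L}]+[{V}]+[{T}]*")
ANY_JAMO = re.compile(f"[{L}{V}{T}]")


def jamo_in_syllable_context(label: str) -> bool:
    """True if every conjoining jamo lies inside an L+ V+ T* run.

    The label is decomposed with NFD first, so a precomposed syllable
    followed by a trailing jamo is treated like the equivalent jamo run.
    """
    decomposed = unicodedata.normalize("NFD", label)
    leftover = SYLLABLE_BLOCK.sub("", decomposed)
    return ANY_JAMO.search(leftover) is None


print(jamo_in_syllable_context("\u1140\u1161"))   # archaic L + V
print(jamo_in_syllable_context("\uC0AC\u11F0"))   # syllable + archaic T
print(jamo_in_syllable_context("\u11F0"))         # lone trailing jamo
```

Under this reading the first two labels pass and the lone trailing jamo is
rejected.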

But frankly, it is not worth the effort. Unlike the case of the ZW joiners,
these are not invisible characters; the worst that would happen is someone
would see nonsense on the screen -- but it is not our place to try to forbid
nonsensical labels.

This is clearly a case where the Korean NIC is free to narrow the set of
labels they accept to exclude non-modern characters, just as the German NIC
is free to exclude archaic German characters, or the British NIC free to
exclude archaic English characters (like Þ or ð).

Mark


On Mon, Oct 20, 2008 at 11:10 AM, Vint Cerf <vint at google.com> wrote:

> Consensus Call Tranche 8 (Character Adjustments) - Addendum
>
> I neglected to summarize a number of messages relating to the JAMO
> discussion (they had subject fields that were specific to the JAMO
> discussion and did not appear when all the email was sorted with the
> original subject of the consensus call)
>
> As a result, the polling actually produced 9 YES and 8 NO - still clearly
> no final consensus.
>
> (8.c) Disallow conjoining Hangul jamo per recommendation from
> KRNIC and others, permitting only precomposed syllables.
>
> COMMENTS:
>
> I agree with the line of thought that we really should not disregard the
> results of the consensus position established by the most relevant language
> community after a rather extensive consensus process, so in general, I would
> side with the experts in Korea.
>
>
> Nevertheless, having been through this discussion many times, I
> understand that there are opinions otherwise and am hoping to make a
> suggestion that could reconcile the lines of thought and be consistent with
> our architecture.  When we last discussed the issue of conjoining Hangul
> Jamo, I had suggested exploring the possibility of addressing them in the
> following manner:
>
>
> 1. categorize all Hangul Jamo as CONTEXTO
> 2. add stability contextual rule for these codepoints where the following
> must be true:
> toNFKC(toCaseFolded(toNFKC(label))) != label
>
>
>
>
> I am not familiar enough with Korean, but this might strike a graceful
> balance between disallowing conjoining jamo that form a modern Hangul
> syllable and continuing to allow archaic Jamo, without creating too much
> confusion?...
>
>
> If I recall correctly, there was response that it seemed interesting, but
> was not further discussed.  Do people think it might be a viable approach to
> resolve the issue?
>  ================
>
> As I understand it, and I agree, it might not solve all the issues (as it
> stands, still thinking), but it does solve 2 types of issues:
>
> 1. combination of modern Jamos that do combine to a Hangul syllable, e.g.:
>
> U+1109;U+1161;U+11BC  =>  U+C0C1
>
> In this case, the use of <U+1109;U+1161;U+11BC> would effectively be
> disallowed.
>
>
> 2. combination of modern Jamos with old Jamos which combine to 1 Hangul
> syllable and 1 old Jamo, e.g.:
>
> U+1109;U+1161;U+11F0  =>  U+C0AC;U+11F0
>
> In this case, also, the use of <U+1109;U+1161;U+11F0> would be effectively
> disallowed.
>
> It seems to me, if we are going to not disallow jamos, this would at least
> be a measure to avoid some of the most obvious problems in the context of
> IDN.
>
> The cases where no combination happens under KC are the cases which would
> need further investigation.  It may be possible to add additional rules
> based on the algorithms for displaying Hangul characters....?...
>
> =======
>
> > Mark Davis <mark at macchiato.com> Wed, Oct 15, 2008 at 5:56 AM
>
>
> > That is, each of the Hangul precomposed syllables decomposes into one or
> > two
>
>
> one or two (wrong)---> two or three (correct) (Am I missing something
> here?)
>
>
> > combining jamo under NFD, and under NFC that sequence of combining jamo
> > composes back into that syllable. The comparisons *do* work correctly,
> > since IDNA labels have to be in NFC.
>
>
> - Well, I would have to disagree with you.
> Let me explain why the above claim is not correct.
>
>
> - According to UCS (ISO/IEC 10646), each of the following three can
> represent Hangul syllable GGA:
>
> 1) UAE4C (GGA)
>
> 2) U1101 (GG), U1161 (A)
>
> 3) U1100 (G), U1100 (G), U1161 (A)
>
>  - By NFC, 2) U1101 (GG), U1161 (A) will be changed to 1) UAE4C (GGA);
>  - However, by NFC, 3) U1100 (G), U1100 (G), U1161 (A) will be changed to
> U1100 (G), UAC00 (GA), which is "different" from 1) UAE4C.
>
>
>  > The comparisons *do* work correctly,
>
>
> - ??? Isn't it considered comparison failure? (Am I missing something
> here?)
> - As we saw, NFC/NFD does not work correctly even for modern Hangul,
> (not to mention Old Hangul)!
>
> [comment by another WG member:
>
> This is indeed the correct analysis. I find it very unfortunate
> that U1101 (GG) does not have a *canonical* decomposition mapping
> to <U1100 (G), U1100 (G)> (etc. for all the other multi-letter
> Hangul Jamos). The Hangul script does NOT have a primitive Jamo
> GG. The Hangul GG is, by design, composed of two G Jamos, just
> like Latin GG is composed of two G letters.]
>
> [comment by another WG member:
> Reading through section 3.12 of Unicode 5.0 is somewhat confusing,
> because it tries to be very, very general for determining syllable
> boundaries (virtually everything goes, as long as you can somehow
> imagine that you might make a Korean syllable block out of it,
> even if no such block ever has been made), whereas the descriptions
> for canonical composition and decomposition are quite limited
> (one block <=> two or three Jamo, depending on whether there is
> a final consonant (group) or not). As an example, the sequence
>
> U+1101 (GG) U+1100 (G) U+1100 (G) U+1100 (G) U+1161 (A),
> summarily written GGGGGA, would be a "Standard Korean syllable
> block", too, the same way we would probably expect GGGGGA not to
> be broken up by a hyphenation algorithm, whether it looks totally
> silly (and in the Korean case, there's no way to display it as
> a reasonably-looking syllable block) or not.]
>
> [comment by another WG member:
> Well, it's very easy to take this position indeed. Also, it's
> also possible to take the position that U+110F is the result of
> adding a stroke to U+1100 (the equivalent, although in this day
> and age much less clear, example would be that G is just a
> Latin (in the true sense of the old Romans) C with a stroke or
> hook added). The Korean script is so well designed that it's
> difficult to know where to stop these decompositions.]
>
> ------------
>
> Note. For example, in XML 1.0 (fourth ed), 2) U1101 (GG), U1161 (A) is NOT
> allowed since #x1101 is not included; in contrast, 3) U1100 (G), U1100 (G),
> U1161 (A) IS allowed.
>
> (source: http://www.w3.org/TR/2006/REC-xml-20060816/#NT-BaseChar)
>
> .#x1100 | [#x1102-#x1103] | [#x1105-#x1107] | #x1109 | [#x110B-#x110C] |
>
> [#x110E-#x1112] | #x113C | #x113E | #x1140 | #x114C | #x114E | #x1150 |
>
> [#x1154-#x1155] | #x1159 | [#x115F-#x1161] | #x1163 | #x1165 | #x1167 |
> #x1169 |
>
> [#x116D-#x116E] | [#x1172-#x1173] | #x1175 | #x119E | #x11A8 | #x11AB |
>
> [#x11AE-#x11AF] | [#x11B7-#x11B8] | #x11BA | [#x11BC-#x11C2] | #x11EB |
> #x11F0 |
>
> #x11F9 | ..
>
> KIM, Kyongsok
> * I have been a chair of Korea JTC1/SC2 (a committee on Coded Character
> Set) since 1993.
> This committee represents Korea in ISO/IEC JTC1/SC2 which is in charge of
> UCS (ISO/IEC 10646).
>
> ====
> I would like to know where in ISO/IEC 10646 the type of sequence described
> in 3 is 'allowed' to represent such Hangul syllables. Because to the best of
> my knowledge it is not.
> If it is not, the whole argument falls flat.
> In XML 1.1, the syllable itself GGA is already allowed in the same BaseChar
> production list: "[#xAC00-#xD7A3]", so the Hangul syllable repertoire is
> already covered w/o adding that sequence explicitly, and the syllable is the
> NFC representation of the GGA syllable.
>
> Michel Suignard
> (project editor for 10646)
>
> [Note by another WG member:
> 10646 is rather silent on that matter. But see the Unicode
> standard. In version 5.0 this is discussed in section 3.12,
> "Conjoining jamo behaviour". The key sentence there states:
>
> Unicode> Standard Korean syllable block: A sequence of one or more L
> Unicode> followed by a sequence of one or more V and a sequence of zero
> Unicode> or more T, or any other sequence that is canonically equivalent.]
>
>
> =====
>
> Unicode> Standard Korean syllable block: A sequence of one or more L
> Unicode> followed by a sequence of one or more V and a sequence of zero
> Unicode> or more T, or any other sequence that is canonically equivalent.
>
> Reading through section 3.12 of Unicode 5.0 is somewhat confusing,
> because it tries to be very, very general for determining syllable
> boundaries (virtually everything goes, as long as you can somehow
> imagine that you might make a Korean syllable block out of it,
> even if no such block ever has been made),
>
> "no such block ever has been made" is not a consideration for an
> alphabetic script, like Hangul. But there are practical limitations
> of size in this case, since one tries (in display/print) to fit all
> letters of a syllable into a graphical block the size of an ideograph.
> Some "syllables", like GGGGGA in Hangul, would simply be too crammed
> (unless the block size was gigantic). On the other hand, GGGGGA is
> not a very reasonable "syllable" in text representing real (and
> reasonably spelled) words.
>
> whereas the descriptions
> for canonical composition and decomposition are quite limited
> (one block <=> two or three Jamo, depending on whether there is
> a final consonant (group) or not).
>
> Yes, but that is only a subset of the possible (and reasonable)
> syllables that can be written in Hangul. It only covers (a superset
> of) what occurs in "modern Hangul" (modulo the multiletter issue),
> but has really nothing to do with how the Hangul script is constructed.
>
> As an example, the sequence
>
> U+1101 (GG) U+1100 (G) U+1100 (G) U+1100 (G) U+1161 (A),
> summarily written GGGGGA, would be a "Standard Korean syllable
> block", too, the same way we would probably expect GGGGGA not to
> be broken up by a hyphenation algorithm, whether it looks totally
> silly (and in the Korean case, there's no way to display it as
> a reasonably-looking syllable block) or not.
>
>
> KIM, Kyongsok wrote:
> ... each of the following three can represent Hangul syllable GGA:
> 1) UAE4C (GGA)
> 2) U1101 (GG), U1161 (A)
> 3) U1100 (G), U1100 (G), U1161 (A)
>  - By NFC, 2) U1101 (GG), U1161 (A) will be changed to 1) UAE4C (GGA);
>  - However, by NFC, 3) U1100 (G), U1100 (G), U1161 (A) will be changed to
> U1100 (G), UAC00 (GA), which is "different" from 1) UAE4C.
>
> This is indeed the correct analysis. I find it very unfortunate
> that U1101 (GG) does not have a *canonical* decomposition mapping
> to <U1100 (G), U1100 (G)> (etc. for all the other multi-letter
> Hangul Jamos). The Hangul script does NOT have a primitive Jamo
> GG. The Hangul GG is, by design, composed of two G Jamos, just
> like Latin GG is composed of two G letters.
>
> Well, it's very easy to take this position indeed.
>
> And is also how the Hangul script was actually designed.
>
> Also, it's
> also possible to take the position that U+110F is the result of
> adding a stroke to U+1100
>
> That is not how the Hangul script was designed, and is thus a
> misinterpretation.
>
> (the equivalent, although in this day
> and age much less clear, example would be that G is just a
> Latin (in the true sense of the old Romans) C with a stroke or
> hook added). The Korean script is so well designed that it's
> difficult to know where to stop these decompositions.
>
> While that parallel holds for the strokes (dots originally, which
> was a bit unfortunate graphically) for the vowels (for instance
> Hangul O is NOT a composition of EU and ARAEA), and certain of the
> consonants (e.g. THIEUTH is a primitive letter that happens to have
> one more stroke than TIKEUT) it does not hold for the doubled consonants
> (like SSANGKIYEOK) nor for any of the other multiletter Jamos (like
> Hangul E *is* a composition of Hangul EO and Hangul I).
>
> The design documents, both editions, are quite clear on these matters.
> So there is no reason to guess how the Hangul letters, and letter
> combinations, are constructed. While one may find the philosophy
> for the graphical design of the individual letters to sometimes be
> a bit doubtful, esp. for the vowels, it is clear what are individual
> letters, and what are compositions of letters.
>
> See the original design document, translated to English in
>
> The Korean Language, Ho-Min Sohn, Cambridge University Press, 1999,
> ISBN 0-521-36123-0 or 0-521-36943-6. (Section 6.3 gives a translation
> to English of the 1444 design document for the Hangul alphabet.)
>
> Also (facsimile only, no translation), in
>
> A history of Korean Alphabet and Movable Types, Ministry of Culture
> and Information, Republic of Korea, 1970. (Part 1 reproduces the 1444
> official design document for the Hangul alphabet.)
>
> The revised and extended Hangul design document, reproduced,
> translated to English (and analysed) in:
>
> The Korean alphabet of 1446 – Expositions, OPA, The visible speech
> sounds, Annotated translation, Future applicability; Hwun Min Ceng
> Um, Sek Yen Kim-Cho, Humanity Books and AC Press, New York, 2002,
> ISBN 89-428-1587-1. (Reproduces, translates and analyses (in English)
> the 1446 official design document for the Hangul alphabet.)
>
> The extended document from 1446 introduces the kapyeoun- combinations as
> compositions with IEUNG at the end. It is clear that the little circle
> below is really a IEUNG, not something else.
>
> This book from 2002 also introduces an interesting possible extension
> to Hangul, putting "annotations" on the (primitive) Hangul letters
> within syllable blocks, for use as a phonetic notation.
>
> Also of relevance:
>
> The Korean Alphabet, its history and structure, ed. Young-Key
> Kim-Renaud, University of Hawai'i Press, 1997, ISBN 0-824-81989-6.
>
>
> =====
>
>
> NOTE NEW BUSINESS ADDRESS AND PHONE
> Vint Cerf
> Google
> 1818 Library Street, Suite 400
> Reston, VA 20190
> 202-370-5637
> vint at google.com
>
>
>
>
>
> _______________________________________________
> Idna-update mailing list
> Idna-update at alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/idna-update
>
>