Consensus Call Tranche 8 Summary - Addendum

Vint Cerf vint at google.com
Mon Oct 20 11:10:49 CEST 2008


Consensus Call Tranche 8 (Character Adjustments) - Addendum

I neglected to summarize a number of messages relating to the JAMO
discussion (they had subject fields that were specific to the JAMO
discussion and did not appear when all the email was sorted by the
original subject of the consensus call).

As a result, the polling actually produced 9 YES and 8 NO - still  
clearly no final consensus.

(8.c) Disallow conjoining Hangul jamo per recommendation from
KRNIC and others, permitting only precomposed syllables.

COMMENTS:

I agree with the line of thought that we really should not disregard  
the results of the consensus position established by the most  
relevant language community after a rather extensive consensus  
process, so in general, I would side with the experts in Korea.

Nevertheless, having been through this discussion many times, I
understand that there are other opinions, and I am hoping to make a
suggestion that could reconcile the lines of thought and be  
consistent with our architecture.  When we last discussed the issue  
of conjoining Hangul Jamo, I had suggested exploring the possibility  
of addressing them in the following manner:

1. categorize all Hangul Jamo as CONTEXTO
2. add stability contextual rule for these codepoints where the  
following must be true:
toNFKC(toCaseFolded(toNFKC(label))) != label
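
As a rough illustration of the stability expression above, here is a minimal Python sketch. It assumes that Python's `unicodedata.normalize("NFKC", ...)` and `str.casefold()` are acceptable stand-ins for toNFKC and toCaseFolded, and it reads the rule as rejecting any label that the mapping changes. The helper name `is_stable` is hypothetical and not taken from any IDNA draft.

```python
import unicodedata

def is_stable(label: str) -> bool:
    """True if NFKC -> case fold -> NFKC maps the label to itself."""
    nfkc = unicodedata.normalize("NFKC", label)
    folded = nfkc.casefold()  # assumption: stand-in for toCaseFolded
    return unicodedata.normalize("NFKC", folded) == label

# A sequence of modern conjoining jamo composes under NFKC, so it is not stable.
modern_jamo = "\u1109\u1161\u11BC"
# The precomposed syllable is already in its normalized form.
precomposed = "\uC0C1"

print(is_stable(modern_jamo))   # False
print(is_stable(precomposed))   # True
```

Under this reading, a label containing such a jamo sequence would fail the contextual rule, while the equivalent precomposed label would pass.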


I am not familiar enough with Korean, but this might strike a
graceful balance between disallowing conjoining jamo that form a
modern Hangul syllable and continuing to allow archaic jamo without
creating too much confusion?

If I recall correctly, there was a response that it seemed interesting,
but it was not discussed further.  Do people think it might be a viable
approach to resolve the issue?
  ================

As I understand it, and I agree, it might not solve all the issues  
(as it stands, still thinking), but it does solve 2 types of issues:

1. combination of modern Jamos that do combine to a Hangul syllable,  
e.g.:

U+1109;U+1161;U+11BC  =>  U+C0C1

In this case, the use of <U+1109;U+1161;U+11BC> would effectively be  
disallowed.


2. combination of modern Jamos with old Jamos which combine to 1  
Hangul syllable and 1 old Jamo, e.g.:

U+1109;U+1161;U+11F0  =>  U+C0AC;U+11F0

In this case, also, the use of <U+1109;U+1161;U+11F0> would be  
effectively disallowed.

It seems to me, if we are going to not disallow jamos, this would at  
least be a measure to avoid some of the most obvious problems in the  
context of IDN.
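
The two mappings above can be checked by hand with the Hangul syllable composition arithmetic described in the Unicode Standard (section 3.12). The following Python sketch uses the constants from that section; the function name `compose_syllable` is made up for illustration, and the sketch only covers the modern jamo ranges that take part in canonical composition.

```python
# Constants from the Hangul composition algorithm (Unicode Standard, section 3.12).
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28

def compose_syllable(l, v, t=None):
    """Return the precomposed syllable code point, or None if the jamo do not compose."""
    l_index = l - L_BASE
    v_index = v - V_BASE
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT):
        return None
    if t is None:
        t_index = 0
    else:
        t_index = t - T_BASE
        if not (1 <= t_index < T_COUNT):
            return None  # e.g. an old trailing jamo such as U+11F0
    return S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index

# Case 1: three modern jamo compose into a single syllable.
print(hex(compose_syllable(0x1109, 0x1161, 0x11BC)))  # 0xc0c1
# Case 2: the old trailing jamo does not compose, so only L+V is combined.
print(compose_syllable(0x1109, 0x1161, 0x11F0))       # None
print(hex(compose_syllable(0x1109, 0x1161)))          # 0xc0ac
```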

The cases where no combination happens under KC are the cases which
would need further investigation.  It may be possible to add
additional rules based on the algorithms for displaying Hangul
characters?

=======

 > Mark Davis <mark at macchiato.com> Wed, Oct 15, 2008 at 5:56 AM

 > That is, each of the Hangul precomposed syllables decomposes into
 > one or two

one or two (wrong) ---> two or three (correct) (Am I missing something
here?)

 > combining jamo under NFD, and under NFC that sequence of combining
 > jamo composes
 > back into that syllable. The comparisons *do* work correctly,
 > since IDNA labels have to be in NFC.

- Well, I would have to disagree with you.
Let me explain why the above claim is not correct.

- According to UCS (ISO/IEC 10646), each of the following three can
represent Hangul syllable GGA:

1) U+AE4C (GGA)
2) U+1101 (GG), U+1161 (A)
3) U+1100 (G), U+1100 (G), U+1161 (A)

  - By NFC, 2) U+1101 (GG), U+1161 (A) will be changed to 1) U+AE4C (GGA);

  - However, by NFC, 3) U+1100 (G), U+1100 (G), U+1161 (A) will be
changed to U+1100 (G), U+AC00 (GA), which is "different" from 1) U+AE4C.

  > The comparisons *do* work correctly,

- ??? Isn't this considered a comparison failure? (Am I missing something
here?)
- As we saw, NFC/NFD does not work correctly even for modern Hangul
(not to mention Old Hangul)!
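
The NFC behaviour described in this message can be reproduced with a short Python sketch. It assumes the `unicodedata` module of a current Python release reflects the normalization rules under discussion; the helper names `nfc` and `show` are illustrative only. The precomposed syllable is looked up by its Unicode character name rather than hard-coded.

```python
import unicodedata

def nfc(s):
    return unicodedata.normalize("NFC", s)

def show(s):
    """Format a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

gga = unicodedata.lookup("HANGUL SYLLABLE GGA")  # representation 1)
two_jamo = "\u1101\u1161"                        # representation 2): GG, A
three_jamo = "\u1100\u1100\u1161"                # representation 3): G, G, A

print(show(nfc(two_jamo)))      # a single precomposed syllable
print(show(nfc(three_jamo)))    # U+1100 U+AC00
print(nfc(two_jamo) == gga)     # True
print(nfc(three_jamo) == gga)   # False
```

Representation 3) therefore stays distinct from the other two after NFC, which is the comparison failure being pointed out.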

[comment by another WG member:

This is indeed the correct analysis. I find it very unfortunate
that U+1101 (GG) does not have a *canonical* decomposition mapping
to <U+1100 (G), U+1100 (G)> (etc. for all the other multi-letter
Hangul Jamos). The Hangul script does NOT have a primitive Jamo
GG. The Hangul GG is, by design, composed of two G Jamos, just
like Latin GG is composed of two G letters.]

[comment by another WG member:
Reading through section 3.12 of Unicode 5.0 is somewhat confusing,
because it tries to be very, very general for determining syllable
boundaries (virtually everything goes, as long as you can somehow
imagine that you might make a Korean syllable block out of it,
even if no such block ever has been made), whereas the descriptions
for canonical composition and decomposition are quite limited
(one block <=> two or three Jamo, depending on whether there is
a final consonant (group) or not). As an example, the sequence

U+1101 (GG) U+1100 (G) U+1100 (G) U+1100 (G) U+1161 (A),
summarily written GGGGGA, would be a "Standard Korean syllable
block", too, the same way we would probably expect GGGGGA not to
be broken up by a hyphenation algorithm, whether it looks totally
silly (and in the Korean case, there's no way to display it as
a reasonable-looking syllable block) or not.]

[comment by another WG member:
Well, it's very easy to take this position indeed. It's also
possible to take the position that U+110F is the result of
adding a stroke to U+1100 (the equivalent, although in this day
and age much less clear, example would be that G is just a
Latin (in the true sense of the old Romans) C with a stroke or
hook added). The Korean script is so well designed that it's
difficult to know where to stop these decompositions.]

------------

Note. For example, in XML 1.0 (fourth ed.), 2) U+1101 (GG), U+1161 (A)
is NOT allowed since #x1101 is not included; in contrast, 3) U+1100
(G), U+1100 (G), U+1161 (A) IS allowed.

(source: http://www.w3.org/TR/2006/REC-xml-20060816/#NT-BaseChar)

... | #x1100 | [#x1102-#x1103] | [#x1105-#x1107] | #x1109 | [#x110B-#x110C] |
[#x110E-#x1112] | #x113C | #x113E | #x1140 | #x114C | #x114E | #x1150 |
[#x1154-#x1155] | #x1159 | [#x115F-#x1161] | #x1163 | #x1165 | #x1167 | #x1169 |
[#x116D-#x116E] | [#x1172-#x1173] | #x1175 | #x119E | #x11A8 | #x11AB |
[#x11AE-#x11AF] | [#x11B7-#x11B8] | #x11BA | [#x11BC-#x11C2] | #x11EB | #x11F0 |
#x11F9 | ...
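
To make the XML observation concrete, the following Python sketch parses only the jamo fragment of the BaseChar production quoted above and tests the two sequences against it. It is not a full XML name-character check; the names `FRAGMENT`, `parse_ranges` and `in_fragment` are hypothetical, and the rest of the production is deliberately left out.

```python
import re

# Jamo portion of the BaseChar production, as quoted in the message above.
FRAGMENT = """
#x1100 | [#x1102-#x1103] | [#x1105-#x1107] | #x1109 | [#x110B-#x110C] |
[#x110E-#x1112] | #x113C | #x113E | #x1140 | #x114C | #x114E | #x1150 |
[#x1154-#x1155] | #x1159 | [#x115F-#x1161] | #x1163 | #x1165 | #x1167 | #x1169 |
[#x116D-#x116E] | [#x1172-#x1173] | #x1175 | #x119E | #x11A8 | #x11AB |
[#x11AE-#x11AF] | [#x11B7-#x11B8] | #x11BA | [#x11BC-#x11C2] | #x11EB | #x11F0 |
#x11F9
"""

def parse_ranges(production):
    """Turn '#xNNNN' and '[#xNNNN-#xMMMM]' items into (low, high) pairs."""
    pairs = re.findall(r"#x([0-9A-F]+)(?:-#x([0-9A-F]+))?", production)
    return [(int(lo, 16), int(hi or lo, 16)) for lo, hi in pairs]

RANGES = parse_ranges(FRAGMENT)

def in_fragment(text):
    """True if every character of text falls in one of the quoted ranges."""
    return all(any(lo <= ord(c) <= hi for lo, hi in RANGES) for c in text)

print(in_fragment("\u1101\u1161"))        # False: #x1101 is not listed
print(in_fragment("\u1100\u1100\u1161"))  # True
```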

KIM, Kyongsok

* I have been a chair of Korea JTC1/SC2 (a committee on Coded  
Character Set) since 1993.
This committee represents Korea in ISO/IEC JTC1/SC2 which is in  
charge of UCS (ISO/IEC 10646).

====
I would like to know where in ISO/IEC 10646 the type of sequence
described in 3) is ‘allowed’ to represent such Hangul syllables,
because to the best of my knowledge it is not.
If it is not, the whole argument falls flat.
In XML 1.1, the syllable GGA itself is already allowed in the same
BaseChar production list: “[#xAC00-#xD7A3]”, so the Hangul syllable
repertoire is already covered without adding that sequence explicitly,
and the syllable is the NFC representation of the GGA syllable.

Michel Suignard
(project editor for 10646)

[Note by another WG member:
10646 is rather silent on that matter. But see the Unicode
standard. In version 5.0 this is discussed in section 3.12,
"Conjoining jamo behaviour". The key sentence there states:

Unicode> Standard Korean syllable block: A sequence of one or more L
Unicode> followed by a sequence of one or more V and a sequence of zero
Unicode> or more T, or any other sequence that is canonically
Unicode> equivalent.]
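
The jamo part of the definition quoted above (one or more L, one or more V, zero or more T) can be sketched as a regular expression. The code point ranges below are an assumption taken from the present-day Hangul Jamo block (U+1100..U+11FF); Unicode 5.0 assigned fewer code points in each class. The sketch ignores the "canonically equivalent" clause, so precomposed syllables are not matched, and the name `is_jamo_syllable_block` is illustrative.

```python
import re

# Assumed character classes for conjoining jamo (current Hangul Jamo block).
L = "\u1100-\u115F"   # leading consonants (choseong)
V = "\u1160-\u11A7"   # vowels (jungseong)
T = "\u11A8-\u11FF"   # trailing consonants (jongseong)

SYLLABLE_BLOCK = re.compile(f"[{L}]+[{V}]+[{T}]*")

def is_jamo_syllable_block(text):
    """True if text is L+ V+ T* made only of conjoining jamo."""
    return SYLLABLE_BLOCK.fullmatch(text) is not None

# The GGGGGA example from this thread: GG, G, G, G, A.
print(is_jamo_syllable_block("\u1101\u1100\u1100\u1100\u1161"))  # True
# A vowel followed by a leading consonant is not a single block.
print(is_jamo_syllable_block("\u1161\u1100"))                    # False
```

This shows how permissive the definition is: it accepts sequences that could never be rendered as a sensible syllable block.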


=====

Unicode> Standard Korean syllable block: A sequence of one or more L
Unicode> followed by a sequence of one or more V and a sequence of zero
Unicode> or more T, or any other sequence that is canonically
Unicode> equivalent.

Reading through section 3.12 of Unicode 5.0 is somewhat confusing,
because it tries to be very, very general for determining syllable
boundaries (virtually everything goes, as long as you can somehow
imagine that you might make a Korean syllable block out of it,
even if no such block ever has been made),

"no such block ever has been made" is not a consideration for an
alphabetic script, like Hangul. But there are practical limitations
of size in this case, since one tries (in display/print) to fit all
letters of a syllable into a graphical block the size of an ideograph.
Some "syllables", like GGGGGA in Hangul, would simply be too crammed
(unless the block size was gigantic). On the other hand, GGGGGA is
not a very reasonable "syllable" in text representing real (and
reasonably spelled) words.

whereas the descriptions
for canonical composition and decomposition are quite limited
(one block <=> two or three Jamo, depending on whether there is
a final consonant (group) or not).

Yes, but that is only a subset of the possible (and reasonable)
syllables that can be written in Hangul. It only covers (a superset
of) what occurs in "modern Hangul" (modulo the multiletter issue),
but has really nothing to do with how the Hangul script is constructed.

As an example, the sequence

U+1101 (GG) U+1100 (G) U+1100 (G) U+1100 (G) U+1161 (A),
summarily written GGGGGA, would be a "Standard Korean syllable
block", too, the same way we would probably expect GGGGGA not to
be broken up by a hyphenation algorithm, whether it looks totally
silly (and in the Korean case, there's no way to display it as
a reasonable-looking syllable block) or not.


KIM, Kyongsok wrote:
... each of the following three can represent Hangul syllable GGA:
1) U+AE4C (GGA)
2) U+1101 (GG), U+1161 (A)
3) U+1100 (G), U+1100 (G), U+1161 (A)
  - By NFC, 2) U+1101 (GG), U+1161 (A) will be changed to 1) U+AE4C (GGA);
  - However, by NFC, 3) U+1100 (G), U+1100 (G), U+1161 (A) will
be changed to U+1100 (G), U+AC00 (GA), which is "different" from 1) U+AE4C.

This is indeed the correct analysis. I find it very unfortunate
that U+1101 (GG) does not have a *canonical* decomposition mapping
to <U+1100 (G), U+1100 (G)> (etc. for all the other multi-letter
Hangul Jamos). The Hangul script does NOT have a primitive Jamo
GG. The Hangul GG is, by design, composed of two G Jamos, just
like Latin GG is composed of two G letters.

Well, it's very easy to take this position indeed.

And is also how the Hangul script was actually designed.

It's also possible to take the position that U+110F is the result of
adding a stroke to U+1100

That is not how the Hangul script was designed, and is thus a
misinterpretation.

(the equivalent, although in this day
and age much less clear, example would be that G is just a
Latin (in the true sense of the old Romans) C with a stroke or
hook added). The Korean script is so well designed that it's
difficult to know where to stop these decompositions.

While that parallel holds for the strokes (dots originally, which
was a bit unfortunate graphically) for the vowels (for instance
Hangul O is NOT a composition of EU and ARAEA), and certain of the
consonants (e.g. THIEUTH is a primitive letter that happens to have
one more stroke than TIKEUT) it does not hold for the doubled consonants
(like SSANGKIYEOK) nor for any of the other multiletter Jamos (for
instance, Hangul E *is* a composition of Hangul EO and Hangul I).

The design documents, both editions, are quite clear on these matters.
So there is no reason to guess how the Hangul letters, and letter
combinations, are constructed. While one may find the philosophy
for the graphical design of the individual letters to sometimes be
a bit doubtful, esp. for the vowels, it is clear what are individual
letters, and what are compositions of letters.

See the original design document, translated to English in

	The Korean Language, Ho-Min Sohn, Cambridge University Press, 1999,
	ISBN 0-521-36123-0 or 0-521-36943-6. (Section 6.3 gives a translation
	to English of the 1444 design document for the Hangul alphabet.)

Also (facsimile only, no translation), in

	A history of Korean Alphabet and Movable Types, Ministry of Culture
	and Information, Republic of Korea, 1970. (Part 1 reproduces the 1444
	official design document for the Hangul alphabet.)

The revised and extended Hangul design document, reproduced,
translated to English (and analysed) in:

	The Korean alphabet of 1446 – Expositions, OPA, The visible speech
	sounds, Annotated translation, Future applicability; Hwun Min Ceng
	Um, Sek Yen Kim-Cho, Humanity Books and AC Press, New York, 2002,
	ISBN 89-428-1587-1. (Reproduces, translates and analyses (in English)
	the 1446 official design document for the Hangul alphabet.)

The extended document from 1446 introduces the kapyeoun- combinations as
compositions with IEUNG at the end. It is clear that the little circle
below is really an IEUNG, not something else.

This book from 2002 also introduces an interesting possible extension
to Hangul, putting "annotations" on the (primitive) Hangul letters
within syllable blocks, for use as a phonetic notation.

Also of relevance:

	The Korean Alphabet, its history and structure, ed. Young-Key
	Kim-Renaud, University of Hawai'i Press, 1997, ISBN 0-824-81989-6.


=====


NOTE NEW BUSINESS ADDRESS AND PHONE
Vint Cerf
Google
1818 Library Street, Suite 400
Reston, VA 20190
202-370-5637
vint at google.com



