<div dir="ltr">We move forward just as we do for the other scripts; we don&#39;t exclude non-modern characters, but registries are free to do that themselves.<div><br clear="all">Mark<br>

<br><br><div class="gmail_quote">On Mon, Oct 20, 2008 at 12:25 PM, Patrik Fältström <span dir="ltr">&lt;<a href="mailto:patrik@frobbit.se">patrik@frobbit.se</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<div class="Ih2E3d">On 20 okt 2008, at 11.57, Mark Davis wrote:<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

I think this subject has been seriously muddled by misinformation, thus<br>

causing those who are not completely familiar with the way that Hangul is<br>

encoded in Unicode to be misled.<br>

</blockquote>

<br></div>

What confuses me is that the Koreans say there is a problem, and you say there is not a problem.<br>

<br>

Specifically, I am confused by this first sentence of yours. No, I am not completely familiar with Hangul as I am not Korean, so of course I might be misled.<br>

<br>

But the people from Korea are people I have to trust as they are the ones that use the language. And I take it for granted that you do not imply the people from Korea are misled about how Hangul is encoded in Unicode?<br>

<br>

They say there is a problem. You say there is not any problem.<br>

<br>

How do we move forward?<br>

<br>

 &nbsp; Patrik<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="Ih2E3d">

The proposed label step:<br>

<br>

toNFKC(toCaseFolded(toNFKC(label))) != label<br>

<br>

<br>

is pointless, since case folding doesn&#39;t have any effect on Korean<br>

characters (jamo or syllables) and any label is guaranteed to be in NFKC<br>

format anyway, due to other provisions in Tables and Protocol. The<br>

compatibility jamos are also not in question, since they are also not in NFKC.<br>
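The claim that case folding is a no-op here can be spot-checked with a minimal sketch. It assumes Python's `str.casefold()` and `unicodedata.normalize("NFKC", ...)` are acceptable stand-ins for toCaseFolded and toNFKC, and it uses the code point ranges listed in this message. The helper name `unchanged_by_casefold` is made up for illustration.

```python
import unicodedata

# Ranges taken from the message: conjoining jamo and precomposed syllables.
JAMO = range(0x1100, 0x11FA)
SYLLABLES = range(0xAC00, 0xD7A4)


def unchanged_by_casefold(code_points):
    """Return True if str.casefold() leaves every code point untouched."""
    return all(chr(cp).casefold() == chr(cp) for cp in code_points)


print(unchanged_by_casefold(JAMO))
print(unchanged_by_casefold(SYLLABLES))

# A compatibility jamo (U+3131) is rewritten by NFKC to a conjoining jamo,
# so it cannot survive in a label that must already be in NFKC.
print(unicodedata.normalize("NFKC", "\u3131") == "\u1100")
```

The results depend on the Unicode data shipped with the Python build in use, not on Unicode 5.0 specifically.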

<br>

The characters in question are all and only the following:<br>

<br>

The conjoining Jamos.<br>

<br></div>

U+1100 &lt;<a href="http://unicode.org/cldr/utility/character.jsp?a=1100" target="_blank">http://unicode.org/cldr/utility/character.jsp?a=1100</a>&gt; HANGUL<br>

CHOSEONG KIYEOK<br>

…{88}…U+1159 &lt;<a href="http://unicode.org/cldr/utility/character.jsp?a=1159" target="_blank">http://unicode.org/cldr/utility/character.jsp?a=1159</a>&gt; HANGUL<br>

CHOSEONG YEORINHIEUH<br>

U+1161 &lt;<a href="http://unicode.org/cldr/utility/character.jsp?a=1161" target="_blank">http://unicode.org/cldr/utility/character.jsp?a=1161</a>&gt; HANGUL<br>

JUNGSEONG A<br>

…{64}…U+11A2 &lt;<a href="http://unicode.org/cldr/utility/character.jsp?a=11A2" target="_blank">http://unicode.org/cldr/utility/character.jsp?a=11A2</a>&gt; HANGUL<br>

JUNGSEONG SSANGARAEA<br>

U+11A8 &lt;<a href="http://unicode.org/cldr/utility/character.jsp?a=11A8" target="_blank">http://unicode.org/cldr/utility/character.jsp?a=11A8</a>&gt; HANGUL<br>

JONGSEONG KIYEOK<br>

…{80}…U+11F9 &lt;<a href="http://unicode.org/cldr/utility/character.jsp?a=11F9" target="_blank">http://unicode.org/cldr/utility/character.jsp?a=11F9</a>&gt; HANGUL<div class="Ih2E3d"><br>

JONGSEONG YEORINHIEUH<br>

<br>

<br>

The Hangul Syllables.<br>

<br></div>

U+AC00 &lt;<a href="http://unicode.org/cldr/utility/character.jsp?a=AC00" target="_blank">http://unicode.org/cldr/utility/character.jsp?a=AC00</a>&gt; ( 가 ) HANGUL<br>

SYLLABLE GA<br>

…{11170}…U+D7A3 &lt;<a href="http://unicode.org/cldr/utility/character.jsp?a=D7A3" target="_blank">http://unicode.org/cldr/utility/character.jsp?a=D7A3</a>&gt; ( 힣 )<div><div></div><div class="Wj3C7c"><br>

HANGUL SYLLABLE HIH<br>

<br>

<br>

Any sequence of conjoining Jamo that could correspond to a Hangul Syllable (HS)<br>

according to the Unicode Standard canonical equivalence is transformed into<br>

it by the toNFKC function. Thus a sequence of conjoining Jamo that could<br>

correspond to an HS according to the Unicode Standard *cannot* be in an<br>

IDNA2008 label according to Tables and Protocol.<br>

<br>

That&#39;s what is meant by saying that there is no comparison problem. Anything that<br>

is equivalent to an HS according to TUS will already be an HS in an IDNA2008<br>

label according to Tables and Protocol.<br>

<br>

Now, one could have a contextual rule that forbade Jamo in situations where<br>

they could not be part of a valid syllable, and if people really wanted that<br>

we could do it.<br>

<br>

But frankly, it is not worth the effort. Unlike the case of the ZW joiners,<br>

these are not invisible characters; the worst that would happen is someone<br>

would see nonsense on the screen -- but it is not our place to try to forbid<br>

nonsensical labels.<br>

<br>

This is clearly a case where the Korean NIC is free to narrow the set of<br>

labels they accept to exclude non-modern characters, just as the German NIC<br>

is free to exclude archaic German characters, or the British NIC free to<br>

exclude archaic English characters (like Þ or ð).<br>

<br>

Mark<br>

<br>

<br>

On Mon, Oct 20, 2008 at 11:10 AM, Vint Cerf &lt;<a href="mailto:vint@google.com" target="_blank">vint@google.com</a>&gt; wrote:<br>

<br>

</div></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div></div><div class="Wj3C7c">

Consensus Call Tranche 8 (Character Adjustments) - Addendum<br>

<br>

I neglected to summarize a number of messages relating to the JAMO<br>

discussion (they had subject fields that were specific to the JAMO<br>

discussion and did not appear when all the email was sorted with the<br>

original subject of the consensus call)<br>

<br>

As a result, the polling actually produced 9 YES and 8 NO - still clearly<br>

no final consensus.<br>

<br>

(8.c) Disallow conjoining Hangul jamo per recommendation from<br>

KRNIC and others, permitting only precomposed syllables.<br>

<br>

COMMENTS:<br>

<br>

I agree with the line of thought that we really should not disregard the<br>

results of the consensus position established by the most relevant language<br>

community after a rather extensive consensus process, so in general, I would<br>

side with the experts in Korea.<br>

<br>

<br>

Nevertheless, having been through this discussion many times, I<br>

understand that there are opinions otherwise and am hoping to make a<br>

suggestion that could reconcile the lines of thought and be consistent with<br>

our architecture. &nbsp;When we last discussed the issue of conjoining Hangul<br>

Jamo, I had suggested exploring the possibility of addressing them in the<br>

following manner:<br>

<br>

<br>

1. categorize all Hangul Jamo as CONTEXTO<br>

2. add stability contextual rule for these codepoints where the following<br>

must be true:<br>

toNFKC(toCaseFolded(toNFKC(label))) != label<br>

<br>

<br>

<br>

<br>

I am not familiar enough with Korean, but this might strike a graceful<br>

balance between disallowing conjoining jamo that form a modern Hangul syllable and<br>

continuing to allow archaic Jamo without creating too much confusion?...<br>

<br>

<br>

If I recall correctly, there was response that it seemed interesting, but<br>

was not further discussed. &nbsp;Do people think it might be a viable approach to<br>

resolve the issue?<br>

================<br>

<br>

As I understand it, and I agree, it might not solve all the issues (as it<br>

stands, still thinking), but it does solve 2 types of issues:<br>

<br>

1. combination of modern Jamos that do combine to a Hangul syllable, e.g.:<br>

<br>

U+1109;U+1161;U+11BC &nbsp;=&gt; &nbsp;U+C0C1<br>

<br>

In this case, the use of &lt;U+1109;U+1161;U+11BC&gt; would effectively be<br>

disallowed.<br>

<br>

<br>

2. combination of modern Jamos with old Jamos which combine to 1 Hangul<br>

syllable and 1 old Jamo, e.g.:<br>

<br>

U+1109;U+1161;U+11F0 &nbsp;=&gt; &nbsp;U+C0AC;U+11F0<br>

<br>

In this case, also, the use of &lt;U+1109;U+1161;U+11F0&gt; would be effectively<br>

disallowed.<br>
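The two cases above can be tried out with a small sketch. It assumes Python's `unicodedata.normalize("NFKC", ...)` and `str.casefold()` approximate toNFKC and toCaseFolded. The function name `changed_by_normalization` is invented here, and the archaic-only sample (U+1140 U+1161) is my own choice of an old jamo sequence, not one from the thread.

```python
import unicodedata


def changed_by_normalization(label):
    """True when toNFKC(toCaseFolded(toNFKC(label))) != label."""
    nfkc = unicodedata.normalize("NFKC", label)
    return unicodedata.normalize("NFKC", nfkc.casefold()) != label


samples = {
    "modern jamo": "\u1109\u1161\u11bc",      # case 1 above
    "modern + old jamo": "\u1109\u1161\u11f0",  # case 2 above
    "precomposed": "\uc0c1",
    "old jamo only": "\u1140\u1161",           # assumed example, no composition
}

for name, label in samples.items():
    print(name, changed_by_normalization(label))
```

A label for which the function returns True is one the comparison in the proposed rule would flag. The last sample illustrates the remaining cases where no combination happens under KC.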

<br>

It seems to me, if we are going to not disallow jamos, this would at least<br>

be a measure to avoid some of the most obvious problems in the context of<br>

IDN.<br>

<br>

The cases where no combination happens under KC are the cases which would<br>

need further investigation. &nbsp;It may be possible to add additional rules<br>

based on the algorithms for displaying Hangul characters....?...<br>

<br>

=======<br>

<br>

</div></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Mark Davis &lt;<a href="mailto:mark@macchiato.com" target="_blank">mark@macchiato.com</a>&gt; Wed, Oct 15, 2008<br>

</blockquote><div><div></div><div class="Wj3C7c">

at 5:56 AM<br>

<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

That is, each of the Hangul precomposed syllables decomposes into one or<br>

</blockquote>

two<br>

<br>

<br>

one or two (wrong)---&gt; two or three (correct) (Am I missing something<br>

here?)<br>
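The "two or three" correction can be checked mechanically. The following is a quick sketch, assuming Python's `unicodedata` NFD and NFC behave as the Unicode Standard describes for Hangul.

```python
import unicodedata
from collections import Counter

SYLLABLES = [chr(cp) for cp in range(0xAC00, 0xD7A4)]

# Count how many jamo each precomposed syllable decomposes into under NFD.
lengths = Counter(len(unicodedata.normalize("NFD", s)) for s in SYLLABLES)
print(dict(lengths))

# Check that NFC puts every decomposed sequence back together.
round_trips = all(
    unicodedata.normalize("NFC", unicodedata.normalize("NFD", s)) == s
    for s in SYLLABLES
)
print(round_trips)
```

The counts follow from the arithmetic: 19 x 21 syllables have no final consonant and 19 x 21 x 27 have one.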

<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

combining jamo under NFD, and under NFC that sequence of combining jamo<br>

</blockquote>

composes<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

back into that syllable. The comparisons *do* work correctly,<br>

since IDNA labels have to be in NFC.<br>

</blockquote>

<br>

<br>

- Well, I would have to disagree with you.<br>

Let me explain why the above claim is not correct.<br>

<br>

<br>

- According to UCS (ISO/IEC 10646), each of the following three can<br>

represent<br>

Hangul syllable GGA:<br>

<br>

1) UAC01 (GGA)<br>

<br>

2) U1101 (GG), U1161 (A)<br>

<br>

3) U1100 (G), U1100 (G), U1161 (A)<br>

<br>

- By NFC, 2) U1101 (GG), U1161 (A) will be changed to 1) UAC01 (GGA);<br>

- However, by NFC, 3) U1100 (G), U1100 (G), U1161 (A) will be changed to<br>

U1100 (G), UAC00 (GA), which is &quot;different&quot; from 1) UAC01.<br>

<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

The comparisons *do* work correctly,<br>

</blockquote>

<br>

<br>

- ??? Isn&#39;t this considered a comparison failure? (Am I missing something<br>

here?)<br>

- As we saw, NFC/NFD does not work correctly even for modern Hangul,<br>

(not to mention Old Hangul)!<br>

<br>

[comment by another WG member:<br>

<br>

This is indeed the correct analysis. I find it very unfortunate<br>

that U1101 (GG) does not have a *canonical* decomposition mapping<br>

to &lt;U1100 (G), U1100 (G)&gt; (etc. for all the other multi-letter<br>

Hangul Jamos). The Hangul script does NOT have a primitive Jamo<br>

GG. The Hangul GG is, by design, composed of two G Jamos, just<br>

like Latin GG is composed of two G letters.]<br>

<br>

[comment by another WG member:<br>

Reading through section 3.12 of Unicode 5.0 is somewhat confusing,<br>

because it tries to be very, very general for determining syllable<br>

boundaries (virtually everything goes, as long as you can somehow<br>

imagine that you might make a Korean syllable block out of it,<br>

even if no such block ever has been made), whereas the descriptions<br>

for canonical composition and decomposition are quite limited<br>

(one block &lt;=&gt; two or three Jamo, depending on whether there is<br>

a final consonant (group) or not). As an example, the sequence<br>

<br>

U+1101 (GG) U+1100 (G) U+1100 (G) U+1100 (G) U+1161 (A),<br>

summarily written GGGGGA, would be a &quot;Standard Korean syllable<br>

block&quot;, too, the same way we would probably expect GGGGGA not to<br>

be broken up by a hyphenation algorithm, whether it looks totally<br>

silly (and in the Korean case, there&#39;s no way to display it as<br>

a reasonably-looking syllable block) or not.]<br>

<br>

[comment by another WG member:<br>

Well, it&#39;s very easy to take this position indeed. Also, it&#39;s<br>

also possible to take the position that U+110F is the result of<br>

adding a stroke to U+1100 (the equivalent, although in this day<br>

and age much less clear, example would be that G is just a<br>

Latin (in the true sense of the old Romans) C with a stroke or<br>

hook added). The Korean script is so well designed that it&#39;s<br>

difficult to know where to stop these decompositions.]<br>

<br>

------------<br>

<br>

Note. For example, in XML 1.0 (fourth ed), 2) U1101 (GG), U1161 (A) is NOT<br>

allowed since #x1101 is not included; in contrast, 3) U1100 (G), U1100 (G),<br>

U1161 (A) IS allowed.<br>

<br></div></div>

(source: <a href="http://www.w3.org/TR/2006/REC-xml-20060816/#NT-BaseChar" target="_blank">http://www.w3.org/TR/2006/REC-xml-20060816/#NT-BaseChar</a><div>

<div></div><div class="Wj3C7c"><br>

)<br>

<br>

.#x1100 | [#x1102-#x1103] | [#x1105-#x1107] | #x1109 | [#x110B-#x110C] |<br>

<br>

[#x110E-#x1112] | #x113C | #x113E | #x1140 | #x114C | #x114E | #x1150 |<br>

<br>

[#x1154-#x1155] | #x1159 | [#x115F-#x1161] | #x1163 | #x1165 | #x1167 |<br>

#x1169 |<br>

<br>

[#x116D-#x116E] | [#x1172-#x1173] | #x1175 | #x119E | #x11A8 | #x11AB |<br>

<br>

[#x11AE-#x11AF] | [#x11B7-#x11B8] | #x11BA | [#x11BC-#x11C2] | #x11EB |<br>

#x11F0 |<br>

<br>

#x11F9 | ..<br>
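The note about #x1101 can be checked against the excerpt above with a small sketch. The table below copies only the ranges quoted in this message. The full BaseChar production is longer, so the check says nothing about code points outside this excerpt. The names `EXCERPT` and `in_excerpt` are made up.

```python
# (first, last) pairs copied from the quoted part of the BaseChar production.
EXCERPT = [
    (0x1100, 0x1100), (0x1102, 0x1103), (0x1105, 0x1107), (0x1109, 0x1109),
    (0x110B, 0x110C), (0x110E, 0x1112), (0x113C, 0x113C), (0x113E, 0x113E),
    (0x1140, 0x1140), (0x114C, 0x114C), (0x114E, 0x114E), (0x1150, 0x1150),
    (0x1154, 0x1155), (0x1159, 0x1159), (0x115F, 0x1161), (0x1163, 0x1163),
    (0x1165, 0x1165), (0x1167, 0x1167), (0x1169, 0x1169), (0x116D, 0x116E),
    (0x1172, 0x1173), (0x1175, 0x1175), (0x119E, 0x119E), (0x11A8, 0x11A8),
    (0x11AB, 0x11AB), (0x11AE, 0x11AF), (0x11B7, 0x11B8), (0x11BA, 0x11BA),
    (0x11BC, 0x11C2), (0x11EB, 0x11EB), (0x11F0, 0x11F0), (0x11F9, 0x11F9),
]


def in_excerpt(code_point):
    """Return True if the code point falls in one of the quoted ranges."""
    return any(first <= code_point <= last for first, last in EXCERPT)


for cp in (0x1100, 0x1101, 0x1161):
    print(f"#x{cp:04X}", in_excerpt(cp))
```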

<br>

KIM, Kyongsok<br>

* I have been a chair of Korea JTC1/SC2 (a committee on Coded Character<br>

Set) since 1993.<br>

This committee represents Korea in ISO/IEC JTC1/SC2 which is in charge of<br>

UCS (ISO/IEC 10646).<br>

<br>

====<br>

I would like to know where in ISO/IEC 10646 the type of sequence described<br>

in 3 is &#39;allowed&#39; to represent such Hangul syllables. Because to the best of<br>

my knowledge it is not.<br>

If it is not, the whole argument falls flat.<br>

In XML 1.1, the syllable itself GGA is already allowed in the same BaseChar<br>

production list: &quot;[#xAC00-#xD7A3]&quot;, so the Hangul syllable repertoire is<br>

already covered w/o adding that sequence explicitly, and the syllable is the<br>

NFC representation of the GGA syllable.<br>

<br>

Michel Suignard<br>

(project editor for 10646)<br>

<br>

[Note by another WG member:<br>

10646 is rather silent on that matter. But see the Unicode<br>

standard. In version 5.0 this is discussed in section 3.12,<br>

&quot;Conjoining jamo behaviour&quot;. The key sentence there states:<br>

<br>

Unicode&gt; Standard Korean syllable block: A sequence of one or more L<br>

Unicode&gt; followed by a sequence of one or more V and a sequence of zero<br>

Unicode&gt; or more T, or any other sequence that is canonically equivalent.]<br>

<br>

<br>

=====<br>

<br>

Unicode&gt; Standard Korean syllable block: A sequence of one or more L<br>

Unicode&gt; followed by a sequence of one or more V and a sequence of zero<br>

Unicode&gt; or more T, or any other sequence that is canonically equivalent.<br>
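The quoted definition reads naturally as a regular expression. Here is a rough sketch, assuming the L, V and T ranges are the conjoining jamo ranges listed earlier in this thread (U+1100..U+1159, U+1161..U+11A2, U+11A8..U+11F9). It leaves out the filler characters and the "canonically equivalent" clause, so precomposed syllables are not matched. The names `BLOCK` and `is_syllable_block` are invented.

```python
import re

# Character classes built from the ranges quoted in this thread (an assumption;
# the standard's own L and V classes also cover filler characters).
L = "[\u1100-\u1159]"
V = "[\u1161-\u11A2]"
T = "[\u11A8-\u11F9]"

# One or more L, then one or more V, then zero or more T.
BLOCK = re.compile(f"{L}+{V}+{T}*")


def is_syllable_block(text):
    """Return True if the whole string is a single L+ V+ T* sequence."""
    return BLOCK.fullmatch(text) is not None


ggggga = "\u1101\u1100\u1100\u1100\u1161"
print(is_syllable_block(ggggga))
print(is_syllable_block("\u1161"))
```

This shows how permissive the definition is: GGGGGA qualifies as a block even though canonical composition only ever covers two or three jamo.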

<br>

Reading through section 3.12 of Unicode 5.0 is somewhat confusing,<br>

because it tries to be very, very general for determining syllable<br>

boundaries (virtually everything goes, as long as you can somehow<br>

imagine that you might make a Korean syllable block out of it,<br>

even if no such block ever has been made),<br>

<br>

&quot;no such block ever has been made&quot; is not a consideration for an<br>

alphabetic script, like Hangul. But there are practical limitations<br>

of size in this case, since one tries (in display/print) to fit all<br>

letters of a syllable into a graphical block the size of an ideograph.<br>

Some &quot;syllables&quot;, like GGGGGA in Hangul, would simply be too crammed<br>

(unless the block size was gigantic). On the other hand, GGGGGA is<br>

not a very reasonable &quot;syllable&quot; in text representing real (and<br>

reasonably spelled) words.<br>

<br>

whereas the descriptions<br>

for canonical composition and decomposition are quite limited<br>

(one block &lt;=&gt; two or three Jamo, depending on whether there is<br>

a final consonant (group) or not).<br>

<br>

Yes, but that is only a subset of the possible (and reasonable)<br>

syllables that can be written in Hangul. It only covers (a superset<br>

of) what occurs in &quot;modern Hangul&quot; (modulo the multiletter issue),<br>

but has really nothing to do with how the Hangul script is constructed.<br>

<br>

As an example, the sequence<br>

<br>

U+1101 (GG) U+1100 (G) U+1100 (G) U+1100 (G) U+1161 (A),<br>

summarily written GGGGGA, would be a &quot;Standard Korean syllable<br>

block&quot;, too, the same way we would probably expect GGGGGA not to<br>

be broken up by a hyphenation algorithm, whether it looks totally<br>

silly (and in the Korean case, there&#39;s no way to display it as<br>

a reasonably-looking syllable block) or not.<br>

<br>

<br>

KIM, Kyongsok wrote:<br>

... each of the following three can represent Hangul syllable GGA:<br>

1) UAC01 (GGA)<br>

2) U1101 (GG), U1161 (A)<br>

3) U1100 (G), U1100 (G), U1161 (A)<br>

- By NFC, 2) U1101 (GG), U1161 (A) will be changed to 1) UAC01 (GGA);<br>

- However, by NFC, 3) U1100 (G), U1100 (G), U1161 (A) will<br>

be changed to<br>

U1100 (G), UAC00 (GA), which is &quot;different&quot; from 1) UAC01.<br>

<br>

This is indeed the correct analysis. I find it very unfortunate<br>

that U1101 (GG) does not have a *canonical* decomposition mapping<br>

to &lt;U1100 (G), U1100 (G)&gt; (etc. for all the other multi-letter<br>

Hangul Jamos). The Hangul script does NOT have a primitive Jamo<br>

GG. The Hangul GG is, by design, composed of two G Jamos, just<br>

like Latin GG is composed of two G letters.<br>

<br>

Well, it&#39;s very easy to take this position indeed.<br>

<br>

And is also how the Hangul script was actually designed.<br>

<br>

Also, it&#39;s<br>

also possible to take the position that U+110F is the result of<br>

adding a stroke to U+1100<br>

<br>

That is not how the Hangul script was designed, and is thus a<br>

misinterpretation.<br>

<br>

(the equivalent, although in this day<br>

and age much less clear, example would be that G is just a<br>

Latin (in the true sense of the old Romans) C with a stroke or<br>

hook added). The Korean script is so well designed that it&#39;s<br>

difficult to know where to stop these decompositions.<br>

<br>

While that parallel holds for the strokes (dots originally, which<br>

was a bit unfortunate graphically) for the vowels (for instance<br>

Hangul O is NOT a composition of EU and ARAEA), and certain of the<br>

consonants (e.g. THIEUTH is a primitive letter that happens to have<br>

one more stroke than TIKEUT) it does not hold for the doubled consonants<br>

(like SSANGKIYEOK) nor for any of the other multiletter Jamos (like<br>

Hangul E *is* a composition of Hangul EO and Hangul I).<br>

<br>

The design documents, both editions, are quite clear on these matters.<br>

So there is no reason to guess how the Hangul letters, and letter<br>

combinations, are constructed. While one may find the philosophy<br>

for the graphical design of the individual letters to sometimes be<br>

a bit doubtful, esp. for the vowels, it is clear what are individual<br>

letters, and what are compositions of letters.<br>

<br>

See the original design document, translated to English in<br>

<br>

The Korean Language, Ho-Min Sohn, Cambridge University Press, 1999,<br>

ISBN 0-521-36123-0 or 0-521-36943-6. (Section 6.3 gives a translation<br>

to English of the 1444 design document for the Hangul alphabet.)<br>

<br>

Also (facsimile only, no translation), in<br>

<br>

A history of Korean Alphabet and Movable Types, Ministry of Culture<br>

and Information, Republic of Korea, 1970. (Part 1 reproduces the 1444<br>

official design document for the Hangul alphabet.)<br>

<br>

The revised and extended Hangul design document, reproduced,<br>

translated to English (and analysed) in:<br>

<br>

The Korean alphabet of 1446 – Expositions, OPA, The visible speech<br>

sounds, Annotated translation, Future applicability; Hwun Min Ceng<br>

Um, Sek Yen Kim-Cho, Humanity Books and AC Press, New York, 2002,<br>

ISBN 89-428-1587-1. (Reproduces, translates and analyses (in English)<br>

the 1446 official design document for the Hangul alphabet.)<br>

<br>

The extended document from 1446 introduces the kapyeoun- combinations as<br>

compositions with IEUNG at the end. It is clear that the little circle<br>

below is really an IEUNG, not something else.<br>

<br>

This book from 2002 also introduces an interesting possible extension<br>

to Hangul, putting &quot;annotations&quot; on the (primitive) Hangul letters<br>

within syllable blocks, for use as a phonetic notation.<br>

<br>

Also of relevance:<br>

<br></div></div><div class="Ih2E3d">

The Korean Alphabet, its history and structure, ed. Young-Key<br>

Kim-Renaud, University of Hawai&#39;i Press, 1997, ISBN 0-824-81989-6.<br>

<br>

<br>

=====<br>

<br>

<br>

NOTE NEW BUSINESS ADDRESS AND PHONE<br>

Vint Cerf<br>

Google<br>

1818 Library Street, Suite 400<br>

Reston, VA 20190<br>

202-370-5637<br>

<a href="mailto:vint@google.com" target="_blank">vint@google.com</a><br>

<br>

<br>

<br>

<br>

<br></div>

_______________________________________________<br>

Idna-update mailing list<br>

<a href="mailto:Idna-update@alvestrand.no" target="_blank">Idna-update@alvestrand.no</a><br>

<a href="http://www.alvestrand.no/mailman/listinfo/idna-update" target="_blank">http://www.alvestrand.no/mailman/listinfo/idna-update</a><br>

<br>

<br>

</blockquote>


</blockquote>

<br>

</blockquote></div><br></div></div>