Hangul jamo issues

Soobok Lee lsb at lsb.org
Tue Jan 2 15:30:38 CET 2007


On Tue, Jan 02, 2007 at 07:28:40AM -0500, John C Klensin wrote:
> 
> > This is the issue list for Hangul Jamos:
> > 
> > Hangul Jamo ( in Range: 1100-11FF)
> >   These should be available as input and allowed in labels, to
> >   make jamo-only sequences of labels and archaic hangul
> >   syllables of labels.  We already have registrations using
> >   these characters in IDN.com.
> 
> For a number of reasons, and with no disrespect to registrations
> in COM, what is permitted under the JET tables and associated
> rules (assuming they exist) for .KR?   Looking at your slightly
> later note, I find it interesting that neither we nor, as far as
> I know, ICANN have heard any requests for changes in this area
> from NIDA, despite a great many comments from NIDA or the
> government on other IDN-related areas.

NIDA (formerly KRNIC) did suggest a registration guideline/policy
for IDN, but that was only the first version, and the plan was to
include more code points for the Korean language in the next one.

For example, NIDA did not include any CJK code points in that Korean
table.  Does the Korean language have no need for CJK characters at all?
No way! Their inclusion was only postponed to avoid the TC/SC
unification issue. I was present at the NIDA WG for IDN.

Every Korean newborn is given both a hangul name and a CJK name
by the parents.  Korean domestic law requires that a CJK name be
registered for every newborn. Every adult Korean has his/her
CJK name printed on his/her Residence ID Card.

Moreover, the future Stringprep200x is not only for IDNAbis, but also
for other applications such as SASL. We need a more inclusive
Stringprep200x.

> 
> To repeat what has been said in other areas, the fact that a
> sequence is legitimate in some present or past use of the
> language, or that it would be comprehensible if used in a name,
> does not imply a "right" to have it included in the DNS.  We
> should be careful about excluding it.   But we should also not
> assume that, because it is possible and sources of conflicts
> cannot easily be identified, permitting it is a good idea.

Hangul jamo sequences have been legitimate _by definition_
and _by tradition_.  No room for debate!

Some confusable combinations of jamo sequences, as described
below, should be managed by registration policies.

> 
> > Hangul Compatibility Jamo ( in Range: 3130-318F)
> >   These  should be available as input and be mapped
> >   into Hangul Jamo Range 1100-11FF by IDNAbis preprocessing
> > stage in applications.
> 
> I believe we need to assume that every instance in which 
>    ToUnicode(ToASCII(label)) != label
> is trouble waiting to happen.  Requiring the character mapping
> that causes it to occur as part of the standard should be
> avoided unless there is a compelling reason (case-mapping for
> consistency with ASCII label behavior is, for me, one such
> compelling reason).  If the relationship is an artifact of
> Unicode (or other CCS) decisions about whether or not
> conventional characters should be assigned separate code points,
> then I think that any mappings should lie outside the standard
> and in UIs, partially to help make it clear that IDNA-canonical
> forms, and only those forms, should be used in interchange and
> on the wire.  
> 
> For a user typing an IDN or IRI into an application, there is no
> difference between something that is done in a UI and something
> required as part of the standard.  However, the former would
> become invalid as part of IRIs to be transmitted across the wire
> or embedded in a message to others.

I see. Then why not _selectively_ include the compatibility mappings
of NFKC in the future Stringprep200x?

The compatibility mappings (those in NFKC but not in NFC) should be
selectively included in the Stringprep200x of IDNAbis under a
"same glyph" rule. That is, if a compatibility mapping produces the
same glyph and the same number of characters as the input character,
that mapping should be included in Stringprep200x.
"Circled a" and ligatures can be excluded.

Compatibility mappings that do not cause glyph changes, like those
above, are as important as case folding, which does cause glyph changes.
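
To make the idea concrete, here is a rough sketch in Python. It is
only an illustration, not a proposal for the actual tables. The
"same glyph" test cannot be computed from the Unicode data, so the
sketch approximates it with an assumption of mine: accept only
mappings tagged <compat>, <narrow> or <wide> that yield exactly one
character. The function name and the tag list are hypothetical.

```python
import unicodedata

# Hypothetical proxy for the "same glyph" rule: decomposition tags that are
# assumed (not guaranteed) to keep the look of the character.
ACCEPTED_TAGS = {"<compat>", "<narrow>", "<wide>"}


def selective_mapping(ch):
    """Return the mapped character, or None if the mapping is not selected."""
    fields = unicodedata.decomposition(ch).split()
    # No decomposition, or a canonical one (no tag): nothing to select here.
    if not fields or not fields[0].startswith("<"):
        return None
    tag, targets = fields[0], fields[1:]
    # "Same number of characters": exactly one target code point.
    if tag not in ACCEPTED_TAGS or len(targets) != 1:
        return None
    mapped = unicodedata.normalize("NFKC", ch)
    return mapped if len(mapped) == 1 else None


for sample in ["\u3131", "\uFFA1", "\u24D0", "\uFB01"]:
    result = selective_mapping(sample)
    shown = "excluded" if result is None else f"U+{ord(result):04X}"
    print(f"U+{ord(sample):04X} -> {shown}")
```

Under this proxy the compat jamo U+3131 and the half-width jamo U+FFA1
are selected, while "circled a" (U+24D0) and the "fi" ligature (U+FB01)
are not. A real table would still need a manual review of glyphs.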

> 
> >   Ordinary Korean users can type in only these Compatibility
> >   Jamos and cannot type directly those in 1100-11FF (in
> >   Windows).  NFKC does this mapping (and composing), but NFC
> >   does not.
> >     3164 === U+1160 : compatibility equivalence for hangul filler
> >     3131 === U+1100 : compatibility equivalence for initial KI-EOK
> >     and so on.
> 
> In general, any time one starts talking about what users can
> type, one is out at the UI level of abstraction.   In other
> words, what people can type and, more important, what shows up
> at the interface to an application after they type it, is the
> consequence of user interface and operating system design
> issues, not something that should be compelling as part of
> application protocol design. 

No, that does not come from a UI/OS design decision, but from
the KSC5601<->Unicode code conversion table itself.
The tables are on the Unicode Consortium web site.

http://unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/OLD5601.TXT
Copyright (c) 1991-1994 Unicode, Inc.  All Rights reserved.

KSC5601 UNICODE  DESCRIPTION
0x2421	0x3131	# HANGUL LETTER KIYEOK
0x2422	0x3132	# HANGUL LETTER SSANGKIYEOK
0x2423	0x3133	# HANGUL LETTER KIYEOK-SIOS
0x2424	0x3134	# HANGUL LETTER NIEUN
0x2425	0x3135	# HANGUL LETTER NIEUN-CIEUC
0x2426	0x3136	# HANGUL LETTER NIEUN-HIEUH

You see 0x31xx compat jamo code points in the second column (not 0x11xx).

So all Linux/Solaris/Windows OSes and UIs share the same input problem
if they rely on this KSC5601->Unicode conversion table.

With NFKC, we had no such problem in dealing with this.
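
The same thing can be checked with a few lines of Python. This is
just a sketch: it assumes that Python's "euc_kr" codec follows the
table above for this code, with the high bit set on each KSC5601 byte.

```python
import unicodedata

# KSC5601 0x2421 (first row of the table above), carried in EUC-KR
# with the high bit set on each byte.
ksc_code = 0x2421
euc_bytes = bytes([(ksc_code >> 8) | 0x80, (ksc_code & 0xFF) | 0x80])

# The conversion yields a compatibility jamo (0x31xx), not a jamo (0x11xx).
decoded = euc_bytes.decode("euc_kr")
print(f"decoded: U+{ord(decoded):04X} {unicodedata.name(decoded)}")

# NFC leaves the compatibility jamo alone; NFKC maps it into 1100-11FF.
nfc = unicodedata.normalize("NFC", decoded)
nfkc = unicodedata.normalize("NFKC", decoded)
print(f"NFC:     U+{ord(nfc):04X} {unicodedata.name(nfc)}")
print(f"NFKC:    U+{ord(nfkc):04X} {unicodedata.name(nfkc)}")
```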



> 
> >   Need of jamo sequences in inputs:
> >    KSC5601 has only the standard 2350 hangul syllables, while its
> >    Windows-specific extension (CP949) has the full set of 11172
> >    hangul syllables.  Microsoft added those thousands of
> >    characters to serve Korean users' needs, especially those of
> >    teenagers and scholars.
> >    So, in a Linux x-terminal, for example, we cannot type
> >    these extended syllables directly, but can type in only
> >    compat. jamo sequences.  And some code conversion
> >    tools (cp949 -> ksc5601) may transform extended hangul
> >    syllables into compat. jamo sequences.  If these compat.
> >    jamo sequences are mapped into jamo sequences (u+11xx) by
> >    the preprocessing stage in IDNAbis,
> >    NFC in IDNAbis would further combine these sequences into
> >    composed hangul syllables.
> 
> Hmm.  I look at that explanation and it seems to me to be a
> strong reason to ban these _in the protocol_: mapping them in
> and out of IDNA/punycode form is going to yield characters
> different from what the user typed in and, 

Exactly. They may take on a different *combined* look if mapped
into the u+11xx range, due to NFC.
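
A small Python sketch of this effect follows. Note the assumption:
Python's built-in "idna" codec implements the current IDNA (2003)
with nameprep/NFKC, not IDNAbis, so it only illustrates what the
mapping plus composition does to a typed label.

```python
import unicodedata

# What the user typed: compat jamo KIYEOK (U+3131) + A (U+314F).
typed = "\u3131\u314F"

# NFC alone leaves the compat jamos untouched.
nfc = unicodedata.normalize("NFC", typed)

# NFKC maps them to U+1100 U+1161 and then composes one syllable.
nfkc = unicodedata.normalize("NFKC", typed)

# Jamos in the u+11xx range are composed by NFC itself:
# initial KIYEOK + vowel A + final KIYEOK become one syllable.
composed = unicodedata.normalize("NFC", "\u1100\u1161\u11A8")

# Round trip through the IDNA2003 codec: ToUnicode(ToASCII(label)).
wire = typed.encode("idna")
back = wire.decode("idna")

print("NFC unchanged:     ", nfc == typed)
print("NFKC result:       ", [f"U+{ord(c):04X}" for c in nfkc])
print("NFC of 11xx jamos: ", [f"U+{ord(c):04X}" for c in composed])
print("round trip == typed:", back == typed)
```

Here the round trip does not return what was typed: two compat jamos
go in and one composed syllable comes out.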

If you hold strongly to such a "same look" principle, the other choice
may be simply to allow compat hangul jamos without mapping them to
u+11xx. Registration policy would then deal with the remaining problem.

In a ksc5601-faithful x-terminal, such compat jamo sequences are
the only input method for typing hangul jamos, due to the
above-mentioned ksc5601->unicode mapping table.
Again: jamo sequences are legitimate by definition and have
user demand.

> in some cases,
> perhaps characters that can't even be rendered.   Perhaps I
> don't understand.

The extended hangul syllables can be displayed in CP949 (Windows)
or UTF-8, but cannot be displayed in KSC5601 (Linux).
Applications can determine whether their OS supports CP949/UTF-8
display.


> 
> > Hangul Half-Width Jamo ( in Range: FFA0-FFDC)
> >   Ordinary Korean users seldom type in these Jamos in Windows,
> >   AFAIK.  So the need of these characters in label inputs is
> >   questionable.  NFKC maps these characters into Hangul Jamo
> >   Range 1100-11FF.  But NFC does not.
> >     FFA0 === 3164 === U+1160 : compatibility equivalence for hangul filler
> >     FFA1 === 3131 === U+1100 : compatibility equivalence for initial KI-EOK
> >     and so on.
> > 
> > U+3164, U+1160, U+FFA0 Hangul Filler:
> >  U+3164, U+1160 are displayed as blank space 
> >   in Windows.
> >  U+FFA0  Half-width Hangul Filler is displayed
> >   as bold-faced middle dot in Windows.
> >  Need cautions in displaying these characters.
> 
> No, need to prohibit them entirely, under the "no spaces and no
> punctuation" principle, as too risky unless there is compelling
> reason for their inclusion.

They are not spaces; by definition, they are just vowels that have no
glyph. So U+FFA0 is not displayed as white space in Windows.

I think this is a font glyph problem.
The Unicode code charts suggest that the filler should have a glyph
looking like "[HF]", but Windows displays them as white space.

This can be classified as a font implementation issue.
Should a widespread font implementation problem affect protocol design?
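
For reference, the character properties can be listed with Python's
unicodedata module, as sketched below. The results depend on the
Unicode version shipped with the Python in use, and they say nothing
about how any particular font draws these characters.

```python
import unicodedata

# The three hangul fillers discussed above.
fillers = ["\u3164", "\u1160", "\uFFA0"]

report = []
for ch in fillers:
    report.append({
        "code": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        # General category: "Lo" is a letter, "Zs" would be a space.
        "category": unicodedata.category(ch),
        "isspace": ch.isspace(),
        # NFKC folds the compat and half-width fillers onto one code point.
        "nfkc": f"U+{ord(unicodedata.normalize('NFKC', ch)):04X}",
    })

for row in report:
    print(row)
```

All three are classified as letters rather than spaces, and NFKC maps
U+3164 and U+FFA0 to U+1160.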

> 
> > Both initial consonant U+1100 and 
> >  its final consonant correspondent U+11A8  
> >  are displayed in the exactly same glyph and margin in Windows.
> >  And so forth for other consonants.
> >   Need cautions in registering and displaying these characters.
> 
> regards,
>    john
> 
> _______________________________________________
> Idna-update mailing list
> Idna-update at alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/idna-update

