Hangul jamo issues

John C Klensin klensin at jck.com
Tue Jan 2 20:07:22 CET 2007



--On Tuesday, 02 January, 2007 23:30 +0900 Soobok Lee
<lsb at lsb.org> wrote:

> On Tue, Jan 02, 2007 at 07:28:40AM -0500, John C Klensin wrote:
>> 
>> > This is the issue list for Hangul jamos:
>> > 
>> > Hangul Jamo (in range 1100-11FF)
>> >   These should be available as input and allowed in labels,
>> >   to make jamo-only labels and labels of archaic Hangul
>> > syllables.  We already have registrations using
>> > these characters in .COM IDNs.  
>> 
>> For a number of reasons, and with no disrespect to
>> registrations in COM, what is permitted under the JET tables
>> and associated rules (assuming they exist) for .KR?   Looking
>> at your slightly later note, I find it interesting that
>> neither we nor, as far as I know, ICANN have heard any
>> requests for changes in this area from NIDA, despite a great
>> many comments from NIDA or the government on other
>> IDN-related areas.
> 
> NIDA (the old KRNIC) suggested a registration
> guideline/policy for IDN, but it was the first version, and
> the plan was to include more code points for the Korean
> language next time.
> 
> For example, NIDA did not include any CJK code points in that
> Korean table.  No need for CJK characters in the Korean
> language at all?  No way!  Their inclusion was just postponed
> to avoid the TC/SC unification issue.  I was present at the
> NIDA WG for IDN.

First of all and to emphasize the point I was trying to make,
"no need for CJK in Korean IDNs" would not imply, even if it
were true and permanent, "no need for CJK in writing Korean
language".  The questions are separate and, without predicting
what NIDA will decide to do, if substantially every name that
exists in Hangul also exists in Chinese-derived characters, a
strong argument could be made for confusion-avoidance by
prohibiting CJK registrations.  If one does not prohibit them,
then one might well want a variant model that bound the CJK
string for a given name together with the Hangul one, and I'd
imagine that might be hard to implement.

I have no way to know whether it is true, but it has been widely
reported that CJK characters have been completely eliminated
from the writing system for the Korean language in the North, so
we might assume it is possible to do without them. 

I understand that I'm oversimplifying a complex situation here
and that I definitely don't understand all of the issues.   But
I think these are precisely the types of decisions that we, and
the relevant registries, need to make... and need to make
conservatively and with an understanding of what is reasonably
necessary for DNS naming, rather than assuming that "used in the
usual writing system for the language" inherently equals "needed
in IDNs and the DNS".

> Every Korean new-born baby is given both a Hangul name and a
> CJK name by its parents.  Korean domestic law requires that a
> CJK name be registered for every new-born baby.  Every adult
> Korean has his/her CJK name printed on his/her Residence ID
> Card.

Ok.  From my perspective, and remembering the general philosophy
of the JET work, this is a stronger argument for prohibiting CJK
registrations in Korea, and certainly for prohibiting mixed
Hangul-CJK strings, than it is an argument for requiring
support for many or all combinations.

> Moreover, the future Stringprep200x is not only for IDNAbis,
> but also for other applications like SASL.  We need a more
> inclusive Stringprep200x.

Some parsimony in naming might benefit SASLprep (and other
Stringprep profiles) as well.  Some of the issues are the same,
and they align with (at least) the philosophy of the UTC "secure
identifiers" concept: the ability to write a word or string in
the relevant language does not make it a good identifier, and
reducing potential confusion in identifier matching is generally
A Good Thing.  So I don't think that pointing out what is done
in a writing system is, by itself, justification for arguing for
more inclusion in Stringprep.  Second, while we have been
assuming that IDNA200x and SASLprep200x will use essentially the
same profile of Stringprep, that is not a hard requirement: if
the needs are different and the differences are important (and
we can explain why), then we might end up with different
profiles.

 
>> To repeat what has been said in other areas, the fact that a
>> sequence is legitimate in some present or past use of the
>> language, or that it would be comprehensible if used in a
>> name, does not imply a "right" to have it included in the
>> DNS.  We should be careful about excluding it.   But we
>> should also not assume that, because it is possible and
>> sources of conflicts cannot easily be identified, permitting
>> it is a good idea.
>
> Hangul jamo sequences have been legitimate _by definition
> and by tradition_.  No room for debate!
> 
> Some confusable combinations of jamo sequences - as described
> below - should be managed by registration policies.

This has never been the question.  The questions, at least as I
have understood them, lie in whether, for example, one needs to
accommodate both jamo and Hangul syllables in the DNS.   If the
answer is "no", i.e., that it is possible to pick one or the
other and stick with it, then the potential for confusion is
reduced.  If it is necessary to permit both, and it is possible
to write a given language string as either a sequence of
syllable code points or as a sequence of Jamo code points, then
I believe there is a matching problem that, ideally, should be
solved by normalization (i.e., in the protocol and at all levels
of the DNS) and not by registration restrictions alone.  In my
ignorance, I have understood that this is not especially easy,
so I am pressing on the question of whether we can pick one or
the other.
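To make the syllable-vs-jamo matching problem concrete, here is a small sketch in Python using only the standard unicodedata module (the particular syllable is just an illustration):

```python
import unicodedata

# One Korean syllable written two ways: as a sequence of conjoining
# jamo (U+1100 block) and as a single precomposed syllable.
jamo_form = "\u1112\u1161\u11AB"   # CHOSEONG HIEUH + JUNGSEONG A + JONGSEONG NIEUN
syllable_form = "\uD55C"           # HANGUL SYLLABLE HAN

# As raw code point sequences, the two strings do not match...
assert jamo_form != syllable_form

# ...but NFC composes the conjoining jamo into the precomposed
# syllable, so the two forms become identical after normalization.
assert unicodedata.normalize("NFC", jamo_form) == syllable_form
```

Note that this only works where a precomposed syllable exists; archaic jamo combinations outside the modern syllable repertoire remain jamo sequences even after NFC, which is one reason the matching problem is not especially easy.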

> I see. Then why not include _selectively_ the compat mappings
> of NFKC into future Stringprep200x ?

Simpler rules are better and lead to fewer problems.  So my
answer to a question stated as "why not" is "why at all?" and
"is there really a compelling need to do this?"  In the
particular case of IDNAbis, and remembering that any character
that is mapped to another one is not represented in the DNS at
all, I'd like all of us to understand what value accepting these
compatibility characters and then mapping them away in the
protocol adds to the IDN/DNS environment.
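For reference, the mapping under discussion can be observed directly with Python's unicodedata module; the two compatibility jamo below are the ones cited later in this thread:

```python
import unicodedata

# U+3131 HANGUL LETTER KIYEOK is a "compatibility jamo" -- what Korean
# keyboards typically produce.  Its compatibility decomposition points
# at U+1100 HANGUL CHOSEONG KIYEOK, the conjoining jamo.
assert unicodedata.normalize("NFC", "\u3131") == "\u3131"    # NFC: unchanged
assert unicodedata.normalize("NFKC", "\u3131") == "\u1100"   # NFKC: mapped away

# Likewise U+3164 HANGUL FILLER maps to U+1160 HANGUL JUNGSEONG FILLER
# under NFKC but not under NFC.
assert unicodedata.normalize("NFKC", "\u3164") == "\u1160"
```

After an NFKC-based Stringprep step, U+3131 and U+3164 can therefore never appear in output from the DNS, which is exactly the trade-off at issue.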

> The compatibility mappings of (NFKC - NFC) should be selectively
> included into Stringprep200x of IDNAbis under the criterion of
> a "same glyph" rule: that is, if a compatibility mapping
> produces the same glyph and the same number of characters as
> the input character, that mapping should be included in
> Stringprep200x.  "Circled a" and ligatures can be excluded.

The counter-argument --and I want to stress that this applies to
many, many scripts other than Hangul-- is that, if these
characters are mapped away, then they cannot appear in output
from the DNS.  Their assignment to separate code points is not
intrinsic to the characters or glyphs, it is an artifact of how
Unicode (and some other CCSs) are organized.  I'd like to
believe that, had Unicode been organized strictly for DNS
purposes, the additional code points would not be there at all
(of course, it has to serve broader purposes, so their presence
is presumably entirely appropriate).   So I can imagine "if you
actually see these compatibility characters on input, map them"
being very good advice for a UI-writer, or even for an Operating
System input driver, but I remain convinced that we should keep
the permitted inputs to IDNA itself as close as possible to what
IDNA can produce (i.e., to ToUnicode(ToASCII(string))).
  
> Some compatibility mappings that don't cause glyph changes,
> like the ones above, have the same importance as case folding,
> which does cause glyph changes.

I don't know how to evaluate "importance".   Case-folding in
IDNs is a very specific compatibility issue with traditional DNS
mapping rules and has little to do with Unicode or compatibility
characters.

>> >   Ordinary Korean users can type in only these Compatibility
>> >   Jamos and cannot type directly those in 1100-11FF (in
>> >   Windows).  NFKC does this mapping (and composing), but NFC
>> >   does not.
>> >     3164 === U+1160 : compatibility equivalence for hangul filler
>> >     3131 === U+1100 : compatibility equivalence for initial KI-EOK
>> >   and so on.
>> 
>> In general, any time one starts talking about what users can
>> type, one is out at the UI level of abstraction.   In other
>> words, what people can type and, more important, what shows up
>> at the interface to an application after they type it, is the
>> consequence of user interface and operating system design
>> issues, not something that should be compelling as part of
>> application protocol design. 
> 
> No, that does not come from a UI/OS design decision, but from
> the code conversion table between KSC5601 and Unicode itself.
> The tables are on the Unicode Consortium homepage.

Ok.  See above.

>...
> I think this is a font glyph problem.
> The Unicode code charts suggest that the filler should have a
> glyph looking like "[HF]", but Windows displays it as white space.
> 
> This can be classified as a font implementation issue.
> Can a widespread font implementation problem affect protocol
> design?

The need for interoperability, and unique, unambiguous, global
references in the DNS, strongly suggests that, if several
different localized systems cannot figure out how something
should be represented, we should see if we can do without it and
prohibit it.

regards,
   john
