KATS (Korean Agency for Technology and Standards)'s Comments on the Unicode Codepoints and IDNA Internet-Draft

John C Klensin klensin at jck.com
Fri Oct 31 15:06:02 CET 2008


Martin,

It seems to me that your analogy would be useful if there were a
second language that used Hangul syllables but that used Jamo
differently or used a different subset of Jamo.  But that is not
the case; the situation is one of reading the Korean language in
Korean characters. 

If Unicode had combining Han radicals, we would have a much more
relevant analogy of the type you are trying to make.  Instead,
we have an ideographic description language.  And we disallow
both the ideographic description characters and the stand-alone
radicals.

Certainly we know that, across scripts or among different
languages that use the same script in different ways,
familiarity with the script in context is important to what is
confusable in practice.  As I have mentioned to some people in
this group before, I've managed to confuse a short string in
Thai (in a stylized font) with a Latin-character string.
Granted I was sleepy, but I also didn't realize what I was
looking at and made the wrong association.  I came away from
that incident, and others, with strong impressions about the
subjectivity of confusability in practice, impressions that have
led me to oppose excluding characters 

	* on the basis of subjective confusability alone,
	inter-script or between languages sharing the same script
	
	* when the decision would harm one language for the
	benefit of another.  I note that no one (I hope) would
	propose to disallow combining double acute accent because
	a possibly-sleepy German reader might be confused when
	looking at a Hungarian string.

I am also not sure whether analogies between Jamo and variations
on diacritical marks are really useful.  Picking up from
Michael's observation, in retrospect we would have been better
off (at least for this type of application) had Unicode been
built with a firm rule against either precomposed or combining
characters -- one or the other, but not a mix.  One could
debate, as a theoretical issue, whether either choice would have
been practical, but the mixture obviously creates many issues
that would not exist with either "pure" system.

From my naive point of view after having followed these
discussions and read several iterations of the Korean notes, the
request to disallow the Jamo seems to me to be very similar to
the strong language used in UTS to discourage any but
descriptive use of ideographic description characters and with
the principle (independent of edge-case details in NFC)
underlying canonical combining normalization.   There is no
cross-language intra-script issue, there is no separate
historical script, there is just an effort to clearly and
cleanly ban the use of the syllable composition characters and
allow the precomposed syllables only.

The fact that many, but not all, of the possible combinations
are effectively disallowed by our NFC rule strengthens my
position: one of our goals is to make permitted character
relationships clearer, and expecting people (or even software)
to fully understand the Hangul composition and decomposition
rules in order to understand what is permitted seems much less
desirable to me than a simple "Jamo are not permitted" rule.
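
To illustrate the difference in complexity, here is a minimal sketch
in Python using the standard unicodedata module.  It is not taken
from any draft: the function names are hypothetical, and it assumes
that "Jamo" means the conjoining Hangul Jamo block U+1100..U+11FF.
It shows NFC composing a modern leading-consonant-plus-vowel pair
into a precomposed syllable while leaving a sequence that starts
with an archaic consonant untouched, and it contrasts that with a
flat range check.

```python
import unicodedata

# Assumption: "Jamo" = the conjoining Hangul Jamo block U+1100..U+11FF.
JAMO_FIRST, JAMO_LAST = 0x1100, 0x11FF


def contains_jamo(label: str) -> bool:
    """The simple rule: reject any label containing a conjoining Jamo."""
    return any(JAMO_FIRST <= ord(ch) <= JAMO_LAST for ch in label)


def jamo_left_after_nfc(label: str) -> bool:
    """True if Jamo are still present once the label is in NFC."""
    return contains_jamo(unicodedata.normalize("NFC", label))


# Modern pair: CHOSEONG KIYEOK + JUNGSEONG A composes to U+AC00.
modern = "\u1100\u1161"
# Archaic lead: CHOSEONG PANSIOS + JUNGSEONG A has no precomposed form.
archaic = "\u1140\u1161"

for name, label in (("modern", modern), ("archaic", archaic)):
    nfc = unicodedata.normalize("NFC", label)
    print(name, [f"U+{ord(ch):04X}" for ch in nfc], jamo_left_after_nfc(label))
```

With the NFC rule alone, knowing whether a given Jamo sequence is
permitted requires knowing which sequences compose; with the range
check, the answer does not depend on the composition rules at all.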

Finally, while we associated terms like "phishing" with our
"leave it to the registries" principle, I believe that the
Korean case for exclusion of Jamo has been clearly made and that
we should not attempt to fault (or punish) the rather
considerable effort that has been made here because they didn't
understand the idiosyncrasies of the vocabulary used in the WG.

regards,
   john


--On Friday, 31 October, 2008 19:18 +0900 Martin Duerst
<duerst at it.aoyama.ac.jp> wrote:

> Dear Mr. Kim,
> 
> Many thanks for this document. It is very helpful in that it
> contains some new arguments re. Hangul Jamos. What it
> essentially says is that some historically used Hangul letters
> look too similar to different modern letters to be
> distinguished by the modern user.
> 
> To give one equivalent for Latin, this is as if there were,
> historically, two versions of E, one with a shorter middle
> bar, and another with a middle bar of the same length as the
> top or bottom bar. A modern reader wouldn't distinguish between
> the two because s/he wouldn't (at least not actively) remember
> the existence of the historic difference.
>...


