KATS (Korean Agency for Technology and Standards)'s Comments on the Unicode Codepoints and IDNA Internet-Draft

John C Klensin klensin at jck.com
Fri Oct 31 18:29:06 CET 2008



--On Friday, 31 October, 2008 12:01 -0400 Andrew Sullivan
<ajs at shinkuro.com> wrote:

> Dear colleagues,
> 
> On 31-Oct-08, at 8:51 AM, Michael Everson wrote:
>> 
>> that train has long since left the station. Since the 11K
>> syllables are sufficient (burden of proof on those who
>> believe otherwise) they and they alone should be permitted in
>> Korean IDN.
> 
> I believe I understand the example in this case, and I believe
> I understand the KATS statement as well.  But none of these
> appear to me to have answered the fundamental question I had
> before, which is why these exceptions should be in the
> _protocol_.  They sound to me like policy.
> 
> Even if only the 11k syllables "alone should be permitted in
> Korean IDN", it does not follow that any of the codepoints
> that are the subject of this discussion should be excluded at
> the protocol level.
> 
> I am particularly uneasy with arguments that depend either on
> confusability or on the way that one could have encoded these
> characters in Unicode, if one ran the circus.  This working
> group explicitly ruled the first of those premises out in
> its charter.  The working group's dependence on properties
> explicitly requires that we accept the Unicode definitions.
> Even if we think things should be another way, we're not
> here to specify The Right Way to encode the writing system
> of a language.  We're here to "internationalize LDH".
>...

Andrew,

I think I'm finally beginning to understand where we have a
disconnect.

We have never agreed that we need to slavishly follow the
Unicode properties and code point assignments.  Some of us would
argue that doing so was part of what got IDNA2003 in trouble
but, even ignoring that, those properties were designed to meet
a large range of needs.  While they are generally useful, there
are a number of characteristics of IDNs that make them unusual
as compared to most applications of Unicode (e.g., neither very
short strings without language identification nor short strings
that mix languages are common in ordinary blocks of text).  

We have been selecting the particular properties we use and how
we combine them.  There could have been other choices, some of
which we considered and rejected.  Given the differences in
requirements for IDNs versus running text, that is reasonable
and natural.  While the results of our rules are different from
Unicode's recommendations for identifiers for use in contexts
like programming languages, they aren't very different.  Neither
the similarities nor the fact that there are differences should
be a surprise given the differences between the applications.

We are trying to "internationalize LDH".  We are trying to do so
in a way that makes DNS labels unambiguous, an issue to which
you are presumably particularly sensitive.  (Here, and below, I
am using "ambiguity" to describe the very specific case in which
two Unicode labels that would be considered equal by experts on
the script do not compare equal because they map to different
A-labels.  Note that this is "experts on the script", not "experts on
the language" (e.g., I'm not talking about spelling differences)
nor is it a matter of subjective confusability to either those
who are familiar with the script or those who are not.)   
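
To make that definition concrete, here is a minimal sketch in Python.
It uses a Latin example rather than Hangul, and the helper name
to_a_label is hypothetical: it only applies Punycode plus the "xn--"
prefix and deliberately skips normalization and every other IDNA
check, so it is an illustration of the problem and not an IDNA
implementation.

```python
import unicodedata


def to_a_label(u_label: str) -> str:
    """Punycode-encode one label and add the ACE prefix.

    Hypothetical helper: no normalization, mapping, or validity
    checks are performed, on purpose.
    """
    return "xn--" + u_label.encode("punycode").decode("ascii")


precomposed = "caf\u00e9"   # ends in U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# The two strings are the same text once NFC is applied ...
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

# ... but without normalization they yield different encoded labels,
# which is the kind of "ambiguity" described above.
different_a_labels = to_a_label(precomposed) != to_a_label(decomposed)

print(same_after_nfc, different_a_labels)
```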

Part of what is at issue when we talk about "internationalizing
LDH" is the question of what, actually, constitutes a "letter"
for our purposes -- a conclusion that might be different from
the Unicode one because our needs are slightly different.  From
the perspective of the Korean experts, at least as I understand
it, the Jamo should not be treated as "letters", but as "parts
of letters".  The writing system is primarily syllabic, not
alphabetic in the traditional western sense.  Unicode has
recognized that by encoding the syllables, or at least 11
thousand or so of them (by contrast with, e.g., Latin or
Greek-derived scripts, where syllables are definitely not
encoded, although they clearly exist).  Given a basically
syllabic writing system,  Jamo aren't "letters" as we understand
the term in "internationalize LDH" but combining components of
letters, no more suitable for standing alone in a label than a
stray combining diaeresis would be (and I note that we have
special rules to prevent, e.g., leading combining characters in
labels).
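
As a rough illustration of that comparison, the sketch below (Python,
standard unicodedata module) shows that the Unicode General Category
does not by itself separate Jamo from syllables, and what a "no
leading combining mark" check might look like.  The function names
are hypothetical, and the Jamo test is a plain block-range check
chosen for brevity, not the rule text of any draft.

```python
import unicodedata


def starts_with_combining_mark(label: str) -> bool:
    """True if the first code point has a General Category of M*."""
    return bool(label) and unicodedata.category(label[0]).startswith("M")


def is_conjoining_jamo(ch: str) -> bool:
    """True for code points in the Hangul Jamo block, U+1100..U+11FF.

    Assumption: a block-range test stands in for a property-based one.
    """
    return 0x1100 <= ord(ch) <= 0x11FF


jamo_category = unicodedata.category("\u1100")       # HANGUL CHOSEONG KIYEOK
syllable_category = unicodedata.category("\uac00")   # HANGUL SYLLABLE GA
diaeresis_category = unicodedata.category("\u0308")  # COMBINING DIAERESIS

# Jamo and precomposed syllables share one category; the diaeresis
# is a nonspacing mark and is caught by the leading-mark check.
print(jamo_category, syllable_category, diaeresis_category)
print(starts_with_combining_mark("\u0308a"))
```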

That is not about policy, in the sense in which you are using
that term, but about what is, and is not, properly a "letter"
from an IDN perspective.   If I were a registry handling Korean,
I can imagine not permitting all eleven thousand syllables but,
instead, excluding some of them.  That would be a policy
question as you are using the term and clearly appropriate as a
registry decision.

There is another thing that makes this confusing (and which got
me very confused early in this thread).  In general, we depend
on the requirement for NFC normalization to ensure that
different representations of a character (in particular, a
precomposed form and one or more forms built up from components)
do not cause different A-label encodings and hence ambiguous
comparisons.  That is an appropriate thing to do and, in
general, it works.  My initial assumption, many weeks ago and
based on my still-imperfect understanding of Hangul, was that it
did not always work.   Mark and Ken corrected me and asserted
that it would work, and work always.  We have seen what appear to
me to be convincing counterexamples to that assertion.   If we
cannot depend on NFC to keep _all_ syllable forms unambiguous,
then it seems to me that we need to supplement NFC with rules
that prevent ambiguity.   As I understand it, those rules could
take two forms.  One is the current KATS proposal, which
excludes all of the Jamo, essentially because they are parts of
letters and not letters.  The other would be to go through the
Jamo, either character by character or in terms of the
HangulSyllableType table and exclude the specific ones that are
problematic with regard to normalization.  
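
A small sketch of both behaviours, using Python's unicodedata module.
The doubled-consonant case is my own choice of illustration; whether
it matches the counterexamples posted earlier in this thread is an
assumption.  The point it shows is only that NFC composes a leading
consonant plus vowel into a syllable, but does not treat two single
Jamo as equivalent to the corresponding double Jamo.

```python
import unicodedata


def nfc(s: str) -> str:
    """Apply Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", s)


# Case where NFC does what we rely on: KIYEOK + A composes to the
# precomposed syllable GA (U+AC00).
composed = nfc("\u1100\u1161")

# Case where it does not: SSANGKIYEOK + A composes to one syllable,
# but KIYEOK + KIYEOK + A leaves a stray leading Jamo in front of GA.
single_jamo = nfc("\u1101\u1161")
doubled_jamo = nfc("\u1100\u1100\u1161")

for name, s in [("composed", composed),
                ("single_jamo", single_jamo),
                ("doubled_jamo", doubled_jamo)]:
    print(name, [f"U+{ord(c):04X}" for c in s])
```

Excluding the Jamo from labels would make the second spelling invalid
outright, so the two forms could not coexist as distinct labels.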

I prefer the first because it is much simpler and easier to
explain to those who are not expert in Hangul and because it
keeps the presentation of the exception lists shorter.  However,
if one wanted to adhere narrowly to the principle for which I
think you are arguing, one would pick the second course of
action and then leave excluding the _rest_ of the Jamo to
registry action (if desired).   In either case, we end up taking
some group of characters (which can be identified by properties
rather than block, but the result is the same) and excluding
them as non-letters, despite the fact that Unicode gives them a
first-level category of "Lo".

Disclaimer and note:  Nothing above claims that Unicode is
"wrong", nor does any of it depend on visual confusability by
people who are more or less familiar with the script.  We just
don't need to get anywhere near either.   And, as I said in my
earlier note, I believe that KATS and NIDA have rather
thoroughly explained the issues by now. Even if the vocabulary
they have used in some of their notes can be construed in a way
that would make this into a policy issue, we are not running
an examination on whether they understand our rather specific
terminology and its implications.  We are trying to determine
what is appropriately a "letter" for "internationalize LDH"
purposes.  I believe that the case that Jamo are not
IDN-appropriate letters has been made and made persuasively.

    john


