Comments on the Unicode Codepoints and IDNA Internet-Draft

Kenneth Whistler kenw at sybase.com
Thu Jul 31 00:30:10 CEST 2008


> One simply cannot have, e.g., "a with a ring above it" and "a
> with a combining ring above" treated as not equal and still have
> an identifier system, especially since a variety of systems and
> operating environments outside the DNS freely map them back and
> forth without asking.

> Coming back to Hangul,...

One simply cannot have, e.g., U+AC00 HANGUL SYLLABLE GA and 
<U+1100, U+1161> KIYEOK + A treated as not equal and still have
an identifier system, especially since a variety of systems and
operating environments outside the DNS freely map them back and
forth without asking.

> it seems to me that the question is
> exactly the one Edmon suggests: if the Jamo are really combining
> objects, used to build up characters and with the possibility of
> representing the same character in either precomposed form or as
> a sequence of Jamo,

They are.

> then we need to ban (at the protocol level)
> either the Jamo or the precomposed forms.

And that is already accomplished in the protocol by the exact
same means that the protocol bans an intolerable distinction
between "a with ring above it" and "a with a combining ring above" --
by the definition of Unicode Normalization Form C, which treats
both of these sets of equivalences as *canonical* equivalences.

protocol-03.txt, Section 4.2:

"That string MUST be in Unicode Normalization Form C (NFC ...)".

For any of the repertoire of 11172 syllables that the Korean NIC wants
to allow for registration, the problem is already solved.
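
To make that concrete, here is a minimal sketch using Python's standard
unicodedata module (my choice of tool for illustration, nothing the draft
prescribes; the variable and helper names are made up). It checks that NFC
folds both the Latin pair and the Hangul pair onto the precomposed form, so
the jamo sequence itself fails the "MUST be in NFC" test:

```python
import unicodedata

# Latin: "a" + COMBINING RING ABOVE vs. precomposed U+00E5
latin_sequence = "\u0061\u030A"
latin_precomposed = "\u00E5"

# Hangul: CHOSEONG KIYEOK + JUNGSEONG A vs. U+AC00 HANGUL SYLLABLE GA
hangul_sequence = "\u1100\u1161"
hangul_precomposed = "\uAC00"


def nfc(s):
    """Return s in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", s)


def is_nfc(s):
    """True if s is already in NFC, i.e. normalization leaves it unchanged."""
    return nfc(s) == s


print(nfc(latin_sequence) == latin_precomposed)    # True
print(nfc(hangul_sequence) == hangul_precomposed)  # True
print(is_nfc(hangul_sequence))                     # False
```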

>    "Archaic
> characters" and comments about Cuneiform have nothing to do with
> this except insofar and some characters weren't coded into
> Unicode in precomposed form because they are not in contemporary
> use so, if one wants to have those characters, one has to have
> the Jamo.

Not "because they are not in contemporary use..." The situation
is more complex. The Hangul syllables in contemporary use number
much closer to the 2350 originally encoded in Unicode on the
basis of the mapping to the then-current Korean standard, KS C 5601.
The 11172 syllables eventually encoded are a completion set based
on all possible combinations of the modern jamo letters, and
contain *many* thousands of syllables that aren't actually
in modern use for Korean at all.
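
For reference, the 11172 figure is just the product of the modern jamo
inventories. A small sketch in Python, using the constants of the Hangul
syllable composition arithmetic in the Unicode Standard (the function and
variable names here are only illustrative):

```python
import unicodedata

# Base code points and counts for the modern conjoining jamos.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT = 19   # leading consonants
V_COUNT = 21   # vowels
T_COUNT = 28   # trailing consonants, counting "none" as index 0

S_COUNT = L_COUNT * V_COUNT * T_COUNT
print(S_COUNT)  # 11172


def compose(l_index, v_index, t_index=0):
    """Map jamo indices to the code point of the precomposed syllable."""
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


# Cross-check against NFC: HIEUH (L index 18) + A (V index 0) + NIEUN (T index 4)
jamo_sequence = chr(L_BASE + 18) + chr(V_BASE + 0) + chr(T_BASE + 4)
print(unicodedata.normalize("NFC", jamo_sequence) == compose(18, 0, 4))  # True
```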

Furthermore, in addition to Old Hangul syllables that cannot
be represented with just the modern jamo letters in sets of
3 in sequence, you have to take into account the occasional
usage in Korean of jamos the way katakana is used in Japanese,
to represent spellings of foreign word borrowings that may contain
syllables not present in the modern Korean language. The Korean
NIC may decide it doesn't want to deal with or register any
such sequences -- as I said, that is fine. But I don't consider
it the business of the protocol definition (or this group)
to declare them invalid in principle for domain names.

> One also doesn't want to have a character built out
> of some combination of Jamo floating around when a precomposed
> form of that character is added with Unicode 20.2.

That is a red herring, because such issues (as for any
proposed addition of a precomposed sequence already
representable by a sequence of characters encoded in
the standard) would be dealt with by the Unicode normalization
requirements, as listed above:

protocol-03.txt, Section 4.2:

"That string MUST be in Unicode Normalization Form C (NFC ...)".

>  The fact
> that most combinations of most of the Jamo map out with NFC or
> NFKC and are therefore prohibited by IDNA2008 (and
> IDNA2003)actually supports for disallowing them to make that
> prohibition complete.

I reach precisely the opposite conclusion.

If you reach the conclusion that Korean jamos should be
disallowed this way, you could draw exactly 
the same inference for combining marks that are used to form new 
Latin letters. Trying to disallow conjoining jamos in the *protocol* has
the same kind of flaws that trying to disallow combining diacritical
marks for letters does.
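
A sketch of the parallel, again in Python with the standard unicodedata
module (the two sample sequences are my own picks for illustration). Neither
sequence has a precomposed counterpart, so both are already in NFC and both
rely on the combining or conjoining characters being available:

```python
import unicodedata


def is_nfc(s):
    """True if NFC normalization leaves s unchanged."""
    return unicodedata.normalize("NFC", s) == s


# Latin "q" + COMBINING RING ABOVE: no precomposed letter exists,
# so the combining mark survives normalization.
latin_no_precomposed = "q\u030A"

# U+1140 HANGUL CHOSEONG PANSIOS is an Old Hangul leading consonant;
# with U+1161 JUNGSEONG A it forms a syllable that has no precomposed form.
old_hangul_sequence = "\u1140\u1161"

print(is_nfc(latin_no_precomposed))  # True
print(is_nfc(old_hangul_sequence))   # True
```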

This is tantamount to starting with a completely generic solution
based on general category values and the notion of character
combination, but then getting stuck in the rathole of trying to
disallow particular combining marks and particular combinations
of them based on some particular group's assertion that
"we don't use those combinations." That simply isn't the appropriate
level for the protocol to be concerned with, nor is it appropriate
for this group to be attempting to make language-by-language and
script-by-script exceptions for which such combinations the
protocol should allow and which it should disallow.

This is particularly inappropriate, given that the whole system 
is devised so that registries can impose their own limits on what
repertoire they want to allow.
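
As a purely hypothetical sketch of what such a registry-level limit could
look like (this is not any registry's actual policy, and registry_accepts is
a made-up name), a registry that wanted only the 11172 precomposed syllables
could layer a repertoire check on top of the protocol's NFC requirement:

```python
import unicodedata

# Range of the 11172 precomposed modern Hangul syllables.
FIRST_SYLLABLE, LAST_SYLLABLE = 0xAC00, 0xD7A3


def registry_accepts(label):
    """Hypothetical registry policy: NFC labels of precomposed syllables only."""
    if not label or unicodedata.normalize("NFC", label) != label:
        return False  # protocol-level requirement: the label must be in NFC
    # Registry-level choice: restrict the repertoire further.
    return all(FIRST_SYLLABLE <= ord(ch) <= LAST_SYLLABLE for ch in label)


print(registry_accepts("\uD55C\uAE00"))  # True: two precomposed syllables
print(registry_accepts("\u1100\u1161"))  # False: not in NFC
print(registry_accepts("\u1140\u1161"))  # False: NFC, but outside this registry's repertoire
```

The third case is the one at issue: the protocol leaves it valid, and the
registry is free to refuse it.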

This would also be opening the maintenance of the IDNA
table to the potential future instability of having
to add any *more* Old Hangul conjoining jamo characters that
might end up added to Unicode in the future to your exception
list, just to stay consistent.

> IMO, Edmon is also correct, IMO, about Chinese radicals.  There
> are no combining radicals in Unicode.

Correct.

>  If there were, we would
> presumably be having exactly they same discussion about allowing
> some or all of them to permit constructing obscure or obsolete
> characters or DISALLOWing the lot.

No, we would not, because the problem, despite the apparent
similarity, is radically different. ;-)

Hangul, by *design*, consists of exactly three slots, with a
limited set of elements. Those elements are phonetic "letters"
(the jamos), which instead of being written out linearly, are
stacked in groups of three, in syllable blocks matching the
syllabic structure of the Korean language. Hangul is the
effective equivalent of writing "big tin can man" with Latin
letters in blocks:

  b i  t i  c a  m a
   g    n    n    n
   
By *design* in the standard, the equivalences between the preformed
syllables and the conjoining jamos were a matter of canonical
equivalence.
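
The slots can be seen directly by decomposing a syllable. A minimal sketch
in Python with the standard unicodedata module (the sample syllable is my
own choice):

```python
import unicodedata

syllable = "\uD55C"  # HANGUL SYLLABLE HAN

# NFD splits the block into its conjoining jamos, one per slot.
jamos = unicodedata.normalize("NFD", syllable)
for ch in jamos:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+1112 HANGUL CHOSEONG HIEUH
# U+1161 HANGUL JUNGSEONG A
# U+11AB HANGUL JONGSEONG NIEUN

# Canonical equivalence: NFC puts the block back together.
print(unicodedata.normalize("NFC", jamos) == syllable)  # True
```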

You would never get anywhere with trying the same analysis
for Han characters. That system grew over millennia
by reanalysis of pictographs and ideographs and extensions
based on radical disambiguations of homophonous words.
The radicals themselves are often ad hoc, and there is no universally
agreed-upon set of them. Many components are not separately
encoded. And *nobody* has ever succeeded in creating a
generative system of radicals and components that would work
as a practical system for text processing of Chinese and
Japanese (and other languages occasionally using Han characters).
Furthermore, there is *zero* chance that even if such a system were
built on top of Unicode that equivalences of this sort would
be incorporated into Unicode normalization as *canonical*
(or even *compatibility*) equivalences.

Hangul Syllables and Han characters are so very different that 
any argument from analogy between them is simply not valid.
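
The difference shows up in the normalization data itself. A small sketch in
Python with the standard unicodedata module (the sample characters are my
own picks): a Hangul syllable decomposes into jamos, while a unified Han
ideograph is left alone by both the canonical and the compatibility forms.

```python
import unicodedata

hangul = "\uD55C"  # HANGUL SYLLABLE HAN
han = "\u6F22"     # a CJK unified ideograph

hangul_nfd = unicodedata.normalize("NFD", hangul)
han_nfd = unicodedata.normalize("NFD", han)
han_nfkd = unicodedata.normalize("NFKD", han)

print(len(hangul_nfd))  # 3: leading consonant, vowel, trailing consonant
print(han_nfd == han)   # True: no canonical decomposition into components
print(han_nfkd == han)  # True: no compatibility decomposition either
```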

>  Conversely, the fact that
> combining radicals didn't get codepoints may raise questions, at
> least for IDNA,

Can we please stop revisiting the architecture of Unicode
and ISO/IEC 10646 in this way? The encoding of Han characters
as units is a consensus decision that has been reached for
*every* single East Asian national encoding in use and for
every single commercial character set involving Han characters,
and it is the unanimous opinion among the IRG members who are
responsible for Han character repertoire additions to Unicode
and ISO/IEC 10646.

Continuing to raise the "Maybe Han characters should have
been encoded with combining pieces" question is
simply counterproductive to the task of dealing with
the Unicode Standard as it is and coming to closure on
the table and protocol definition for IDNA 2008.

--Ken

> about why the combining Jamo were (I think I
> know the reason, and it makes sense, but the question is still
> interesting).



