Comments on the Unicode Codepoints and IDNA Internet-Draft

Thu Jul 31 02:12:51 CEST 2008

Looking at the draft and the arguments, it seems like a lot is predicated on
the current definition described in Tables-02:

2.1.2. Unstable: toNFKC(toCaseFolded(toNFKC(cp))) != cp

This does satisfy, to a certain degree, the principal motivation for IDNAbis
as described by John, which is to do away with mapping and maintain the
stability of labels at the protocol level.

However, this discussion suggests to me that we should add one more
requirement: define the same instability test for whole labels, not only
for individual code points:

toNFKC(toCaseFolded(toNFKC(label))) != label

This would effectively make labels for IDN much more stable and better
achieve the principal motivation as John described it.  This should also
allow the specification to scale to future versions of Unicode.
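
As a rough illustration of the two tests, here is a minimal Python sketch.
It assumes that str.casefold() is an acceptable stand-in for toCaseFolded
and that the interpreter's unicodedata tables are the Unicode version of
interest; the names stable_form and is_unstable are mine, not from
Tables-02.

```python
import unicodedata


def stable_form(text: str) -> str:
    """Compute toNFKC(toCaseFolded(toNFKC(text))).

    Assumption: str.casefold() approximates toCaseFolded.
    """
    nfkc = unicodedata.normalize("NFKC", text)
    return unicodedata.normalize("NFKC", nfkc.casefold())


def is_unstable(text: str) -> bool:
    """True if the input changes under the mapping.

    Works for a single code point (the 2.1.2 test) and for a whole
    label (the additional test proposed here).
    """
    return stable_form(text) != text


print(is_unstable("a"))        # False: already folded and normalized
print(is_unstable("A"))        # True: case folds to "a"
print(is_unstable("\uFB01"))   # True: ligature fi maps to "fi" under NFKC
print(is_unstable("example"))  # False: stable label
print(is_unstable("a\u030A"))  # True: composes to U+00E5 under NFKC
```

The last case is the interesting one: each code point can be checked on its
own, but only the label-level test sees that the sequence as a whole
changes under normalization.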

In order to achieve the above, perhaps we could have these characters listed
as CONTEXTJ/O and apply the above criterion for "Unstable Labels".

Edmon

> -----Original Message-----
> From: idna-update-bounces at alvestrand.no
> [mailto:idna-update-bounces at alvestrand.no] On Behalf Of Kenneth Whistler
> Sent: Thursday, July 31, 2008 6:30 AM
> To: klensin at jck.com
> Cc: idna-update at alvestrand.no; kenw at sybase.com
> Subject: RE: Comments on the Unicode Codepoints and IDNA Internet-Draft
> 
> 
> > One simply cannot have, e.g., "a with a ring above it" and "a
> > with a combining ring above" treated as not equal and still have
> > an identifier system, especially since a variety of systems and
> > operating environments outside the DNS freely map them back and
> > forth without asking.
> 
> > Coming back to Hangul,...
> 
> One simply cannot have, e.g., U+AC00 HANGUL SYLLABLE GA and
> <U+1100, U+1161> KIYEOK + A treated as not equal and still have
> an identifier system, especially since a variety of systems and
> operating environments outside the DNS freely map them back and
> forth without asking.
> 
> > it seems to me that the question is
> > exactly the one Edmon suggests: if the Jamo are really combining
> > objects, used to build up characters and with the possibility of
> > representing the same character in either precomposed form or as
> > a sequence of Jamo,
> 
> They are.
> 
> > then we need to ban (at the protocol level)
> > either the Jamo or the precomposed forms.
> 
> And that is already accomplished in the protocol by the exact
> same means that the protocol bans an intolerable distinction
> between "a with ring above it" and "a with a combining ring above" --
> by the definition of Unicode Normalization Form C, which treats
> both of these sets of equivalences as *canonical* equivalences.
> 
> protocol-03.txt, Section 4.2:
> 
> "That string MUST be in Unicode Normalization Form C (NFC ...)".
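
The two equivalences mentioned above can be checked directly. This is a
minimal Python sketch, assuming the interpreter's unicodedata module
reflects the Unicode version of interest; the helper names nfc and is_nfc
are only for illustration.

```python
import unicodedata


def nfc(s: str) -> str:
    """Return the NFC form of s."""
    return unicodedata.normalize("NFC", s)


def is_nfc(s: str) -> bool:
    """True if s is already in NFC, as Section 4.2 requires of a label."""
    return nfc(s) == s


a_ring_precomposed = "\u00E5"   # LATIN SMALL LETTER A WITH RING ABOVE
a_ring_sequence = "a\u030A"     # a + COMBINING RING ABOVE
ga_precomposed = "\uAC00"       # HANGUL SYLLABLE GA
ga_jamo = "\u1100\u1161"        # CHOSEONG KIYEOK + JUNGSEONG A

print(nfc(a_ring_sequence) == a_ring_precomposed)  # True
print(nfc(ga_jamo) == ga_precomposed)              # True
print(is_nfc(ga_jamo))                             # False: rejected by the NFC rule
print(is_nfc(ga_precomposed))                      # True
```

In both cases the decomposed sequence is not in NFC, so a label containing
it fails the Section 4.2 requirement without any Hangul-specific rule.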
> 
> For any of the repertoire of 11172 syllables that the Korean NIC wants
> to allow for registration, the problem is already solved.
> 
> >    "Archaic
> > characters" and comments about Cuneiform have nothing to do with
> > this except insofar as some characters weren't coded into
> > Unicode in precomposed form because they are not in contemporary
> > use so, if one wants to have those characters, one has to have
> > the Jamo.
> 
> Not "because they are not in contemporary use..." The situation
> is more complex. The contemporary use Hangul syllables are
> much closer to the 2350 originally encoded in Unicode on the
> basis of the mapping to the then Korean standard, KS C 5601.
> The 11172 syllables eventually encoded are a completion set based
> on all possible combinations of the modern jamo letters, and
> contain *many* thousands of syllables that aren't actually
> in modern use for Korean at all.
> 
> Furthermore, in addition to Old Hangul syllables that cannot
> be represented with just the modern jamo letters in sets of
> 3 in sequence, you have to take into account the occasional
> usage in Korean of jamos the way katakana is used in Japanese,
> to represent spellings of foreign word borrowings that may contain
> syllables not present in the modern Korean language. The Korean
> NIC may decide it doesn't want to deal with or register any
> such sequences -- as I said, that is fine. But I don't consider
> it the business of the protocol definition (or this group)
> to declare them invalid in principle for domain names.
> 
> > One also doesn't want to have a character built out
> > of some combination of Jamo floating around when a precomposed
> > form of that character is added with Unicode 20.2.
> 
> That is a red herring, because such issues (as for any
> proposed addition of a precomposed sequence already
> representable by a sequence of characters encoded in
> the standard) would be dealt with by the Unicode normalization
> requirements, as listed above:
> 
> protocol-03.txt, Section 4.2:
> 
> "That string MUST be in Unicode Normalization Form C (NFC ...)".
> 
> >  The fact
> > that most combinations of most of the Jamo map out with NFC or
> > NFKC and are therefore prohibited by IDNA2008 (and
> > IDNA2003) actually supports disallowing them to make that
> > prohibition complete.
> 
> I reach precisely the opposite conclusion.
> 
> If you reach the conclusion that Korean jamos should be
> disallowed this way, you could draw exactly
> the same inference for combining marks that are used to form new
> Latin letters. Trying to disallow conjoining jamos in the *protocol* has
> the same kind of flaws that trying to disallow combining diacritical
> marks for letters does.
> 
> This is tantamount to starting with a completely generic solution
> based on general category values and the notion of character
> combination, but then getting stuck in the rathole of trying to
> disallow particular combining marks and particular combinations
> of them based on some particular group's assertion that
> "we don't use those combinations." It simply isn't the appropriate
> level for the protocol to be concerned with or for this group
> to be attempting to make language-by-language and script-by-script
> exceptions for how such combinations should be used by the
> protocol and which should be disallowed.
> 
> This is particularly inappropriate, given that the whole system
> is devised so that registries can impose their own limits on what
> repertoire they want to allow.
> 
> This would also be opening the maintenance of the IDNA
> table to the potential future instability of having
> to add any *more* Old Hangul conjoining jamo characters that
> might end up added to Unicode in the future to your exception
> list, just to stay consistent.
> 
> > IMO, Edmon is also correct about Chinese radicals.  There
> > are no combining radicals in Unicode.
> 
> Correct.
> 
> >  If there were, we would
> > presumably be having exactly the same discussion about allowing
> > some or all of them to permit constructing obscure or obsolete
> > characters or DISALLOWing the lot.
> 
> No, we would not, because the problem, despite the apparent
> similarity, is radically different. ;-)
> 
> Hangul, by *design*, consists of exactly three slots, with a
> limited set of elements. Those elements are phonetic "letters"
> (the jamos), which instead of being written out linearly, are
> stacked in groups of three, in syllable blocks matching the
> syllabic structure of the Korean language. Hangul is the
> effective equivalent of writing "big tin can man" with Latin
> letters in blocks:
> 
>   b i  t i  c a  m a
>    g    n    n    n
> 
> By *design* in the standard, the equivalences between the preformed
> syllables and the conjoining jamos were a matter of canonical
> equivalence.
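
The regularity of that design shows up in the arithmetic. The sketch below
uses the constants of the Hangul syllable composition algorithm as I
understand it from the Unicode Standard (19 leading consonants, 21 vowels,
27 trailing consonants plus "no trailing consonant"); the function name
compose_hangul is hypothetical, and the result is cross-checked against
Python's own NFC implementation.

```python
import unicodedata

# Constants of the arithmetic Hangul composition (modern jamo only).
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28  # T_COUNT includes "no trailing"


def compose_hangul(lead: int, vowel: int, trail: int = T_BASE) -> str:
    """Compose modern conjoining jamo code points into one syllable.

    trail defaults to T_BASE, which stands for "no trailing consonant".
    """
    l_index = lead - L_BASE
    v_index = vowel - V_BASE
    t_index = trail - T_BASE
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT
            and 0 <= t_index < T_COUNT):
        raise ValueError("not a modern L, V, (T) jamo sequence")
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


print(L_COUNT * V_COUNT * T_COUNT)                    # 11172
print(compose_hangul(0x1100, 0x1161) == "\uAC00")     # True: GA
han = compose_hangul(0x1112, 0x1161, 0x11AB)          # HIEUH + A + NIEUN
print(han == "\uD55C")                                # True: HAN
print(han == unicodedata.normalize("NFC", "\u1112\u1161\u11AB"))  # True
```

The product 19 * 21 * 28 is where the 11172 figure comes from, and nothing
comparable can be written down for Han characters.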
> 
> You would never get anywhere with trying the same analysis
> for Han characters. That system grew over millennia
> by reanalysis of pictographs and ideographs and extensions
> based on radical disambiguations of homophonous words.
> The radicals themselves are often ad hoc, and there is no universally
> agreed-upon set of them. Many components are not separately
> encoded. And *nobody* has ever succeeded in creating a
> generative system of radicals and components that would work
> as a practical system for text processing of Chinese and
> Japanese (and other languages occasionally using Han characters).
> Furthermore, there is *zero* chance that even if such a system were
> built on top of Unicode that equivalences of this sort would
> be incorporated into Unicode normalization as *canonical*
> (or even *compatibility*) equivalences.
> 
> Hangul Syllables and Han characters are so very different that
> any argument from analogy between them is simply not valid.
> 
> >  Conversely, the fact that
> > combining radicals didn't get codepoints may raise questions, at
> > least for IDNA,
> 
> Can we please stop revisiting the architecture of Unicode
> and ISO/IEC 10646 in this way? The encoding of Han characters
> as units has been a consensus
> decision which has been reached for *every* single East Asian
> national encoding in use and for every single commercial
> character set involving Han characters, and which is the unanimous
> opinion among the IRG members who are responsible for Han
> character repertoire additions to Unicode and ISO/IEC 10646.
> 
> Continuing to raise the "Maybe Han characters should have
> been encoded with combining pieces" question is
> simply counterproductive to the task of dealing with
> the Unicode Standard as it is and coming to closure on
> the table and protocol definition for IDNA 2008.
> 
> --Ken
> 
> > about why the combining Jamo were (I think I
> > know the reason, and it makes sense, but the question is still
> > interesting).