Consensus Call Tranche 8 (Character Adjustments)

Martin Duerst duerst at it.aoyama.ac.jp
Wed Oct 15 10:01:17 CEST 2008


Mark is of course correct about Korean syllables.
The reason John got this wrong is probably the following:

Korean is a very hierarchically designed script,
and it depends which level you look at.

On the level of Jamos (the level involved in the current consensus call),
everything is as described by Mark, and as we are used to for other
kinds of characters (except that it is much more regular, and so is
done by formulae rather than tables).
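
To illustrate the "formulae rather than tables" point, here is a minimal
Python sketch of the arithmetic composition. The constants are those of the
Hangul syllable algorithm in the Unicode Standard; the helper name
compose_hangul is made up for this illustration and is not part of any
library. The result is cross-checked against the standard unicodedata module.

```python
import unicodedata

# Constants from the Unicode Standard's Hangul syllable algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def compose_hangul(lead, vowel, trail=None):
    """Compose modern conjoining jamo into one syllable (hypothetical helper)."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    t_index = ord(trail) - T_BASE if trail is not None else 0
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT):
        raise ValueError("leading consonant or vowel is not a modern jamo")
    if trail is not None and not (1 <= t_index < T_COUNT):
        raise ValueError("trailing consonant is not a modern jamo")
    # The whole "table" is this one line of arithmetic.
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


jamo = "\u1100\u1161\u11A8"        # KIYEOK, A, final KIYEOK (Mark's example)
syllable = compose_hangul(*jamo)
print(f"U+{ord(syllable):04X}", unicodedata.name(syllable))

# NFC composes the jamo sequence; NFD decomposes the syllable again.
assert unicodedata.normalize("NFC", jamo) == syllable
assert unicodedata.normalize("NFD", syllable) == jamo
```

Because both directions are pure arithmetic, an implementation needs no
per-syllable data for the 11,172 modern syllables.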

On the lower level, usually called "featural", John would be
correct that there is no defined normalization mechanism.
As an example, take the Jamo
    U+1101 HANGUL CHOSEONG SSANGKIYEOK
If you look at it at http://www.unicode.org/charts/PDF/U1100.pdf,
you might agree that it looks like a sequence of
    U+1100 HANGUL CHOSEONG KIYEOK
    U+1100 HANGUL CHOSEONG KIYEOK
However, there is no decomposition along these lines in Unicode.
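
This can be checked with a short Python sketch, assuming the Unicode data
shipped with Python's standard unicodedata module: the doubled Jamo has no
decomposition mapping, and no normalization form turns either spelling into
the other.

```python
import unicodedata

ssangkiyeok = "\u1101"        # HANGUL CHOSEONG SSANGKIYEOK
two_kiyeok = "\u1100\u1100"   # HANGUL CHOSEONG KIYEOK, twice

# An empty string means no decomposition mapping is defined.
print(repr(unicodedata.decomposition(ssangkiyeok)))

# Neither spelling is changed by any of the four normalization forms,
# so they never become equal to each other.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, ssangkiyeok) == ssangkiyeok
    assert unicodedata.normalize(form, two_kiyeok) == two_kiyeok
```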

Regards,    Martin.

At 05:56 08/10/15, Mark Davis wrote:
>> For Korean, there is no
>> equivalent because NFC doesn't produce the relevant precomposed
>> forms.  And, because it doesn't, our problem is not one of
>> confusing similarity (a registry problem) but one of having
>> comparisons work correctly (a much deeper issue which we have
>> generally dealt with in the protocol, in the analogous case by
>> the requirement for NFC).
>
>John, your first premise, and thus your whole argument is incorrect. The combining Jamo *do* form composed characters under NFC. Here is an example:
>
><http://unicode.org/cldr/utility/character.jsp?a=1100>U+1100 HANGUL CHOSEONG KIYEOK
><http://unicode.org/cldr/utility/character.jsp?a=1161>U+1161 HANGUL JUNGSEONG A
><http://unicode.org/cldr/utility/character.jsp?a=11A8>U+11A8 HANGUL JONGSEONG KIYEOK
>=>
><http://unicode.org/cldr/utility/character.jsp?a=AC01>U+AC01 HANGUL SYLLABLE GAG
>
>That is, each of the Hangul precomposed syllables decomposes into two or three combining jamo under NFD, and under NFC that sequence of combining jamo composes back into that syllable. The comparisons *do* work correctly, since IDNA labels have to be in NFC.
>
>For non-modern use characters, the NFC form may not combine all of the characters, simply because there may not be a corresponding precomposed form to combine them into. That is not a problem. It is similar to cases with accents; the NFC form composes as much as it can, but where it can't compose it leaves the code points separate.
>
>The key point is that the result is still unique and does not cause a problem for comparison.
>
>Mark
>
>
>On Tue, Oct 14, 2008 at 10:17 PM, John C Klensin <klensin at jck.com> wrote:
>
>--On Tuesday, 14 October, 2008 13:22 -0400 Andrew Sullivan
><ajs at commandprompt.com> wrote:
>>> (8.c) Disallow conjoining Hangul jamo per recommendation from 
>>> KRNIC and others, permitting only precomposed syllables. 
>> 
>> This appears to open the character-by-character decision 
>> making that we already ruled out.  As Mark Davis argues, if we 
>> accept this restriction then we probably need to re-open the 
>> discussions about obsolete scripts, &c.  It sounds to me very 
>> like a registry policy.
>For Hangul, the individual Jamo (again, a clearly-identified 
>group of characters, not a character-by-character decision) are 
>used to construct conventional (and precomposed) characters 
>("Hangul syllables").  To the extent to which there is an 
>analogy in Latin-based script, they would be combining 
>characters that combine without a base character.  For 
>Latin-based scripts, we don't need to worry about conflicts 
>between precomposed characters and composing (base+combining 
>character) forms of the same characters because the NFC 
>requirement deals with the problem.   For Korean, there is no 
>equivalent because NFC doesn't produce the relevant precomposed 
>forms.   And, because it doesn't, our problem is not one of 
>confusing similarity (a registry problem) but one of having 
>comparisons work correctly (a much deeper issue which we have 
>generally dealt with in the protocol, in the analogous case by 
>the requirement for NFC).  If Unicode had assigned properties 
>that treated the Syllables differently from the Jamo, we would 
>simply build a rule using those categories and we would not be 
>having a discussion about, e.g., "character by character 
>decisions".  But there is apparently no such property --both the 
>Jamo and the Syllables are in General Category "Lo" and the rest 
>of the properties appear to match as well.
>I think the situation --and the comparison failures that would 
>result if we don't deal with it-- makes a strong case for our 
>disallowing either the Jamo or the Syllables.  The ccTLD 
>registry and local experts strongly prefer that we disallow the 
>Jamo, even though it means that some archaic Syllables and 
>fanciful forms are disallowed as a consequence.   I think we 
>just defer to them.
>Just my opinion, of course.
>> The argument that some people will get 
>> that registry policy wrong has already been floated, and we 
>> rejected it.  Indeed, if we don't reject that premise, then 
>> all of the local mapping approach that we've taken should be 
>> tossed out, and we should go back to strict mapping in the 
>> protocol.
>Again, the issue here is one of comparison failures, not of 
>confusability or other registry policy questions. 
>   john
>_______________________________________________ 
>Idna-update mailing list 
>Idna-update at alvestrand.no
>http://www.alvestrand.no/mailman/listinfo/idna-update


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp      mailto:duerst at it.aoyama.ac.jp    


