Consensus Call Tranche 8 (Character Adjustments)

Mark Davis mark at macchiato.com
Tue Oct 14 22:56:39 CEST 2008


> For Korean, there is no
> equivalent because NFC doesn't produce the relevant precomposed
> forms.   And, because it doesn't, our problem is not one of
> confusing similarity (a registry problem) but one of having
> comparisons work correctly (a much deeper issue which we have
> generally dealt with in the protocol, in the analogous case by
> the requirement for NFC.

John, your first premise, and thus your whole argument, is incorrect. The
combining Jamo *do* form composed characters under NFC. Here is an example:

U+1100 <http://unicode.org/cldr/utility/character.jsp?a=1100> ( ᄀ ) HANGUL CHOSEONG KIYEOK
U+1161 <http://unicode.org/cldr/utility/character.jsp?a=1161> ( ᅡ ) HANGUL JUNGSEONG A
U+11A8 <http://unicode.org/cldr/utility/character.jsp?a=11A8> ( ᆨ ) HANGUL JONGSEONG KIYEOK
=>
U+AC01 <http://unicode.org/cldr/utility/character.jsp?a=AC01> ( 각 ) HANGUL SYLLABLE GAG

That is, each of the precomposed Hangul syllables decomposes into two or
three combining jamo under NFD, and under NFC that sequence of combining
jamo composes back into the same syllable. The comparisons *do* work
correctly, since IDNA labels have to be in NFC.
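
For anyone who wants to check the round trip themselves, here is a minimal
sketch. It assumes Python's standard unicodedata module as the normalizer
(my choice for illustration, not anything IDNA specifies); the variable
names are made up.

```python
import unicodedata

# Conjoining jamo from the example above: choseong, jungseong, jongseong.
jamo = "\u1100\u1161\u11A8"
syllable = "\uAC01"  # HANGUL SYLLABLE GAG

# NFC composes the jamo sequence; NFD decomposes the syllable again.
composed = unicodedata.normalize("NFC", jamo)
decomposed = unicodedata.normalize("NFD", syllable)

print(composed == syllable)   # True
print(decomposed == jamo)     # True
```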

For characters not in modern use, the NFC form may not combine all of the
characters, simply because there may be no corresponding precomposed form
to combine them into. That is not a problem. It is similar to cases with
accents: the NFC form composes as much as it can, and where it can't
compose it leaves the code points separate.
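
A small sketch of that behavior, again assuming Python's unicodedata as the
normalizer. The sample sequences are my own picks for illustration: U+1140
is an archaic initial consonant with no precomposed syllables, and q with
an acute accent has no precomposed Latin form.

```python
import unicodedata

def nfc(s):
    """Return the NFC form of s."""
    return unicodedata.normalize("NFC", s)

# Modern jamo: a precomposed syllable exists, so NFC composes the pair.
modern = "\u1100\u1161"    # CHOSEONG KIYEOK + JUNGSEONG A
# Archaic initial consonant: nothing to compose into, so NFC leaves the
# sequence as it is.
archaic = "\u1140\u1161"   # CHOSEONG PANSIOS + JUNGSEONG A
# Accent analogy: e + acute has a precomposed form, q + acute does not.
e_acute = "e\u0301"
q_acute = "q\u0301"

print(nfc(modern) == "\uAC00")    # True
print(nfc(archaic) == archaic)    # True
print(nfc(e_acute) == "\u00E9")   # True
print(nfc(q_acute) == q_acute)    # True
```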

The key point is that the result is still unique and does not cause a
problem for comparison.
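
To make the uniqueness point concrete, here is a rough sketch of a label
comparison. The helper same_label is hypothetical (it is not part of any
IDNA specification or library), and the labels are invented: an archaic
sequence that cannot compose, followed by a modern syllable spelled either
as jamo or precomposed.

```python
import unicodedata

def same_label(a, b):
    """Hypothetical helper: compare two labels by their NFC forms."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

archaic = "\u1140\u1161"         # stays as separate jamo under NFC
han_jamo = "\u1112\u1161\u11AB"  # composes to U+D55C under NFC
han_syllable = "\uD55C"

label_a = archaic + han_jamo
label_b = archaic + han_syllable

# Both spellings reach the same NFC string, so they compare equal.
print(same_label(label_a, label_b))   # True

# Normalizing a second time changes nothing: the NFC form is stable.
once = unicodedata.normalize("NFC", label_a)
twice = unicodedata.normalize("NFC", once)
print(once == twice)                  # True
```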

Mark


On Tue, Oct 14, 2008 at 10:17 PM, John C Klensin <klensin at jck.com> wrote:

>
>
> --On Tuesday, 14 October, 2008 13:22 -0400 Andrew Sullivan
> <ajs at commandprompt.com> wrote:
>
> > On Sun, Oct 12, 2008 at 05:25:27AM -0400, Vint Cerf wrote:
> >> Consensus Call Tranche 8 (character adjustments)
> >>
> >> Place your reply here: [NO]
> >>
> >> COMMENTS:
> >
> >> (8.a) Make Eszett Protocol-Valid per list discussion.
> >>
> >> (8.b) Make Greek final sigma Protocol-Valid per list
> >> discussion.
> >
> > Since the call is all-or-nothing, I have to respond "no".  On
> > these two, I have no opinion; I don't feel sufficiently
> > qualified to say whether these individual characters should be
> > altered.  My understanding is that, because they are
> > consistent with the tables approach that we are taking, the
> > only reason to exclude them would be historical.
>
> For whatever my opinion is worth, exactly.
>
> >  Since the
> > unhappiness with some of those historical decisions is part of
> > the justification for the current work, it seems to me that
> > these ought to be allowed (although I wonder whether 8.b ought
> > to have a context rule).
>
> Could you explain why you would require a context rule for Final
> Sigma without requiring one for Eszett?  Certainly it would be
> easier to specify a rule for the former ("Script=Greek") while
> the latter would presumably require either "Script=Latin"
> (which wouldn't do much good) or an enumerated list of
> characters.  One can't require that the character actually
> appear in the last position in a label without preventing people
> from constructing labels by cramming words together... any
> prohibition along _those_ lines should certainly be a registry
> decision, IMO.
>
> For the record (and context when that discussion re-emerges on
> the list), at least some of the Greek IDN community would prefer
> that we preserve the IDNA2003 mapping / case-folding behavior
> for final sigma even if that is the only required mapping in
> IDNA2008.
>
> >> (8.c) Disallow conjoining Hangul jamo per recommendation from
> >> KRNIC and others, permitting only precomposed syllables.
> >
> > This appears to open the character-by-character decision
> > making that we already ruled out.  As Mark Davis argues, if we
> > accept this restriction then we probably need to re-open the
> > discussions about obsolete scripts, &c.  It sounds to me very
> > like a registry policy.
>
> Let me try to explain the other point of view, to the extent to
> which I understand the issues as they have been explained to me
> by the group associated with the Korean registry (if I have it
> wrong, I hope they will step in directly).  I am going to try to
> write this so as to not be inflammatory.  If I fail, I want to
> stress that being inflammatory is not my intent and ask
> forgiveness in advance.
>
> Unicode classifies characters in various ways using a collection
> of categories and properties.  Those categories and properties
> (or at least the vast majority of them) were designed long
> before the IETF started thinking about IDNs; they were certainly
> not optimized for IDNA requirements.  Given that, we should be
> grateful and pleasantly surprised that the properties work as
> well as they do for our purposes.  On the other hand, we should
> not be surprised when, for some group of characters, they do
> not... and that has nothing to do with character by character
> decisions, at least as I understand that term.
>
> Before addressing the Hangul question, let me invent an example
> that is counterfactual, i.e., barring something unforeseen, we
> are unlikely to ever have to deal with it directly.   There is a
> proposal pending for ISO/IEC JTC1/SC2/WG2 to add a number of
> annotation marks for Arabic.  These marks are, according to the
> proposal (with confirmation from independent experts) used
> strictly for pedagogical purposes.   Obviously, if one were
> going to transmit the instructional texts electronically in
> other than page image form, they have to have code points.  They
> are identified in the proposal with General Category "Sk"
> (modifier symbols).  With that classification, the rules in
> "Tables" would automatically place them in DISALLOWED.  But
> suppose the proposal had identified them as modifier letters
> instead (I'm told there is a case to be made for that, even
> though the relevant Unicode folks have --wisely from our point
> of view but perhaps not others-- decided otherwise).  Then we
> would need to exclude them (the whole group, not
> character-by-character) as a backward-compatibility issue
> because otherwise, to quote a colleague, we would have a huge
> mess on our hands, with all sorts of equivalences failing.
> Again, this is _not_ an issue, but it may help in thinking about
> the Hangul problem.
>
> For Hangul, the individual Jamo (again, a clearly-identified
> group of characters, not a character-by-character decision) are
> used to construct conventional (and precomposed) characters
> ("Hangul syllables").  To the extent to which there is an
> analogy in Latin-based script, they would be combining
> characters that combine without a base character.  For
> Latin-based scripts, we don't need to worry about conflicts
> between precomposed characters and composing (base+combining
> character) forms of the same characters because the NFC
> requirement deals with the problem.   For Korean, there is no
> equivalent because NFC doesn't produce the relevant precomposed
> forms.   And, because it doesn't, our problem is not one of
> confusing similarity (a registry problem) but one of having
> comparisons work correctly (a much deeper issue which we have
> generally dealt with in the protocol, in the analogous case by
> the requirement for NFC).  If Unicode had assigned properties
> that treated the Syllables differently from the Jamo, we would
> simply build a rule using those categories and we would not be
> having a discussion about, e.g., "character by character
> decisions".  But there is apparently no such property --both the
> Jamo and the Syllables are in General Category "Lo" and the rest
> of the properties appear to match as well.
>
> I think the situation --and the comparison failures that would
> result if we don't deal with it-- makes a strong case for our
> disallowing either the Jamo or the Syllables.  The ccTLD
> registry and local experts strongly prefer that we disallow the
> Jamo, even though it means that some archaic Syllables and
> fanciful forms are disallowed as a consequence.   I think we
> just defer to them.
>
> Just my opinion, of course.
>
> > The argument that some people will get
> > that registry policy wrong has already been floated, and we
> > rejected it.  Indeed, if we don't reject that premise, then
> > all of the local mapping approach that we've taken should be
> > tossed out, and we should go back to strict mapping in the
> > protocol.
>
> Again, the issue here is one of comparison failures, not of
> confusability or other registry policy questions.
>
>    john
>
>
> _______________________________________________
> Idna-update mailing list
> Idna-update at alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/idna-update
>