Consensus Call Tranche 8 (Character Adjustments)

Tue Oct 14 22:17:56 CEST 2008

--On Tuesday, 14 October, 2008 13:22 -0400 Andrew Sullivan
<ajs at commandprompt.com> wrote:

> On Sun, Oct 12, 2008 at 05:25:27AM -0400, Vint Cerf wrote:
>> Consensus Call Tranche 8 (character adjustments)
>> 
>> Place your reply here: [NO]
>> 
>> COMMENTS:
> 
>> (8.a) Make Eszett Protocol-Valid per list discussion.
>> 
>> (8.b) Make Greek final sigma Protocol-Valid per list
>> discussion.
> 
> Since the call is all-or-nothing, I have to respond "no".  On
> these two, I have no opinion; I don't feel sufficiently
> qualified to say whether these individual characters should be
> altered.  My understanding is that, because they are
> consistent with the tables approach that we are taking, the
> only reason to exclude them would be historical.

For whatever my opinion is worth, exactly.

>  Since the
> unhappiness with some of those historical decisions is part of
> the justification for the current work, it seems to me that
> these ought to be allowed (although I wonder whether 8.b ought
> to have a context rule).

Could you explain why you would require a context rule for Final
Sigma without requiring one for Eszett?  Certainly it would be
easier to specify a rule for the former ("Script=Greek") while
the latter would presumably require either "Script=Latin"
(which wouldn't do much good) or an enumerated list of
characters.  One can't require that the character actually
appear in the last position in a label without preventing people
from constructing labels by cramming words together... any
prohibition along _those_ lines should certainly be a registry
decision, IMO.

For the record (and context when that discussion re-emerges on
the list), at least some of the Greek IDN community would prefer
that we preserve the IDNA2003 mapping / case-folding behavior
for final sigma even if that is the only required mapping in
IDNA2008.
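
To make the mapping in question concrete, here is a minimal sketch.
It assumes Python's built-in str.casefold(), which applies Unicode
full case folding; that is close to, but not the same thing as, the
IDNA2003 (Nameprep) mapping table, so treat it as an approximation.
The helper name fold_label is hypothetical.

```python
# Approximation only: str.casefold() is Unicode full case folding,
# used here as a stand-in for the IDNA2003 mapping step.

FINAL_SIGMA = "\u03c2"   # GREEK SMALL LETTER FINAL SIGMA
SMALL_SIGMA = "\u03c3"   # GREEK SMALL LETTER SIGMA
ESZETT = "\u00df"        # LATIN SMALL LETTER SHARP S


def fold_label(label: str) -> str:
    """Hypothetical helper: fold a label the way a mapping step might."""
    return label.casefold()


# Final sigma folds to the ordinary small sigma.
greek = "\u03bf\u03b4\u03cc" + FINAL_SIGMA
print(fold_label(greek) == "\u03bf\u03b4\u03cc" + SMALL_SIGMA)

# Eszett folds to the two-letter sequence "ss".
print(fold_label("stra" + ESZETT + "e"))
```

Under this folding the two sigma forms compare equal after mapping,
which is the behavior being asked for; making final sigma
Protocol-Valid without any mapping would leave them distinct.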

>> (8.c) Disallow conjoining Hangul jamo per recommendation from
>> KRNIC and others, permitting only precomposed syllables.
> 
> This appears to open the character-by-character decision
> making that we already ruled out.  As Mark Davis argues, if we
> accept this restriction then we probably need to re-open the
> discussions about obsolete scripts, &c.  It sounds to me very
> like a registry policy. 

Let me try to explain the other point of view, to the extent to
which I understand the issues as they have been explained to me
by the group associated with the Korean registry (if I have it
wrong, I hope they will step in directly).  I am going to try to
write this so as to not be inflammatory.  If I fail, I want to
stress that being inflammatory is not my intent and ask
forgiveness in advance.

Unicode classifies characters in various ways using a collection
of categories and properties.  Those categories and properties
(or at least the vast majority of them) were designed long
before the IETF started thinking about IDNs; they were certainly
not optimized for IDNA requirements.  Given that, we should be
grateful and pleasantly surprised that the properties work as
well as they do for our purposes.  On the other hand, we should
not be surprised when, for some group of characters, they do
not... and that has nothing to do with character by character
decisions, at least as I understand that term.  

Before addressing the Hangul question, let me invent an example
that is counterfactual, i.e., barring something unforeseen, we
are unlikely to ever have to deal with it directly.   There is a
proposal pending for ISO/IEC JTC1/SC2/WG2 to add a number of
annotation marks for Arabic.  These marks are, according to the
proposal (with confirmation from independent experts), used
strictly for pedagogical purposes.   Obviously, if one were
going to transmit the instructional texts electronically in
other than page image form, the marks would have to have code
points.  They
are identified in the proposal with General Category "Sk"
(modifier symbols).  With that classification, the rules in
"Tables" would automatically place them in DISALLOWED.  But
suppose the proposal had identified them as modifier letters
instead (I'm told there is a case to be made for that, even
though the relevant Unicode folks have --wisely from our point
of view but perhaps not others-- decided otherwise).  Then we
would need to exclude them (the whole group, not
character-by-character) as a backward-compatibility issue
because otherwise, to quote a colleague, we would have a huge
mess on our hands, with all sorts of equivalences failing.
Again, this is _not_ an issue, but it may help in thinking about
the Hangul problem.
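
A minimal sketch of how the General Category alone drives that
outcome.  It assumes Python's standard unicodedata module and a
deliberately simplified stand-in for the "Tables" rules: only the
letter/digit category test is modeled, with no exceptions, stability
checks, or context rules.  The function name and the
"candidate-PVALID" label are hypothetical, and the two sample code
points are existing characters chosen for their categories, not the
proposed Arabic marks.

```python
import unicodedata

# Simplified assumption: categories that pass the letter/digit test.
LETTER_DIGIT_CATEGORIES = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}


def simplified_status(ch: str) -> str:
    """Classify one character by General Category only (sketch)."""
    if unicodedata.category(ch) in LETTER_DIGIT_CATEGORIES:
        return "candidate-PVALID"
    return "DISALLOWED"


# U+02C2 is a modifier symbol (Sk); U+02B0 is a modifier letter (Lm).
for ch in ("\u02c2", "\u02b0"):
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
        f"{unicodedata.category(ch)} -> {simplified_status(ch)}"
    )
```

The same group of characters lands on opposite sides of the rule
depending only on which category it was assigned.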

For Hangul, the individual Jamo (again, a clearly-identified
group of characters, not a character-by-character decision) are
used to construct conventional (and precomposed) characters
("Hangul syllables").  To the extent to which there is an
analogy in Latin-based script, they would be combining
characters that combine without a base character.  For
Latin-based scripts, we don't need to worry about conflicts
between precomposed characters and composing (base+combining
character) forms of the same characters because the NFC
requirement deals with the problem.   For Korean, there is no
equivalent because NFC doesn't produce the relevant precomposed
forms.   And, because it doesn't, our problem is not one of
confusing similarity (a registry problem) but one of having
comparisons work correctly (a much deeper issue which we have
generally dealt with in the protocol, in the analogous case by
the requirement for NFC).  If Unicode had assigned properties
that treated the Syllables differently from the Jamo, we would
simply build a rule using those categories and we would not be
having a discussion about, e.g., "character by character
decisions".  But there is apparently no such property --both the
Jamo and the Syllables are in General Category "Lo" and the rest
of the properties appear to match as well.
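
Two of the points above can be checked with a short sketch, assuming
Python's standard unicodedata module.  It shows the Latin case, where
NFC makes the base+combining and precomposed forms compare equal, and
it shows that a conjoining Jamo and a precomposed Syllable report the
same General Category.  It does not attempt to model the Hangul
comparison problem itself.  The helper name labels_match is
hypothetical.

```python
import unicodedata


def labels_match(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization (sketch)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Latin: "e" + COMBINING ACUTE ACCENT versus precomposed U+00E9.
decomposed = "e\u0301"
precomposed = "\u00e9"
print(decomposed == precomposed)              # raw code points differ
print(labels_match(decomposed, precomposed))  # equal once NFC is applied

# Hangul: one conjoining Jamo and one precomposed Syllable.
jamo = "\u1100"      # HANGUL CHOSEONG KIYEOK
syllable = "\uac00"  # HANGUL SYLLABLE GA
print(unicodedata.category(jamo), unicodedata.category(syllable))
```

Because both Hangul characters report "Lo", a rule written purely in
terms of General Category cannot separate the two groups.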

I think the situation --and the comparison failures that would
result if we don't deal with it-- makes a strong case for our
disallowing either the Jamo or the Syllables.  The ccTLD
registry and local experts strongly prefer that we disallow the
Jamo, even though it means that some archaic Syllables and
fanciful forms are disallowed as a consequence.   I think we
should just defer to them.

Just my opinion, of course.

> The argument that some people will get
> that registry policy wrong has already been floated, and we
> rejected it.  Indeed, if we don't reject that premise, then
> all of the local mapping approach that we've taken should be
> tossed out, and we should go back to strict mapping in the
> protocol.

Again, the issue here is one of comparison failures, not of
confusability or other registry policy questions.

    john