Table issues (Part 2)

Wed Dec 5 01:42:39 CET 2007

Patrik,

Some more feedback on draft-faltstrom-idnabis-tables-03.txt.

Section 4. Codepoints states:

"The Categories and Rules defined in Section 2 and Section 3 apply to
all assigned Unicode characters."

In fact they apply to *unassigned* code points as well.

I think the correct formulation would be:

"The Categories and Rules defined in Section 2 and Section 3 apply to
all Unicode codepoints, assigned or unassigned."

[Note: the Unicode Standard systematically uses a space
in the term "code point", as well as for "code unit",
"code position", "code value", etc. But given that this
document uses "codepoint" everywhere, I'm not suggesting
that be changed. Nobody is going to be confused as to
what the word means.]

Section 1. Introduction

Description of ALWAYS states:

"Once assigned to this category, a character is never removed
from it unless it is removed from Unicode."

The qualification "unless it is removed from Unicode" is
vacuous. Since Unicode 1.1, no character ever has been
removed from Unicode, nor will any be -- in part because
no character will ever be removed from ISO/IEC 10646.

So this is a little like qualifying the
definition of ASCII LDH as "{0061..007A, 0030..0039, 002D}
and no characters will be removed from this definition
unless they are removed from ASCII."

So I suggest just removing the vacuous qualification.

The large paragraph in the middle of page 3 states:

"It should be suitable for newer revisions of Unicode, as 
long as the Unicode properties on which it is based remain stable."

This points out a fundamental problem with the categories
used for this IDN property, in that it depends crucially
on stability of the Script property. However, unlike
a number of the Unicode (immutable) properties and
algorithms (such as normalization) for which there are
stability guarantees that the Unicode Consortium takes
very seriously, there is *NO* stability guarantee for
the Script property. To assume that there is one is
setting this specification directly on a collision course
with the UTC at some point in the future.

Let me illustrate this with a *real* (not hypothetical)
issue which just came up in the Unicode list discussions
during the last week.

Egyptologists are considering specifying
U+0486 COMBINING CYRILLIC PSILI PNEUMATA for their
use of the Egyptological yod in transliteration. The
character has the correct semantics and shape, and
using it would avoid a potential shaping and positioning
issue with U+0313 COMBINING COMMA ABOVE. But if they
start using it, they would be using it in combination
with *Latin* characters, not Old Church Slavonic (Cyrillic)
characters. In part to avoid potential font selection
heuristic issues, they are now requesting that
U+0486 have its Script property be changed from
Script=Cyrillic (its current value in Unicode 5.0)
to Script=Inherited (in Unicode 5.1). This is a perfectly
reasonable type of change, from the UTC's point of view,
and parallels other changes made to script properties
when a character once seen as belonging only to a single
script comes to have an application in another script,
in which case it might change to Script=Common or
Script=Inherited, depending on the case.

Now consider the implications for the Calculation of
the Derived Property in Section 3 of 
draft-faltstrom-idnabis-tables-03.txt. By the Unicode 5.0
property values, U+0486 would be determined to be ALWAYS.
And in fact, Appendix A has the resulting entry:

0483..0486 ; ALWAYS    # COMBINING CYRILLIC TITLO..COMBINING CYRILLIC PS

*But*, if the UTC deals with the reasonable request from
the Egyptologists by changing U+0486 to Script=Inherited
or Script=Common, then by the Calculation in Section 3,
U+0486 would end up instead as MAYBE YES.

ALWAYS ==> MAYBE YES is a prohibited state transition
for this IDN property.
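
To make the failure mode concrete, here is a minimal sketch in Python
of a check that compares two versions of a derived property table and
reports any code point that leaves a category claimed to be stable.
Nothing in it comes from the draft itself: the function name
prohibited_transitions, the STABLE set, and the two small tables are
hypothetical, and the "after" value for U+0486 is only the outcome
described above, assuming the Script change is made.

```python
# Hypothetical sketch, not taken from the draft: flag code points whose
# derived IDN property leaves a category that is claimed to be stable.

# Assumption: ALWAYS and NEVER are the categories that must never lose members.
STABLE = {"ALWAYS", "NEVER"}


def prohibited_transitions(old, new):
    """Return (code point, old value, new value) for each code point that
    was in a stable category in `old` and has a different value in `new`.
    A code point missing from `new` is treated as unchanged."""
    problems = []
    for cp, old_value in sorted(old.items()):
        new_value = new.get(cp, old_value)
        if old_value in STABLE and new_value != old_value:
            problems.append((cp, old_value, new_value))
    return problems


# Derived values for 0483..0486 as listed in Appendix A (Unicode 5.0 data).
table_5_0 = {0x0483: "ALWAYS", 0x0484: "ALWAYS",
             0x0485: "ALWAYS", 0x0486: "ALWAYS"}

# Hypothetical result if U+0486 becomes Script=Inherited or Script=Common.
table_5_1 = dict(table_5_0)
table_5_1[0x0486] = "MAYBE YES"

for cp, before, after in prohibited_transitions(table_5_0, table_5_1):
    print(f"U+{cp:04X}: {before} ==> {after} (prohibited)")
```

Run as written, this reports only U+0486. A check of this kind can
detect the breakage after a property change, but it cannot prevent it;
that would require the derivation itself not to depend on a property
that lacks a stability guarantee.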

That is a very bad situation to be in for this specification.
Some property change that is perfectly normal and reasonable
for the UTC would end up resulting in what an IETF
protocol would be claiming is an absolutely prohibited
change. We'd either end up with finger-pointing and blame
going back and forth regarding who is responsible for
the destabilization, or else end up hacking up emergency
fixes to the IDNA RFC, so that the calculation of the
derived property itself could be stabilized against
future changes to properties it depends on.

Meanwhile, the constituencies of character users that
come to the UTC -- like these Egyptologists -- would
be flabbergasted, flustered, and highly annoyed, if
they were simply met by intransigence on the UTC's
part, saying, "Sorry, we can't change that Script
property for you, because we are absolutely bound by
a stability guarantee for IDNs. Changing that property
would destabilize IDNA, and we can't do that."
They would neither understand nor stand for what would
seem to them to be a completely arbitrary and
unmotivated stonewalling on the issue. It's bad enough
the crap we have to put up with in maintaining stability
for normalization, for which the UTC is highly motivated
to maintain stability for many reasons besides its
use in IETF protocols.

I really think the whole MAYBE YES category based on
Script determinations is a trap for this specification.
It is attempting to be conservative in making commitments
regarding the use of certain characters in IDN. But
in using the Script property distinctions the way
Section 3 currently is doing, all it really is going
to accomplish is to *guarantee*, in the not too distant
future, a stability meltdown for the ALWAYS and NEVER
categories that the specification is trying to define
and claim stability for.

Regards,

--Ken