Table issues (Part 3: CONTEXT)

Thu Dec 6 01:47:42 CET 2007

Patrik,

Next I want to focus specifically on the issues regarding
the newly defined CONTEXT value, and how it is handled in
the algorithm in draft-faltstrom-idnabis-tables-03.txt.

The new Category J "Character Groups Requiring Special
Treatment", in Section 2.2.4, was presumably added in
response to the expressed need to allow *certain*
format characters in IDNs in specific string contexts.

In other words, I am assuming this is associated with the
discussion in the now infamous Unicode PRI #96:

http://www.unicode.org/review/pr-96.html

whose current manifestation is in Section 2.2, "Layout
and Format Control Characters" in the Unicode 5.1
Proposed Update for UAX #31:

http://www.unicode.org/reports/tr31/tr31-8.html#Layout_and_Format_Control_Characters

Category J in draft-faltstrom-idnabis-tables-03.txt,
however, does not line up very well with the exact
list of characters which require special treatment.

UAX #31, Section 2.2 calls out the following characters:

U+200C ZWNJ
U+200D ZWJ
U+202F NNBSP (used as part of words in Mongolian)
U+180B..U+180D (mandatory in the spelling of some words in Mongolian)

Category J, on the other hand, calls out the following
class:

generalCategory(cp) = (Cf)

The problem here is that the latter definition is very, very
much broader than what UAX #31 is talking about, and includes
literally hundreds of very, very problematical characters,
none of which should EVER occur in IDNs, under any circumstances.
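
To give a rough sense of the scale, here is a minimal Python sketch
that counts the gc=Cf code points with the standard unicodedata
module and checks which of the UAX #31 characters are among them.
It reflects whatever Unicode version the Python build ships, not
Unicode 5.0 or 5.1, so the exact count will differ. The names cf,
cf_set and uax31_list are only illustrative.

```python
import sys
import unicodedata

# Every code point whose General_Category is Cf in this Python's Unicode data.
cf = [cp for cp in range(sys.maxunicode + 1)
      if unicodedata.category(chr(cp)) == "Cf"]
cf_set = set(cf)

# The characters called out in UAX #31, Section 2.2 (as listed above).
uax31_list = [0x200C, 0x200D, 0x202F, 0x180B, 0x180C, 0x180D]

# Which of the UAX #31 characters does a gc=Cf definition actually capture?
uax31_in_cf = [cp for cp in uax31_list if cp in cf_set]

print("gc=Cf code points:", len(cf))
print("UAX #31 characters that are gc=Cf:",
      [f"U+{cp:04X}" for cp in uax31_in_cf])
```

Only the two joiners come back from the second check; everything
else in the gc=Cf list is outside what UAX #31 is talking about.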

And if anything, the discussion in 2.2.4 implies that the
situation could get worse, noting that the list may be
increased "(for example with Cc)" -- i.e., adding ISO control
characters, which also should NEVER occur in IDNs.

The latter is actually a contradiction in the spec as it
currently stands, as by the current derivation, all of the
gc=Cc control codes are given the "NEVER" value, so they
would be prohibited from being added to Category J, in
any case.

Now the way Category J is used in the derivation in Section 3,
*all* gc=Cf characters in Unicode are given the CONTEXT value.
This is really not a useful outcome at all, because there
is (I think) a very strong presumption that nearly all of
them should very clearly be given the NEVER value.

Furthermore, even taken in the context of attempting to
address the concerns raised in UAX #31, Section 2.2,
Category J is currently completely mismatched against
the phenomena in question. To wit:

U+200C ZWNJ
U+200D ZWJ

Those two are gc=Cf, hence in Category J, thus get
assigned "CONTEXT" by the second bullet of phase one of
the Calculation.

U+202F NNBSP

That is gc=Zs, hence not in Category A, and gets
assigned "NEVER" by the first bullet in the Other script branch
of phase two of the Calculation.

U+180B..U+180D Mongolian free variation selectors

Those are gc=Mn, hence in Category A, and get assigned
"MAYBE YES" by the 2nd bullet, first asterisk in the Other
script branch of phase two of the Calculation.

Hence the 3 sets of characters supposedly involved in context
decisions are handled by the algorithm in a completely disjunct
and unmotivated way, despite the addition of a CONTEXT-related
category supposed to deal with this issue.
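
The General_Category values behind this are easy to check. The
following is a small Python sketch, not the Calculation from the
draft: the table OUTCOME_BY_GC encodes only the three outcomes
described above, and the names in it are made up for illustration.

```python
import unicodedata

# Outcome by General_Category, restricted to the three cases discussed
# above. This is NOT the full Calculation from the draft.
OUTCOME_BY_GC = {
    "Cf": "CONTEXT",    # in Category J, assigned in phase one
    "Zs": "NEVER",      # not in Category A, assigned in phase two
    "Mn": "MAYBE YES",  # in Category A, assigned in phase two
}

# The characters called out in UAX #31, Section 2.2.
CHARS = [0x200C, 0x200D, 0x202F, 0x180B, 0x180C, 0x180D]

def describe(cp):
    """Return (General_Category, outcome) for one of the characters above."""
    gc = unicodedata.category(chr(cp))
    return gc, OUTCOME_BY_GC[gc]

for cp in CHARS:
    gc, outcome = describe(cp)
    print(f"U+{cp:04X}  gc={gc}  -> {outcome}")
```

Three different General_Category values, three different outcomes,
and only one of them is CONTEXT.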

The outcome for the Mongolian free variation selectors
points out a defect in another part of the specification.
Those characters are Default_Ignorable_Code_Point, by virtue
of having the Variation_Selector property, but Category D
was pared back to include only Other_Default_Ignorable_Code_Point
and Noncharacter_Code_Point, so it misses all the variation
selectors.

In my opinion, the way this should work is that only
*two* code points should be called out for special
context treatment in the IDNA specs, namely:

U+200C ZWNJ
U+200D ZWJ

The four characters allowed for identifiers in Mongolian
are a very special case for Mongolian, and I see no particular
need to require them in the more restricted area of
IDNs. Requiring them for Mongolian IDNs is tantamount
to requiring apostrophes or spaces in English domain names --
part of the writing system, sure, but not required in the
more restricted field of domain names.

If so, then U+202F --> NEVER, simply by virtue of its
General_Category value gc=Zs. No special handling required.

And, *all* of the Ignorables should also --> NEVER. That
automatically eliminates U+180B..U+180D, along with lots
and lots of other problematical format characters.

The minimally disruptive fix I see for this in the current
draft would be as follows:

Category D (Section 2.1.4) should simply be defined
as {Default_Ignorable_Code_Point}. I don't think the
rationale for not using {Default_Ignorable_Code_Point}
in the section currently is valid -- and the current
definition manifestly ends up with the wrong results
for the table.

If the problem is that this specification is concerned
about the stability of the derivation, then the
derivation for Default_Ignorable_Code_Point could
simply be recapitulated into Category D itself, avoiding
any lack of clarity.

Category J (Section 2.2.4) should then simply list
two code points: U+200C and U+200D, rather than making
any reference to gc=Cf. (Note that gc=Cf is almost entirely
subsumed by the definition of Default_Ignorable_Code_Point.)
If there is a problem with listing specific code
points (even though no new characters will ever be
added to this list), then the *property* you are
looking for is Join_Control. That is a Unicode property
that those two characters and *only* those two characters
share.

If you did those two things, then even without any other change,
the Calculation in Section 3 would do the right thing
for the joiners and the ignorables. But to make it
even clearer, you could add a rule in the first phase
(after the rule for Category J), that simply specifies
that all elements of Category D --> NEVER. That's the
same result you would get from leaving the Category D
handling in phase 2, but it would be much more clearly stated 
as an absolute rule up front and away from any script
considerations.
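
A minimal Python sketch of what that first phase could look like
is below. It is an illustration under assumptions, not text for the
draft. The standard unicodedata module does not expose
Default_Ignorable_Code_Point, so in_category_d_approx() is a rough,
hypothetical stand-in (gc=Cf plus hand-listed variation selector
ranges); a real table would take the property from the Unicode
data files. All names here are made up for the example.

```python
import unicodedata

# Category J as proposed: exactly the two Join_Control characters.
CATEGORY_J = {0x200C, 0x200D}

# Variation selector ranges, hand-listed for this sketch.
VARIATION_SELECTORS = [(0x180B, 0x180D), (0xFE00, 0xFE0F), (0xE0100, 0xE01EF)]

def in_category_d_approx(cp):
    """Rough, hypothetical stand-in for Default_Ignorable_Code_Point."""
    if unicodedata.category(chr(cp)) == "Cf":
        return True
    return any(lo <= cp <= hi for lo, hi in VARIATION_SELECTORS)

def phase_one(cp):
    """Return CONTEXT or NEVER, or None to defer to phase two."""
    if cp in CATEGORY_J:           # rule for Category J comes first
        return "CONTEXT"
    if in_category_d_approx(cp):   # then: all of Category D --> NEVER
        return "NEVER"
    return None

for cp in (0x200C, 0x200D, 0x180B, 0x00AD, 0x202F):
    print(f"U+{cp:04X} ->", phase_one(cp))
```

The order of the two rules matters: the joiners are themselves
ignorables, so the Category J rule has to be tested before the
Category D rule. U+202F falls through to phase two here, where its
gc=Zs value gives it NEVER without any special handling.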

Regards,

--Ken