CONTEXTJ Rules (was: Re: draft-ietf-idnabis-tables-06a.txt)

Kenneth Whistler kenw at sybase.com
Tue Jul 21 03:07:02 CEST 2009


Patrik,

> http://stupid.domain.name/stuff/draft-ietf-idnabis-tables-06a.txt

There are very, very serious errors in the two CONTEXTJ rules
for ZWNJ and ZWJ. As currently stated, they are both
underspecified and overspecified, and both rules
eliminate *all* of the contexts that would actually
be appropriate for use of ZWNJ or ZWJ in scripts of India.

Furthermore, the two rules rathole on the unrelated problem
of trying to constrain domain names to single scripts,
which serves to accomplish nothing here but to hide the
actual context constraints of relevance to these
two characters.

I would suggest that the two rules drop all reference to constraints
trying to keep the entire domain name in a single script -- that
can better be handled by registries than by a protocol rule
at this point.

Here is the 10,000 meter overview of what these rules should
be specifying:

ZWNJ is needed in the Arabic script on occasion to break
a cursive connection. Its useful
context is when it is preceded by a left-joining or dual-joining
character and followed by a right-joining or dual-joining
character (possibly with transparent characters, such
as combining vowel marks, intervening).

ZWJ is *NOT* needed in the Arabic script (for the purposes we
are concerned with).

ZWNJ and ZWJ are both needed in the scripts of India and
Sri Lanka on occasion. Their useful context is in consonant
conjuncts, or more precisely when preceded by a virama
which itself is preceded by a consonant letter.

Note that the contexts as summarized above will already
do a lot to constrain the use of ZWNJ and ZWJ in domain
names to the appropriate scripts. Why? Because the *only*
characters that have right-, left-, or dual-joining properties
are those in cursive scripts that may require an orthographic
break in a cursive connection: Arabic, most importantly, but
also Syriac. Domain name labels in all other scripts would
get automatically dumped by this constraint, because none of
their letters have the requisite cursive properties.

And for Indic scripts, the context is also automatically
constrained by requiring a preceding virama. Only the
relevant scripts have virama characters. Furthermore, each
virama for each script has the corresponding script property,
which means that any higher-level constraint sensitive to
the scripts of labels would see and test any virama. Eliminate
the virama, and you would automatically eliminate any
inappropriate contexts for a ZWNJ or ZWJ.
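
As a rough illustration of the virama test (a sketch only, not part of
the proposed text): Python's standard unicodedata module exposes the
Canonical_Combining_Class, and the class named Virama has the numeric
value 9. The helper name is_virama is made up for this example.

```python
import unicodedata

# Canonical_Combining_Class value whose name is "Virama".
VIRAMA_CCC = 9

def is_virama(ch):
    """Return True if ch has Canonical_Combining_Class = Virama (9)."""
    return unicodedata.combining(ch) == VIRAMA_CCC

# Each script's virama is a distinct character whose name (and script
# property) identifies the script it belongs to.
for cp in (0x094D, 0x09CD, 0x0DCA):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)} virama={is_virama(ch)}")
```

Ordinary letters and the Arabic vowel marks have other combining
classes, so this one test already excludes them.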

Given these considerations, the context rules for ZWNJ and
ZWJ can be written much more succinctly, comprehensibly,
simply, and reproducibly. I suggest the following:

=============================================================

Appendix A.2  ZERO WIDTH NON-JOINER
   Code point:
      U+200C
   Overview:
      This may occur in a formally cursive script (such
      as Arabic) in a context where it breaks a cursive
      connection as required for orthographic rules, as
      in the Persian language, for example. It also may
      occur in Indic scripts in a consonant conjunct
      context (immediately following a virama), to
      control required display of such conjuncts.
   Lookup:
      True
   Rule Set:
      False;
      If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True;
      If RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*\u200C
                     (Joining_Type:T)*(Joining_Type:{R,D})) Then True;

=============================================================

Appendix A.3  ZERO WIDTH JOINER
   Code point:
      U+200D
   Overview:
      This may occur in Indic scripts in a consonant conjunct
      context (immediately following a virama), to
      control required display of such conjuncts.
   Lookup:
      True
   Rule Set:
      False;
      If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True;

=============================================================
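
To give a feel for how small an implementation of the two rule sets
could be, here is a hedged Python sketch. It is one possible reading of
the rules above, not a reference implementation. Assumptions: the
JOINING_TYPE table is a hand-picked, deliberately tiny excerpt (a real
implementation would load the full Joining_Type data from the Unicode
Character Database, which Python's unicodedata module does not
expose), and the function names are invented for the example.

```python
import unicodedata

ZWNJ = "\u200C"
ZWJ = "\u200D"
VIRAMA_CCC = 9  # Canonical_Combining_Class value named "Virama"

# ASSUMPTION: illustrative excerpt only; anything not listed here is
# treated as non-joining ("U"), which is wrong for most cursive letters.
JOINING_TYPE = {
    "\u0627": "R",  # ARABIC LETTER ALEF
    "\u0628": "D",  # ARABIC LETTER BEH
    "\u0645": "D",  # ARABIC LETTER MEEM
    "\u06CC": "D",  # ARABIC LETTER FARSI YEH
    "\u064E": "T",  # ARABIC FATHA (transparent)
}

def joining_type(ch):
    return JOINING_TYPE.get(ch, "U")

def after_virama(label, i):
    """Canonical_Combining_Class(Before(cp)) .eq. Virama"""
    return i > 0 and unicodedata.combining(label[i - 1]) == VIRAMA_CCC

def zwnj_allowed(label, i):
    """Rule set for U+200C at index i; the default is False."""
    if after_virama(label, i):
        return True
    # (Joining_Type:{L,D})(Joining_Type:T)* before the ZWNJ
    j = i - 1
    while j >= 0 and joining_type(label[j]) == "T":
        j -= 1
    if j < 0 or joining_type(label[j]) not in ("L", "D"):
        return False
    # (Joining_Type:T)*(Joining_Type:{R,D}) after the ZWNJ
    k = i + 1
    while k < len(label) and joining_type(label[k]) == "T":
        k += 1
    return k < len(label) and joining_type(label[k]) in ("R", "D")

def zwj_allowed(label, i):
    """Rule set for U+200D at index i; the default is False."""
    return after_virama(label, i)
```

For instance, zwnj_allowed("\u0645\u06CC\u200C\u0628", 2) accepts a
ZWNJ between two dual-joining Arabic letters, while the same call on
"a\u200Cb" falls through to the default and is rejected.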

Note that I have also turned the default values around for
these rule sets. With the more cleanly defined contexts,
it is much better to default these to False, and then
only return True for the precisely defined exceptional contexts
where they may occur.

I think if the rule sets are stated this way, there is a vastly
greater chance that implementers will implement these context
rules in compatible and interoperable ways. The code will
also be immensely simpler (and less prone to irrelevant
bugs) than as the rule sets are stated in the current draft.

--Ken



