CONTEXTJ Rules (was: Re: draft-ietf-idnabis-tables-06a.txt)

Patrik Fältström patrik at frobbit.se
Tue Jul 21 10:12:34 CEST 2009


On 21 jul 2009, at 03.07, Kenneth Whistler wrote:

>> http://stupid.domain.name/stuff/draft-ietf-idnabis-tables-06a.txt
>
> There are very, very serious errors in the two CONTEXTJ rules
> for ZWNJ and ZWJ. As currently stated, they are both
> underspecified and overspecified, and both rules actually
> eliminate *all* of the contexts that would actually
> be appropriate for use of ZWNJ or ZWJ in scripts of India.

Ouch...

I tried to use what I got from Mark, as far as I understood it... sorry.

> Furthermore, the two rules rathole on the unrelated problem
> of trying to constrain domain names to single scripts,
> which serves to accomplish nothing here but to hide the
> actual context constraints of relevance to these
> two characters.
>
> I would suggest that the two rules drop all reference to constraints
> trying to keep all the domain name in a single script -- that
> can better be handled by registries than by a protocol rule
> at this point.

Ok.

> Here is the 10,000 meter overview of what these rules should
> be specifying:
>
> ZWNJ is needed in the Arabic script on occasion to break
> a cursive connection. Its useful
> context is when it is preceded by a left-joining or dual-joining
> character and followed by a right-joining or dual-joining
> character (possibly with transparent characters, such
> as combining vowel marks, intervening).

Ok.

> ZWJ is *NOT* needed in the Arabic script (for the purposes we
> are concerned with).

Ok.

> ZWNJ and ZWJ are both needed in the scripts of India and
> Sri Lanka on occasion.

You imply Sinhala for example here?

> Their useful context is in consonant
> conjuncts, or more precisely when preceded by a virama
> which itself is preceded by a consonant letter.
>
> Note that the contexts as summarized above will already
> do a lot to constrain the use of ZWNJ and ZWJ in domain
> names to the appropriate scripts. Why? Because the *only*
> characters that have right-, left-, or dual-joining properties
> are those in cursive scripts that may require an orthographic
> break in a cursive connection: Arabic, most importantly, but
> also Syriac. Domain name labels in all other scripts would
> get automatically dumped by this constraint, because none of
> their letters have the requisite cursive properties.

Ok.

> And for Indic scripts, the context is also automatically
> constrained by requiring a preceding virama. Only the
> relevant scripts have virama characters. Furthermore, each
> virama for each script has the corresponding script property,
> which means that any higher-level constraint sensitive to
> the scripts of labels would see and test any virama. Eliminate
> the virama, and you would automatically eliminate any
> inappropriate contexts for a ZWNJ or ZWJ.
>
> Given these considerations, the context rules for ZWNJ and
> ZWJ can be written much more succinctly, comprehensibly,
> simply, and reproducibly. I suggest the following:

I actually think I understand this description. Thanks Ken.

I will accept this suggestion as a change for the 06b version.

> =============================================================
>
> Appendix A.2  ZERO WIDTH NON-JOINER
>   Code point:
>      U+200C
>   Overview:
>      This may occur in a formally cursive script (such
>      as Arabic) in a context where it breaks a cursive
>      connection as required for orthographic rules, as
>      in the Persian language, for example. It also may
>      occur in Indic scripts in a consonant conjunct
>      context (immediately following a virama), to
>      control required display of such conjuncts.
>   Lookup:
>      True
>   Rule Set:
>      False;
>      If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True;
>      If RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*\u200C
>                     (Joining_Type:T)*(Joining_Type:{R,D})) Then True;
>
> =============================================================
>
> Appendix A.3  ZERO WIDTH JOINER
>   Code point:
>      U+200D
>   Overview:
>      This may occur in Indic scripts in a consonant conjunct
>      context (immediately following a virama), to
>      control required display of such conjuncts.
>   Lookup:
>      True
>   Rule Set:
>      False;
>      If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True;
>
> =============================================================

Fixed!
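
For implementers, the two rule sets quoted above could be sketched
roughly as follows. This is only an illustration in Python, not text
from the draft. The function names and the JOINING_TYPE table are my
own assumptions: Python's unicodedata module does not expose
Joining_Type, so the table below is a tiny hand-picked subset, and a
real implementation would load the full property from the Unicode
Character Database. The virama test relies on
Canonical_Combining_Class 9, which is the value named Virama.

```python
import unicodedata

ZWNJ = "\u200C"
ZWJ = "\u200D"
VIRAMA_CCC = 9  # Canonical_Combining_Class value named "Virama"

# ASSUMPTION: illustrative subset only. The real Joining_Type data
# lives in the Unicode Character Database, not in this table.
JOINING_TYPE = {
    "\u0628": "D",  # ARABIC LETTER BEH (dual-joining)
    "\u0646": "D",  # ARABIC LETTER NOON (dual-joining)
    "\u0645": "D",  # ARABIC LETTER MEEM (dual-joining)
    "\u0627": "R",  # ARABIC LETTER ALEF (right-joining)
    "\u062F": "R",  # ARABIC LETTER DAL (right-joining)
    "\u064E": "T",  # ARABIC FATHA (transparent)
}


def joining_type(ch):
    """Return the Joining_Type of ch, or 'U' if not in the sample table."""
    return JOINING_TYPE.get(ch, "U")


def after_virama(label, i):
    """True if the code point just before position i has ccc = Virama."""
    return i > 0 and unicodedata.combining(label[i - 1]) == VIRAMA_CCC


def breaks_cursive_connection(label, i):
    """True if position i sits between {L,D} T* and T* {R,D}."""
    j = i - 1
    while j >= 0 and joining_type(label[j]) == "T":
        j -= 1  # skip transparent characters to the left
    if j < 0 or joining_type(label[j]) not in ("L", "D"):
        return False
    k = i + 1
    while k < len(label) and joining_type(label[k]) == "T":
        k += 1  # skip transparent characters to the right
    return k < len(label) and joining_type(label[k]) in ("R", "D")


def zwnj_allowed(label, i):
    """Rule set for U+200C at position i: default False, two exceptions."""
    return after_virama(label, i) or breaks_cursive_connection(label, i)


def zwj_allowed(label, i):
    """Rule set for U+200D at position i: default False, one exception."""
    return after_virama(label, i)
```

Both functions default to False and return True only for the contexts
named in the rule sets, which matches the reversed defaults Ken
describes below.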

> Note that I have also turned the default values around for
> these rule sets. With the more cleanly defined contexts,
> it is much better to default these to False, and then
> only return True for the precisely defined exceptional contexts
> where they may occur.

Noted.

> I think if the rule sets are stated this way, there is a vastly
> greater chance that implementers will implement these context
> rules in compatible and interoperable ways. The code will
> also be immensely simpler (and less prone to irrelevant
> bugs) than as the rule sets are stated in the current draft.

     Patrik


