Rules for ZWJ and ZWNJ (Re: Moving Right Along on the Inclusions Table...)

Harald Alvestrand harald at alvestrand.no
Thu Dec 21 09:44:04 CET 2006


Mark Davis wrote:
> I've linked to it on several occasions: 
> http://www.unicode.org/review/pr-96.html
>
> While it is not completely settled -- it is out for review now and you 
> can see the questions we are asking -- I don't see a problem with it 
> progressing to the point where we can use it by the time the other 
> work we are doing is ready.
Well, one problem with it is that it requires a certain amount of 
chasing down references that are obscure to the casual reader.... I'm 
paraphrasing the rule below, to see if I understand it:

I read it as saying that ZWJ can occur only after a virama (with a few 
more conditions), where the viramas are the characters with canonical 
combining class (ccc) 9 in the Unicode property tables:

094D;DEVANAGARI SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
09CD;BENGALI SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
0A4D;GURMUKHI SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
0ACD;GUJARATI SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
0B4D;ORIYA SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
0BCD;TAMIL SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
0C4D;TELUGU SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
0CCD;KANNADA SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
0D4D;MALAYALAM SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
0DCA;SINHALA SIGN AL-LAKUNA;Mn;9;NSM;;;;;N;;;;;
0E3A;THAI CHARACTER PHINTHU;Mn;9;NSM;;;;;N;THAI VOWEL SIGN PHINTHU;;;;
0F84;TIBETAN MARK HALANTA;Mn;9;NSM;;;;;N;TIBETAN VIRAMA;;;;
1039;MYANMAR SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
1714;TAGALOG SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
1734;HANUNOO SIGN PAMUDPOD;Mn;9;NSM;;;;;N;;;;;
17D2;KHMER SIGN COENG;Mn;9;NSM;;;;;N;;;;;
A806;SYLOTI NAGRI SIGN HASANTA;Mn;9;NSM;;;;;N;;;;;
10A3F;KHAROSHTHI VIRAMA;Mn;9;NSM;;;;;N;;;;;

(I may have missed some, since I found these by "grep". Are there viramas 
that don't fall into class Mn?)
This is a bit more than Devanagari, but they may all be scripts where 
this is guaranteed to cause no harm (as seen from an IDNA viewpoint). 
Can anyone verify?
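
As a cross-check on the grep, the same list can be derived from the combining class itself rather than from character names. The following is only a sketch: it relies on Python's standard unicodedata module, so the result reflects whatever Unicode version that Python build ships with (which will not be the 2006 tables), and the function names are made up for illustration.

```python
import sys
import unicodedata


def ccc9_characters():
    """Return every code point whose canonical combining class is 9."""
    return [
        cp
        for cp in range(sys.maxunicode + 1)
        if unicodedata.combining(chr(cp)) == 9
    ]


def non_mn(code_points):
    """Return the subset whose General Category is something other than Mn."""
    return [cp for cp in code_points if unicodedata.category(chr(cp)) != "Mn"]


if __name__ == "__main__":
    viramas = ccc9_characters()
    for cp in viramas:
        ch = chr(cp)
        name = unicodedata.name(ch, "<unnamed>")
        print(f"{cp:04X};{name};{unicodedata.category(ch)}")
    # Any entries here would answer the "not in class Mn?" question
    # for the Unicode version bundled with this Python.
    print("not Mn:", [f"{cp:04X}" for cp in non_mn(viramas)])
```

Selecting on ccc = 9 instead of on names containing "VIRAMA" also picks up characters such as AL-LAKUNA, PHINTHU and COENG without needing to know what each script calls them.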

A ZWNJ can occur in the same kind of position too (same regexp).
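
A minimal sketch of the paraphrased rule, assuming it reduces to "every ZWJ or ZWNJ must be immediately preceded by a character with ccc 9". It deliberately ignores the "few more conditions" in PR-96, and joiners_follow_virama is a hypothetical name, not something from the proposal.

```python
import unicodedata

ZWJ = "\u200d"
ZWNJ = "\u200c"


def joiners_follow_virama(label: str) -> bool:
    """True if every ZWJ/ZWNJ in label directly follows a ccc=9 character."""
    for i, ch in enumerate(label):
        if ch in (ZWJ, ZWNJ):
            # A joiner at the start of the label has nothing before it.
            if i == 0 or unicodedata.combining(label[i - 1]) != 9:
                return False
    return True
```

For example, KA + VIRAMA + ZWJ + SSA ("\u0915\u094d\u200d\u0937") passes, while the same string without the virama does not.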

("harm from an IDNA viewpoint" is probably confusability... the question 
asked in the -96 file is:

In particular, in which scripts of South East Asia are ZWJ and ZWNJ not 
necessary for visual distinctions?

while the classical IDNA question would be:

In particular, are there scripts of South East Asia where ZWJ and ZWNJ 
can occur after a virama without causing a visual distinction?)


A ZWNJ may also occur between a Right-joining and a Left-joining 
character (either of those may be Dual-joining, too), with possible 
embedded Transparent characters.

This property is from ArabicShaping.txt, which says:
# - Those that not explicitly listed that are of General Category Mn, Me, or Cf
#   have joining type T.
None are explicitly listed, so the general categories have to be used 
for finding transparent characters. However, all the possible occurrences 
of Right-joining and Left-joining characters are in ArabicShaping.txt, 
so this rule is then limited to the Arabic script. (right?)
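
The ZWNJ rule for joining characters could be checked roughly as follows. This is a sketch under several assumptions: the logical order is taken to be a Left- or Dual-joining character before the ZWNJ and a Right- or Dual-joining character after it (my reading, the text above does not spell out the order); SAMPLE_JOINING_TYPE is a tiny hand-written excerpt, where real code would load ArabicShaping.txt; and the joiners themselves are entered explicitly so that the Cf fallback does not treat them as transparent.

```python
import unicodedata

ZWNJ = "\u200c"

# Hypothetical excerpt; a real implementation would read ArabicShaping.txt.
SAMPLE_JOINING_TYPE = {
    "\u0627": "R",  # ALEF
    "\u062f": "R",  # DAL
    "\u0628": "D",  # BEH
    "\u0644": "D",  # LAM
    "\u0645": "D",  # MEEM
    "\u200c": "U",  # ZWNJ, kept out of the Cf -> T fallback
    "\u200d": "C",  # ZWJ, kept out of the Cf -> T fallback
}


def joining_type(ch, table=SAMPLE_JOINING_TYPE):
    """Joining type from the table, else T for Mn/Me/Cf, else U."""
    if ch in table:
        return table[ch]
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return "T"
    return "U"


def zwnj_between_joiners(text, i, table=SAMPLE_JOINING_TYPE):
    """True if text[i] is a ZWNJ with (L|D) T* before and T* (R|D) after."""
    if text[i] != ZWNJ:
        return False
    j = i - 1
    while j >= 0 and joining_type(text[j], table) == "T":
        j -= 1  # skip transparent characters before the ZWNJ
    k = i + 1
    while k < len(text) and joining_type(text[k], table) == "T":
        k += 1  # skip transparent characters after the ZWNJ
    if j < 0 or k >= len(text):
        return False
    return (joining_type(text[j], table) in ("L", "D")
            and joining_type(text[k], table) in ("R", "D"))
```

With this table, BEH + ZWNJ + ALEF is accepted, and so is BEH + FATHA + ZWNJ + LAM, since FATHA is Mn and therefore transparent; ALEF + ZWNJ + BEH is rejected because ALEF does not join to the following character.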

So we have 69 right-joining and 170 dual-joining characters in 
ArabicShaping.txt - I'm assuming a stability guarantee that no 
characters outside of Arabic will be added to this file in the future.
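
Counts like these can be reproduced by tallying the third field of each data line. The sketch below assumes the semicolon-separated layout "code point; name; joining type; joining group" with "#" comments; the SAMPLE lines are an illustrative excerpt (names and fourth-field values are approximate), not the real file, so the numbers it yields say nothing about the actual totals.

```python
from collections import Counter


def parse_arabic_shaping(lines):
    """Map code point -> joining type from ArabicShaping.txt-style lines."""
    types = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 3:
            continue  # not a data line in the assumed layout
        types[int(fields[0], 16)] = fields[2]
    return types


def count_joining_types(types):
    """Count how many code points carry each joining type."""
    return Counter(types.values())


# Illustrative excerpt only, not the full file.
SAMPLE = """\
# code; schematic name; joining type; joining group
0621; HAMZA; U; No_Joining_Group
0627; ALEF; R; ALEF
0628; BEH; D; BEH
0640; TATWEEL; C; No_Joining_Group
""".splitlines()
```

Passing the lines of a real ArabicShaping.txt to parse_arabic_shaping and then to count_joining_types would give the R and D totals for that version of the file.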

>
> Mark
>
> On 12/20/06, *Harald Alvestrand * <harald at alvestrand.no 
> <mailto:harald at alvestrand.no>> wrote:
>
>     Mark Davis wrote:
>     > Those are all reasonable changes.
>     >
>     >     * We should also add the Joiner/NonJoiner. They would
>     however, as
>     >       discussed, be restricted to very specific contexts by
>     additional
>     >       clauses (like the current bidi restrictions).
>     >
>     Mark, can you take a stab at writing down those rules?
>     I have seen you referred to this as a "solved problem" a couple of
>     times, but I haven't seen a specific algorithm proposed yet.
>
>


