Rules for ZWJ and ZWNJ (Re: Moving Right Along on the Inclusions Table...)
Harald Alvestrand
harald at alvestrand.no
Thu Dec 21 09:44:04 CET 2006
Mark Davis wrote:
> I've linked to it on several occasions:
> http://www.unicode.org/review/pr-96.html
>
> While it is not completely settled -- it is out for review now and you
> can see the questions we are asking -- I don't see a problem with it
> progressing to the point where we can use it by the time the other
> work we are doing is ready.
Well, one problem with it is that it requires a certain amount of
chasing down references that are obscure to the casual reader.... I'm
paraphrasing the rule below, to see if I understand it:
I read it as saying that ZWJ can occur only after a virama (with a few
more conditions), where the viramas are the combining marks with canonical
combining class (ccc) 9 in the Unicode property tables:
094D;DEVANAGARI SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
09CD;BENGALI SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
0A4D;GURMUKHI SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
0ACD;GUJARATI SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
0B4D;ORIYA SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
0BCD;TAMIL SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
0C4D;TELUGU SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
0CCD;KANNADA SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
0D4D;MALAYALAM SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
0DCA;SINHALA SIGN AL-LAKUNA;Mn;9;NSM;;;;;N;;;;;
0E3A;THAI CHARACTER PHINTHU;Mn;9;NSM;;;;;N;THAI VOWEL SIGN PHINTHU;;;;
0F84;TIBETAN MARK HALANTA;Mn;9;NSM;;;;;N;TIBETAN VIRAMA;;;;
1039;MYANMAR SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
1714;TAGALOG SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
1734;HANUNOO SIGN PAMUDPOD;Mn;9;NSM;;;;;N;;;;;
17D2;KHMER SIGN COENG;Mn;9;NSM;;;;;N;;;;;
A806;SYLOTI NAGRI SIGN HASANTA;Mn;9;NSM;;;;;N;;;;;
10A3F;KHAROSHTHI VIRAMA;Mn;9;NSM;;;;;N;;;;;
(I may have missed some, since I found these by "grep". Are there virama
that don't fall into class Mn?)
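A less grep-dependent way to build this list is to ask a Unicode library for every code point with ccc 9 and group the results by general category. The sketch below uses Python's standard `unicodedata` module; the function name is made up for illustration, and the result reflects whatever Unicode version the Python build ships with, so it will probably list more characters than the 2006 data above.

```python
import sys
import unicodedata
from collections import defaultdict


def ccc9_by_category():
    """Group all code points with canonical combining class 9 by general category.

    Returns a dict such as {"Mn": [0x094D, ...], ...}. The contents depend on
    the Unicode version bundled with the running Python interpreter.
    """
    groups = defaultdict(list)
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.combining(ch) == 9:
            groups[unicodedata.category(ch)].append(cp)
    return dict(groups)


def describe(groups):
    """Yield 'XXXX;NAME;Gc' lines, loosely mimicking the listing above."""
    for category in sorted(groups):
        for cp in groups[category]:
            name = unicodedata.name(chr(cp), "<unnamed>")
            yield f"{cp:04X};{name};{category}"
```

Printing `"\n".join(describe(ccc9_by_category()))` gives a list comparable to the grep output, and any key other than `"Mn"` in the returned dict would answer the question about viramas outside class Mn for that Unicode version.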
This is a bit more than Devanagari, but they may all be scripts where
this is guaranteed to cause no harm (as seen from an IDNA viewpoint).
Can anyone verify?
A ZWNJ can occur in the same kind of position too (same regexp).
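As a rough illustration of the virama rule as paraphrased here, the check below accepts a string only if every ZWJ or ZWNJ immediately follows a character with ccc 9. This is a sketch of my reading, not the PR-96 regexp itself: it ignores the "few more conditions", it does not cover the separate Arabic-joining case for ZWNJ, and the function name is hypothetical.

```python
import unicodedata

ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"   # ZERO WIDTH JOINER


def joiners_follow_virama(label):
    """Return True if every ZWJ/ZWNJ in label directly follows a ccc=9 character.

    Simplified reading of the rule: the additional conditions in PR-96 are
    deliberately not modelled here.
    """
    for i, ch in enumerate(label):
        if ch in (ZWJ, ZWNJ):
            # A joiner at the start of the label has nothing before it.
            if i == 0 or unicodedata.combining(label[i - 1]) != 9:
                return False
    return True
```

For example, DEVANAGARI KA + VIRAMA + ZWJ + SSA (`"\u0915\u094D\u200D\u0937"`) passes, while the same string without the virama does not.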
("harm from an IDNA viewpoint" is probably confusability... the question
asked in the -96 file is:
In particular, in which scripts of South East Asia are ZWJ and ZWNJ not
necessary for visual distinctions?
while the classical IDNA question would be:
In particular, are there scripts of South East Asia where ZWJ and ZWNJ
can occur after a virama without causing a visual distinction?)
A ZWNJ may also occur between a Right-joining and a Left-joining
character (either of those may be Dual-joining, too), with possible
embedded Transparent characters.
This property is from ArabicShaping.txt, which says:
# - Those that not explicitly listed that are of General Category Mn, Me, or Cf
#   have joining type T.
None are explicitly listed, so the general categories have to be used
for finding transparent characters. However, all the possible occurrences
of Right-joining and Left-joining characters are in ArabicShaping.txt,
so this rule is then limited to the Arabic script. (right?)
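To make the joining-type rule concrete, here is a minimal sketch that parses ArabicShaping.txt-style lines and tests whether a ZWNJ sits between two joining characters, skipping Transparent ones. Several things are assumptions on my part: the function names are invented, the transparent fallback follows the comment quoted above (unlisted Mn, Me, or Cf), and the direction is read in logical order as "preceding character is Left- or Dual-joining, following character is Right- or Dual-joining", since that is the only arrangement where a join would otherwise happen.

```python
import unicodedata

ZWNJ = "\u200C"


def parse_arabic_shaping(lines):
    """Map code point -> joining type from 'code; name; type; group' lines."""
    table = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        table[int(fields[0], 16)] = fields[2]
    return table


def joining_type(ch, table):
    """Listed type, else T for unlisted Mn/Me/Cf, else U (non-joining)."""
    if ord(ch) in table:
        return table[ord(ch)]
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return "T"
    return "U"


def zwnj_between_joiners(text, i, table):
    """True if the ZWNJ at index i matches (L|D) T* ZWNJ T* (R|D)."""
    if text[i] != ZWNJ:
        raise ValueError("text[i] is not ZWNJ")
    before = i - 1
    while before >= 0 and joining_type(text[before], table) == "T":
        before -= 1
    after = i + 1
    while after < len(text) and joining_type(text[after], table) == "T":
        after += 1
    if before < 0 or after >= len(text):
        return False
    return (joining_type(text[before], table) in ("L", "D")
            and joining_type(text[after], table) in ("R", "D"))
```

With a table built from the real file, `zwnj_between_joiners(label, label.index(ZWNJ), table)` would test the first ZWNJ in a label; a full check would loop over every ZWNJ.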
So we have 69 right-joining and 170 dual-joining characters in
ArabicShaping.txt - I'm assuming a stability guarantee that no
characters outside of Arabic will be added to this file in the future.
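The tally can be reproduced mechanically rather than by hand. The sketch below counts joining types in ArabicShaping.txt-style lines; the function name and the local file path in the usage note are assumptions, and the numbers will differ from the ones above for any later version of the file.

```python
from collections import Counter


def count_joining_types(lines):
    """Count entries per joining type in 'code; name; type; group' lines."""
    counts = Counter()
    for line in lines:
        data = line.split("#", 1)[0].strip()  # ignore comments and blanks
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        counts[fields[2]] += 1
    return counts
```

Given a local copy of the file (the path here is hypothetical), `count_joining_types(open("ArabicShaping.txt", encoding="utf-8"))` returns a `Counter` whose `"R"` and `"D"` entries correspond to the right-joining and dual-joining totals.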
>
> Mark
>
> On 12/20/06, Harald Alvestrand <harald at alvestrand.no> wrote:
>
> Mark Davis wrote:
> > Those are all reasonable changes.
> >
> > * We should also add the Joiner/NonJoiner. They would however, as
> > discussed, be restricted to very specific contexts by additional
> > clauses (like the current bidi restrictions).
> >
> Mark, can you take a stab at writing down those rules?
> I have seen you refer to this as a "solved problem" a couple of
> times, but I haven't seen a specific algorithm proposed yet.
>
>