PR-96, ZWNJ and Arabic, again....

Harald Alvestrand hta at google.com
Fri Dec 22 09:33:57 CET 2006


Sigh. The more times I go over this, the less I understand.

The PR-96 text says:

   1. Breaking a cursive connection. That is, in the context based on
      the Arabic Shaping property, consisting of:
          * A Right-Joining character, followed by zero or more
            Transparent characters, followed by a ZWNJ, followed by zero
            or more Transparent characters, followed by a Left-Joining
            character
          * As a regular expression:

            /$R $T* ZWNJ $T* $L/
            where:
             
                o $T = [:Joining_Type=Transparent:]
                o $R = [[:Joining_Type=Dual_Joining:][:Joining_Type=Right_Joining:]]
                o $L = [[:Joining_Type=Dual_Joining:][:Joining_Type=Left_Joining:]]
                   
          * Example: Farsi <Noon, Alef, Meem, Heh, Alef, Farsi Yeh>.
            Without a ZWNJ, it translates to "names"; with a ZWNJ
            between Heh and Alef, it means "a letter".
             

Straightforward?

Not quite. Of those characters, Alef is RIGHT-joining; all the others 
are dual-joining.

So the pattern $R $T* ZWNJ $T* $L will NOT match the sequence "Heh ZWNJ 
Alef": Heh (dual-joining) satisfies $R, but Alef (right-joining) is not 
in $L.
Either the example string is given in visual order, the regexp is 
intended to be read in visual order, something is wrong, or I'm 
horribly confused.
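
To make the mismatch concrete, here is a minimal sketch in Python. It 
assumes the example is stored in logical order, hard-codes the joining 
types for just the characters in the example (Python's unicodedata 
module does not expose Joining_Type), and uses "Z" as a made-up marker 
for ZWNJ. The names JOINING_TYPE, joining_classes, PR96_PATTERN and 
SWAPPED_PATTERN are illustrative only, not part of PR-96.

```python
import re

# Joining types for the characters in the example only (an illustrative
# subset, hard-coded; not a complete lookup table).
JOINING_TYPE = {
    "\u0646": "D",  # ARABIC LETTER NOON, dual-joining
    "\u0627": "R",  # ARABIC LETTER ALEF, right-joining
    "\u0645": "D",  # ARABIC LETTER MEEM, dual-joining
    "\u0647": "D",  # ARABIC LETTER HEH, dual-joining
    "\u06CC": "D",  # ARABIC LETTER FARSI YEH, dual-joining
    "\u200C": "Z",  # ZWNJ; "Z" is a hypothetical marker, not a joining type
}


def joining_classes(text):
    """Map each character to its one-letter class from JOINING_TYPE."""
    return "".join(JOINING_TYPE[ch] for ch in text)


# The pattern as written: $R = [D R], $T = T, $L = [D L].
PR96_PATTERN = re.compile(r"[DR]T*ZT*[DL]")

# The same pattern with $R and $L swapped, as one way to model the
# "read it in visual order" interpretation (an assumption on my part).
SWAPPED_PATTERN = re.compile(r"[DL]T*ZT*[DR]")

# <Noon, Alef, Meem, Heh, ZWNJ, Alef, Farsi Yeh> in logical order.
word = "\u0646\u0627\u0645\u0647\u200C\u0627\u06CC"
classes = joining_classes(word)

print(classes)
print("as written:", bool(PR96_PATTERN.search(classes)))
print("swapped:   ", bool(SWAPPED_PATTERN.search(classes)))
```

Under these assumptions the pattern as written finds no match around 
the ZWNJ, while the swapped version does, which is consistent with the 
visual-order reading but does not by itself show which reading was 
intended.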

Help?




