confusing notation in the ZERO WIDTH NON-JOINER contextual rule

Mon Aug 6 04:31:27 CEST 2012

Hi,

RFC 5892 contains the following rule about the contextual validity of U+200C:

> If RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*\u200C
>         (Joining_Type:T)*(Joining_Type:{R,D})) Then True;

By intuition, I understand that "\u200C" within the regex means the code
point in question. So, a feasible interpretation would be:

(*) The code point in question MUST occur between Joining_Type:{L,D}
and Joining_Type:{R,D}, where any number of occurrences of
Joining_Type:T MAY be in between.

On the other hand, the statement literally defines just a regex that
should match somewhere in the string (with no reference to "cp" as in
other rules), so that the rule would already be satisfied if any single
U+200C fulfills the requirement.

The literal interpretation sounds stupid, but I found both variants
within IDNA2008 implementations.

For instance, consider the Perl module Net::IDN::UTS46 on CPAN. Here,
it's taken literally and hence the sequence

  U+0628 U+200C U+0627 U+200C U+0627

is considered to be valid, although the second U+200C is preceded by
U+0627, which is Joining_Type:R (neither L nor D), and thus doesn't meet
the requirement (*).
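
To make the difference concrete, here is a minimal Python sketch of both
readings, applied to the sequence above. It is only an illustration under
a few assumptions: the Joining_Type table is hand-written and covers just
the code points in this example, the function names are made up, and the
separate Virama clause of the RFC 5892 rule is ignored. It is not the code
of either implementation mentioned in this mail.

```python
import re

ZWNJ = "\u200C"

# Assumption: tiny hand-written Joining_Type table, only for this example.
# A real implementation would load the full Unicode joining-type data.
JOINING_TYPE = {
    "\u0628": "D",  # ARABIC LETTER BEH, Dual_Joining
    "\u0627": "R",  # ARABIC LETTER ALEF, Right_Joining
}


def encode(label):
    """Map each code point to one letter: 'Z' for U+200C, otherwise its
    Joining_Type (unknown code points are treated as 'U', Non_Joining)."""
    return "".join(
        "Z" if ch == ZWNJ else JOINING_TYPE.get(ch, "U") for ch in label
    )


def matches_literal(label):
    """Literal reading: the regex only has to match somewhere in the label."""
    return re.search(r"[LD]T*ZT*[RD]", encode(label)) is not None


def matches_per_code_point(label):
    """Reading (*): every U+200C must itself sit in a matching context."""
    enc = encode(label)
    for i, letter in enumerate(enc):
        if letter != "Z":
            continue
        before_ok = re.search(r"[LD]T*\Z", enc[:i]) is not None
        after_ok = re.match(r"T*[RD]", enc[i + 1:]) is not None
        if not (before_ok and after_ok):
            return False
    return True


label = "\u0628\u200C\u0627\u200C\u0627"
print(matches_literal(label))         # True
print(matches_per_code_point(label))  # False
```

The first U+200C (between U+0628 and U+0627) satisfies the regex, so the
literal reading accepts the whole string; the second one fails the
per-code-point check because what precedes it is Joining_Type:R.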

On the other hand, the (probably more reliable) implementation idnkit-2
from the Japan Registry reports a CONTEXTJ rule violation for the same
string. Now, who is right?

regards, Sebastian