confusing notation in the ZERO WIDTH NON-JOINER contextual rule
debug at test1.org
Mon Aug 6 04:31:27 CEST 2012
Hi,
RFC5892 contains the following rule about the contextual validity of U+200C:
> If RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*\u200C
> (Joining_Type:T)*(Joining_Type:{R,D})) Then True;
Intuitively, I understand "\u200C" within the regex to mean the code
point in question. So a feasible interpretation would be:
(*) The code point MUST occur between Joining_Type:{L,D} and
Joining_Type:{R,D}, where arbitrary occurrences of Joining_Type:T MAY be
in between.
On the other hand, the statement literally defines just a regex that
should match the string somewhere (with no reference to "cp" as in other
rules), so the rule would already be satisfied if any single U+200C in
the string fulfills the requirement.
The literal interpretation sounds stupid, but I found both variants
within IDNA2008 implementations.
For instance, consider the Perl module Net::IDN::UTS46 on CPAN. Here,
it's taken literally and hence the sequence
U+0628 U+200C U+0627 U+200C U+0627
is considered to be valid, although the second U+200C is preceded by
U+0627, which is Joining_Type:R rather than L or D, and thus doesn't
meet the requirement (*).
On the other hand, the (probably more reliable) implementation idnkit-2
from the Japan Registry reports a CONTEXTJ rule violation for the same
string. Now, who is right?
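
To make the two readings concrete, here is a minimal Python sketch of
both. It is only an illustration, not code taken from either
implementation: the Joining_Type table is filled in by hand for the two
letters used above, the function names are made up, and the Virama
branch of the RFC 5892 rule is left out.

```python
import re

# Hypothetical, hand-filled Joining_Type table covering only the code points
# used in this mail; a real implementation would derive it from the Unicode
# data files instead.
JOINING_TYPE = {
    "\u0628": "D",  # ARABIC LETTER BEH, Dual_Joining
    "\u0627": "R",  # ARABIC LETTER ALEF, Right_Joining
}
ZWNJ = "\u200c"


def type_string(label):
    """Map each code point to one letter: its Joining_Type, 'Z' for U+200C,
    and 'U' (non-joining) for anything missing from the table."""
    return "".join("Z" if ch == ZWNJ else JOINING_TYPE.get(ch, "U") for ch in label)


def valid_literal(label):
    """Literal reading: the regex only has to match somewhere in the label.
    (Only meaningful for labels that actually contain U+200C.)"""
    return re.search(r"[LD]T*ZT*[RD]", type_string(label)) is not None


def valid_per_code_point(label):
    """Reading (*): every U+200C must sit in the required context."""
    types = type_string(label)
    return all(
        re.search(r"[LD]T*$", types[:i]) and re.match(r"T*[RD]", types[i + 1:])
        for i, t in enumerate(types)
        if t == "Z"
    )


label = "\u0628\u200c\u0627\u200c\u0627"  # U+0628 U+200C U+0627 U+200C U+0627
print(valid_literal(label))         # True: the first U+200C satisfies the regex
print(valid_per_code_point(label))  # False: the second U+200C follows an R
```

The two functions disagree on exactly the sequence above, which is the
same disagreement I see between the two implementations.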
regards, Sebastian