confusing notation in the ZERO WIDTH NON-JOINER contextual rule

Paul Hoffman phoffman at imc.org
Wed Aug 8 17:14:54 CEST 2012


It would be good to hear from the document authors on this one. If an erratum is needed, filing it sooner rather than later would be good.

--Paul Hoffman

On Aug 5, 2012, at 7:31 PM, debug at test1.org wrote:

> Hi,
> 
> RFC5892 contains the following rule about the contextual validity of U+200C:
> 
>> If RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*\u200C
>>        (Joining_Type:T)*(Joining_Type:{R,D})) Then True;
> 
> By intuition, I understand that "\u200C" within the regex means the code
> point in question. So, a feasible interpretation would be:
> 
> (*) The code point MUST occur between Joining_Type:{L,D} and
> Joining_Type:{R,D}, where any number of Joining_Type:T code points MAY
> be in between.
> 
> On the other hand, the statement literally defines just a regex that
> should match the string somewhere (with no reference to "cp" as in other
> rules), so the rule would already be satisfied if any single U+200C in
> the string fulfills the requirement.
> 
> The literal interpretation sounds stupid, but I found both variants
> within IDNA2008 implementations.
> 
> For instance, consider the Perl module Net::IDN::UTS46 on CPAN. Here,
> it's taken literally and hence the sequence
> 
>  U+0628 U+200C U+0627 U+200C U+0627
> 
> is considered to be valid, although U+0627 is Joining_Type:R and thus
> the second U+200C doesn't meet the requirement (*).
> 
> On the other hand, the (probably more reliable) implementation idnkit-2
> from the Japan Registry reports a CONTEXTJ rule violation for the same
> string. Now, who is right?
> 
> regards, Sebastian
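
To make the difference between the two readings concrete, here is a minimal Python sketch. It is not taken from Net::IDN::UTS46 or idnkit-2, and it covers only the regex part of the U+200C rule. The table JOINING_TYPE and all function names are hypothetical, and the table is filled in by hand for the two Arabic letters in the example; a real implementation would load Joining_Type from the Unicode Character Database. Both functions assume the label contains at least one U+200C.

```python
import re

ZWNJ = "\u200C"

# Hypothetical, hand-filled excerpt of Joining_Type values (assumption:
# a real implementation would read these from the Unicode data files).
JOINING_TYPE = {
    "\u0628": "D",  # ARABIC LETTER BEH, Dual_Joining
    "\u0627": "R",  # ARABIC LETTER ALEF, Right_Joining
}


def type_string(label):
    """Map each code point to one letter: Z for U+200C, otherwise its
    Joining_Type, with U (Non_Joining) for anything not in the table."""
    return "".join(
        "Z" if ch == ZWNJ else JOINING_TYPE.get(ch, "U") for ch in label
    )


def zwnj_ok_literal(label):
    """Literal reading: the regex only has to match somewhere in the label."""
    return re.search(r"[LD]T*ZT*[RD]", type_string(label)) is not None


def zwnj_ok_per_code_point(label):
    """Reading (*): every U+200C must itself sit in the required context."""
    types = type_string(label)
    for i, t in enumerate(types):
        if t != "Z":
            continue
        # Context before cp: {L,D} followed by any number of T, ending at cp.
        before_ok = re.search(r"[LD]T*$", types[:i]) is not None
        # Context after cp: any number of T followed by {R,D}.
        after_ok = re.match(r"T*[RD]", types[i + 1:]) is not None
        if not (before_ok and after_ok):
            return False
    return True


label = "\u0628\u200C\u0627\u200C\u0627"
print(zwnj_ok_literal(label), zwnj_ok_per_code_point(label))
```

For the sequence quoted above, the literal reading accepts the label because the first U+200C satisfies the regex, while reading (*) rejects it because the second U+200C is preceded by a Joining_Type:R code point.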


