confusing notation in the ZERO WIDTH NON-JOINER contextual rule

Simon Josefsson simon at josefsson.org
Wed Aug 8 22:38:51 CEST 2012


This issue has been discussed before; I brought it up when writing my
IDNA2008 implementation.  One of the authors clarified the intention,
see:

http://thread.gmane.org/gmane.ietf.idnabis/6980

Alas, no erratum was filed at the time.  I believe filing an erratum is
called for, since 1) the text in the document was underspecified from
the beginning and 2) we now have differing implementations out there.

/Simon

Paul Hoffman <phoffman at imc.org> writes:

> It would be good to hear from the document authors on this one. If an
> erratum is needed, filing sooner rather than later would be good.
>
> --Paul Hoffman
>
> On Aug 5, 2012, at 7:31 PM, debug at test1.org wrote:
>
>> Hi,
>> 
>> RFC5892 contains the following rule about the contextual validity of U+200C:
>> 
>>> If RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*\u200C
>>>        (Joining_Type:T)*(Joining_Type:{R,D})) Then True;
>> 
>> By intuition, I understand that "\u200C" within the regex means the code
>> point in question. So, a feasible interpretation would be:
>> 
>> (*) The code point MUST occur between Joining_Type:{L,D} and
>> Joining_Type:{R,D}, where any number of Joining_Type:T code points
>> MAY occur in between.
>> 
>> On the other hand, the statement literally defines just a regex that
>> should match the string somewhere (with no reference to "cp" as in other
>> rules), so that the rule would already be satisfied if any single
>> U+200C in the string fulfills the requirement.
>> 
>> The literal interpretation sounds stupid, but I found both variants
>> within IDNA2008 implementations.
>> 
>> For instance, consider the Perl module Net::IDN::UTS46 on CPAN. Here,
>> it's taken literally and hence the sequence
>> 
>>  U+0628 U+200C U+0627 U+200C U+0627
>> 
>> is considered to be valid, although U+0627 is Joining_Type:R and thus
>> the second U+200C doesn't meet the requirement (*).
>> 
>> On the other hand, the (probably more reliable) implementation idnkit-2
>> from the Japan Registry reports a CONTEXTJ rule violation for the same
>> string. Now, who is right?
>> 
>> regards, Sebastian
>> _______________________________________________
>> Idna-update mailing list
>> Idna-update at alvestrand.no
>> http://www.alvestrand.no/mailman/listinfo/idna-update
>> 
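
For illustration, the two readings described in the quoted message can be
compared with a minimal Python sketch. This is only a sketch under stated
assumptions, not the code of Net::IDN::UTS46 or idnkit-2: the function
names and the tiny hand-written Joining_Type table are hypothetical and
cover just the code points from the example (U+0628 as D, U+0627 as R),
and the Virama branch of the RFC 5892 rule is left out.

```python
import re

# Hypothetical, hand-written Joining_Type table covering only the code
# points used in the example. A real implementation would derive this
# from the Unicode Character Database instead.
JOINING_TYPE = {
    0x0628: "D",  # ARABIC LETTER BEH, dual-joining
    0x0627: "R",  # ARABIC LETTER ALEF, right-joining
}

ZWNJ = 0x200C


def joining_classes(label):
    """Return one letter per code point: its Joining_Type from the table,
    'Z' for U+200C itself, and 'U' (non-joining) for anything unknown."""
    return "".join(
        "Z" if ord(ch) == ZWNJ else JOINING_TYPE.get(ord(ch), "U")
        for ch in label
    )


def zwnj_ok_literal(label):
    """Literal reading: the regex only has to match somewhere in the label,
    so a single well-placed U+200C is enough."""
    return re.search(r"[LD]T*ZT*[RD]", joining_classes(label)) is not None


def zwnj_ok_per_code_point(label):
    """Reading (*): every U+200C must have the required context on both sides."""
    classes = joining_classes(label)
    for i, cls in enumerate(classes):
        if cls != "Z":
            continue
        before_ok = re.search(r"[LD]T*$", classes[:i]) is not None
        after_ok = re.match(r"T*[RD]", classes[i + 1:]) is not None
        if not (before_ok and after_ok):
            return False
    return True


if __name__ == "__main__":
    label = "\u0628\u200c\u0627\u200c\u0627"  # U+0628 U+200C U+0627 U+200C U+0627
    print("literal reading:       ", zwnj_ok_literal(label))
    print("per-code-point reading:", zwnj_ok_per_code_point(label))
```

With the sequence from the quoted message, the literal check accepts the
label because the first U+200C satisfies the regex, while the
per-code-point check rejects it because the second U+200C is preceded by
a Joining_Type:R character.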

