U+200C rule

Paul Hoffman phoffman at imc.org
Sun Mar 20 15:26:05 CET 2011


On Mar 20, 2011, at 6:20 AM, Patrik Fältström wrote:

> On 20 mar 2011, at 11.53, Simon Josefsson wrote:
> 
>> Hi.  The rule for U+200C is:
>> 
>>     False;
>> 
>>     If Canonical_Combining_Class(Before(cp)) .eq.  Virama Then True;
>> 
>>     If RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*\u200C
>> 
>>        (Joining_Type:T)*(Joining_Type:{R,D})) Then True;
>> 
>> I could not find any precise definition of how to implement RegExpMatch.
>> 
>> For example, consider a label that contains two U+200C, where one of the
>> U+200C is used in the permitted way, and the other is not.
>> 
>> A regexp match on that string -- at least with regular expressions as
>> defined by POSIX, Emacs, Perl, etc, which are all slightly different --
>> would find the positive usage and permit the label.
>> 
>> Is this the intention?
> 
> No
> 
>> If not, what is the intended way to implement RegExpMatch?
> 
> The expression tries to say that around _each_ \u200C you need the following:
> 
>   One codepoint with either Joining_Type L or D
> 
>   Zero or more codepoints with Joining_Type T
> 
>   The \u200C
> 
>   Zero or more codepoints with Joining_Type T
> 
>   One codepoint with either Joining_Type R or D
> 
> The regexp does not take into account more than one \u200c in each string.

This may be an error in RFC 5892, but I am not sure. The rule given in RFC 5892 is:
=====
   Rule Set:

      False;

      If Canonical_Combining_Class(Before(cp)) .eq.  Virama Then True;

      If RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*\u200C

         (Joining_Type:T)*(Joining_Type:{R,D})) Then True;
=====

Maybe it should instead be:

=====
   Rule Set:

      False;

      If Canonical_Combining_Class(Before(cp)) .eq.  Virama Then True;

      For All Characters:

         If RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*\u200C

            (Joining_Type:T)*(Joining_Type:{R,D})) Then True;
      End For;
=====

Can someone verify whether or not that takes care of Simon's example?
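
One way to check is to evaluate the rule once per U+200C instead of once per label. Here is a minimal Python sketch of that reading, not a definitive implementation. Assumptions: JOINING_TYPE is a hypothetical three-entry stand-in for the real Unicode Joining_Type data (Python's unicodedata module does not expose Joining_Type, so a real implementation would load the full table from the Unicode Character Database), and the function names are made up for illustration; none of them come from RFC 5892.

```python
import unicodedata

ZWNJ = "\u200c"
VIRAMA_CCC = 9  # Canonical_Combining_Class value for Virama

# HYPOTHETICAL, deliberately tiny Joining_Type table (illustration only).
JOINING_TYPE = {
    "\u0628": "D",  # ARABIC LETTER BEH, dual-joining
    "\u0627": "R",  # ARABIC LETTER ALEF, right-joining
    "\u064e": "T",  # ARABIC FATHA, transparent
}


def joining_type(ch):
    """Return the Joining_Type of ch; anything not in the toy table is 'U'."""
    return JOINING_TYPE.get(ch, "U")


def zwnj_allowed_at(label, i):
    """Apply the rule set to the single U+200C at index i of label."""
    # If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True;
    if i > 0 and unicodedata.combining(label[i - 1]) == VIRAMA_CCC:
        return True

    # Left context: skip any Joining_Type T, then require L or D.
    j = i - 1
    while j >= 0 and joining_type(label[j]) == "T":
        j -= 1
    if j < 0 or joining_type(label[j]) not in ("L", "D"):
        return False

    # Right context: skip any Joining_Type T, then require R or D.
    k = i + 1
    while k < len(label) and joining_type(label[k]) == "T":
        k += 1
    return k < len(label) and joining_type(label[k]) in ("R", "D")


def label_zwnj_ok(label):
    """True only if every U+200C in the label satisfies the rule."""
    return all(
        zwnj_allowed_at(label, i)
        for i, ch in enumerate(label)
        if ch == ZWNJ
    )


# Simon's example: one permitted U+200C and one that is not.
permitted = "\u0628\u200c\u0627"      # BEH, ZWNJ, ALEF
mixed = permitted + "a\u200cb"        # second ZWNJ sits between non-joining letters

print(label_zwnj_ok(permitted))
print(label_zwnj_ok(mixed))
```

Under these assumptions the label with one good and one bad U+200C is rejected, because the second U+200C fails its own check. Replacing all() with any() in label_zwnj_ok would reproduce the whole-label search that Simon describes, which accepts that label.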

--Paul Hoffman

