U+200C rule
Paul Hoffman
phoffman at imc.org
Sun Mar 20 15:26:05 CET 2011
On Mar 20, 2011, at 6:20 AM, Patrik Fältström wrote:
> On 20 mar 2011, at 11.53, Simon Josefsson wrote:
>
>> Hi. The rule for U+200C is:
>>
>> False;
>>
>> If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True;
>>
>> If RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*\u200C
>>
>> (Joining_Type:T)*(Joining_Type:{R,D})) Then True;
>>
>> I could not find any precise definition of how to implement RegExpMatch.
>>
>> For example, consider a label that contains two U+200C, where one of the
>> U+200C is used in the permitted way, and the other is not.
>>
>> A regexp match on that string -- at least with regular expressions as
>> defined by POSIX, Emacs, Perl, etc, which are all slightly different --
>> would find the positive usage and permit the label.
>>
>> Is this the intention?
>
> No
>
>> If not, what is the intended way to implement RegExpMatch?
>
> The expression tries to say that around _each_ \u200C you need the following:
>
>> One codepoint with either Joining_Type L or D
>>
>> Zero or more codepoints with Joining_Type T
>>
>> The \u200C
>>
>> Zero or more codepoints with Joining_Type T
>>
>> One codepoint with either Joining_Type R or D
>
> The regexp does not take into account more than one \u200c in each string.
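To make Simon's example concrete, here is a rough Python sketch of what an unanchored regexp search does with a label containing two U+200C. This is only an illustration, not text from RFC 5892: the three character sets are a tiny hand-picked sample of Joining_Type values (a real implementation would need the full Unicode joining-type data), and the names PATTERN and whole_label_search are made up for this message.

```python
import re

ZWNJ = "\u200c"

# Assumption: hand-picked sample characters, NOT the full Joining_Type data.
LEFT_OR_DUAL = "\u0628"         # ARABIC LETTER BEH (Joining_Type D)
TRANSPARENT = "\u064e"          # ARABIC FATHA (Joining_Type T)
RIGHT_OR_DUAL = "\u0627\u0628"  # ARABIC LETTER ALEF (R), BEH (D)

# The regexp from the rule, written with ordinary character classes.
PATTERN = re.compile(
    f"[{LEFT_OR_DUAL}][{TRANSPARENT}]*{ZWNJ}[{TRANSPARENT}]*[{RIGHT_OR_DUAL}]"
)

def whole_label_search(label):
    """Return True if the pattern matches anywhere in the label."""
    return PATTERN.search(label) is not None

# BEH ZWNJ ALEF ZWNJ BEH: the first ZWNJ fits the pattern, the second
# does not, because ALEF (Joining_Type R) precedes it.
mixed = "\u0628\u200c\u0627\u200c\u0628"
print(whole_label_search(mixed))
```

With this sample data the search succeeds on the first U+200C and reports a match for the whole label, even though the second U+200C does not satisfy the pattern. That is the behavior Simon describes.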
This may be an error in RFC 5892, but I am not sure. The rule given in RFC 5892 is:
=====
Rule Set:
False;
If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True;
If RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*\u200C
(Joining_Type:T)*(Joining_Type:{R,D})) Then True;
=====
Maybe it should instead be:
=====
Rule Set:
False;
If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True;
For All Characters:
If RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*\u200C
(Joining_Type:T)*(Joining_Type:{R,D})) Then True;
End For;
=====
Can someone verify whether or not that takes care of Simon's example?
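As a starting point for checking this, here is a minimal Python sketch of one possible reading of "For All Characters": the rule is applied separately at every U+200C, and the label passes only if each occurrence passes. It is not taken from RFC 5892. The JOINING_TYPE table is a tiny hand-picked sample and an assumption (a real implementation would load the full Unicode joining-type data), and the function names are hypothetical. The Virama test relies on unicodedata.combining() from the Python standard library, which returns canonical combining class 9 for Virama characters.

```python
import unicodedata

ZWNJ = "\u200c"
VIRAMA_CCC = 9  # Canonical_Combining_Class value for Virama

# Assumption: hand-picked sample, NOT the full Joining_Type data.
JOINING_TYPE = {
    "\u0628": "D",  # ARABIC LETTER BEH
    "\u0627": "R",  # ARABIC LETTER ALEF
    "\u064e": "T",  # ARABIC FATHA
}

def joining_type(ch):
    # Anything outside the sample is treated as non-joining ("U").
    return JOINING_TYPE.get(ch, "U")

def zwnj_allowed_at(label, i):
    """Apply the rule to the single U+200C at index i of label."""
    if i > 0 and unicodedata.combining(label[i - 1]) == VIRAMA_CCC:
        return True
    # Walk left over transparent characters to the joining character.
    j = i - 1
    while j >= 0 and joining_type(label[j]) == "T":
        j -= 1
    if j < 0 or joining_type(label[j]) not in ("L", "D"):
        return False
    # Walk right over transparent characters to the joining character.
    k = i + 1
    while k < len(label) and joining_type(label[k]) == "T":
        k += 1
    return k < len(label) and joining_type(label[k]) in ("R", "D")

def label_ok(label):
    """True only if every U+200C in the label satisfies the rule."""
    return all(
        zwnj_allowed_at(label, i)
        for i, ch in enumerate(label)
        if ch == ZWNJ
    )

print(label_ok("\u0628\u200c\u0627"))              # BEH ZWNJ ALEF
print(label_ok("\u0628\u200c\u0627\u200c\u0628"))  # Simon's mixed case
```

Under these assumptions the first label is accepted and the second is rejected, because its second U+200C follows ALEF, whose Joining_Type is R rather than L or D.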
--Paul Hoffman