U+200C rule

Patrik Fältström patrik at frobbit.se
Sun Mar 20 14:20:12 CET 2011


On 20 mar 2011, at 11.53, Simon Josefsson wrote:

> Hi.  The rule for U+200C is:
> 
>      False;
> 
>      If Canonical_Combining_Class(Before(cp)) .eq.  Virama Then True;
> 
>      If RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*\u200C
> 
>         (Joining_Type:T)*(Joining_Type:{R,D})) Then True;
> 
> I could not find any precise definition of how to implement RegExpMatch.
> 
> For example, consider a label that contains two U+200C, where one of the
> U+200C is used in the permitted way, and the other is not.
> 
> A regexp match on that string -- at least with regular expressions as
> defined by POSIX, Emacs, Perl, etc, which are all slightly different --
> would find the positive usage and permit the label.
> 
> Is this the intention?

No.

> If not, what is the intended way to implement RegExpMatch?

The expression tries to say that around _each_ \u200C you need the following:

> One codepoint with either Joining_Type L or D
> 
> Zero or more codepoints with Joining_Type T
> 
> The \u200C
> 
> Zero or more codepoints with Joining_Type T
> 
> One codepoint with either Joining_Type R or D

The regexp as written does not take into account more than one \u200C in a string, so the check has to be applied to each \u200C separately.
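
As an illustration of the per-occurrence reading, here is a minimal Python sketch. It is not taken from any specification text. The names zwnj_ok_at, label_zwnj_ok, joining_type and TOY_JOINING_TYPE are made up for this example, and the Joining_Type table is a three-entry toy. A real implementation would load Joining_Type from the Unicode Character Database. The Virama test uses unicodedata.combining(), which returns the canonical combining class (9 for Virama).

```python
import unicodedata

ZWNJ = "\u200C"
VIRAMA_CCC = 9  # Canonical_Combining_Class value that is named Virama

# Toy table, for illustration only (an assumption of this sketch).
# Real code needs the full Joining_Type data from the Unicode database.
TOY_JOINING_TYPE = {
    "\u0628": "D",  # ARABIC LETTER BEH, Dual_Joining
    "\u0627": "R",  # ARABIC LETTER ALEF, Right_Joining
    "\u064E": "T",  # ARABIC FATHA, Transparent
}


def joining_type(ch):
    """Hypothetical lookup: everything not in the toy table is Non_Joining."""
    return TOY_JOINING_TYPE.get(ch, "U")


def zwnj_ok_at(label, i, jt=joining_type):
    """Apply the U+200C rule to the single occurrence at index i."""
    # First rule: the code point directly before is a Virama.
    if i > 0 and unicodedata.combining(label[i - 1]) == VIRAMA_CCC:
        return True
    # Second rule, left side: skip any run of T, then require L or D.
    j = i - 1
    while j >= 0 and jt(label[j]) == "T":
        j -= 1
    if j < 0 or jt(label[j]) not in ("L", "D"):
        return False
    # Second rule, right side: skip any run of T, then require R or D.
    k = i + 1
    while k < len(label) and jt(label[k]) == "T":
        k += 1
    return k < len(label) and jt(label[k]) in ("R", "D")


def label_zwnj_ok(label):
    """True only if every U+200C in the label passes the rule."""
    return all(zwnj_ok_at(label, i)
               for i, ch in enumerate(label) if ch == ZWNJ)
```

With this reading, a label such as BEH, U+200C, BEH, U+200C is rejected: the first U+200C passes, but the second has nothing after it, so label_zwnj_ok() returns False instead of accepting the label on the strength of one good match.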

   Patrik




