U+200C rule

Vint Cerf vint at google.com
Sun Mar 20 15:30:30 CET 2011


Paul,

I think Patrik correctly interprets the rule as always reading "for
each character", but your way of stating it perhaps makes that more
explicit.
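
To make the "for each character" reading concrete, here is a minimal
sketch in Python. It is an illustration under stated assumptions, not
text from RFC 5892: the function names are made up for this example,
and the three-entry Joining_Type table is a hypothetical stand-in for
the real Unicode data (Python's unicodedata module does not expose
Joining_Type, so a real implementation would have to load it from the
Unicode Character Database). The Virama test relies on
unicodedata.combining(), which returns the canonical combining class;
class 9 is Virama.

```python
import unicodedata

ZWNJ = "\u200c"
VIRAMA_CCC = 9  # canonical combining class value named "Virama"

# Hypothetical, deliberately tiny Joining_Type table (illustration only).
# A real implementation would load the full property from the UCD.
SAMPLE_JOINING_TYPES = {
    "\u0628": "D",  # ARABIC LETTER BEH (dual-joining)
    "\u0627": "R",  # ARABIC LETTER ALEF (right-joining)
    "\u064e": "T",  # ARABIC FATHA (transparent)
}


def joining_type(ch):
    """Return the Joining_Type of ch, defaulting to 'U' (non-joining)."""
    return SAMPLE_JOINING_TYPES.get(ch, "U")


def zwnj_allowed_at(label, i):
    """Apply the rule to the single U+200C found at index i of label."""
    # If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True
    if i > 0 and unicodedata.combining(label[i - 1]) == VIRAMA_CCC:
        return True

    # Left side: skip any run of T, then require one L or D.
    j = i - 1
    while j >= 0 and joining_type(label[j]) == "T":
        j -= 1
    if j < 0 or joining_type(label[j]) not in ("L", "D"):
        return False

    # Right side: skip any run of T, then require one R or D.
    k = i + 1
    while k < len(label) and joining_type(label[k]) == "T":
        k += 1
    return k < len(label) and joining_type(label[k]) in ("R", "D")


def label_zwnj_ok(label):
    """True only if every U+200C in label satisfies the rule on its own."""
    return all(
        zwnj_allowed_at(label, i)
        for i, ch in enumerate(label)
        if ch == ZWNJ
    )
```

With this reading, Simon's example (one permitted U+200C plus one that
is not) is rejected, because the check runs once per occurrence rather
than once per label. A single unanchored regexp search over the whole
string would instead be satisfied by the first, valid occurrence.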

vint

On Sun, Mar 20, 2011 at 10:26 AM, Paul Hoffman <phoffman at imc.org> wrote:
> On Mar 20, 2011, at 6:20 AM, Patrik Fältström wrote:
>
>> On 20 mar 2011, at 11.53, Simon Josefsson wrote:
>>
>>> Hi.  The rule for U+200C is:
>>>
>>>     False;
>>>
>>>     If Canonical_Combining_Class(Before(cp)) .eq.  Virama Then True;
>>>
>>>     If RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*\u200C
>>>
>>>        (Joining_Type:T)*(Joining_Type:{R,D})) Then True;
>>>
>>> I could not find any precise definition of how to implement RegExpMatch.
>>>
>>> For example, consider a label that contains two U+200C, where one of the
>>> U+200C is used in the permitted way, and the other is not.
>>>
>>> A regexp match on that string -- at least with regular expressions as
>>> defined by POSIX, Emacs, Perl, etc, which are all slightly different --
>>> would find the positive usage and permit the label.
>>>
>>> Is this the intention?
>>
>> No
>>
>>> If not, what is the intended way to implement RegExpMatch?
>>
>> The expression tries to say that around _each_ \u200C you need the following:
>>
>>   One codepoint with either Joining_Type L or D
>>
>>   Zero or more codepoints with Joining_Type T
>>
>>   The \u200C
>>
>>   Zero or more codepoints with Joining_Type T
>>
>>   One codepoint with either Joining_Type R or D
>>
>> The regexp itself does not take into account more than one \u200C in a string.
>
> This is maybe an error in RFC 5892, but I am not sure. The rule given in RFC 5892 is:
> =====
>   Rule Set:
>
>      False;
>
>      If Canonical_Combining_Class(Before(cp)) .eq.  Virama Then True;
>
>      If RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*\u200C
>
>         (Joining_Type:T)*(Joining_Type:{R,D})) Then True;
> =====
>
> Maybe it should instead be:
>
> =====
>   Rule Set:
>
>      False;
>
>      If Canonical_Combining_Class(Before(cp)) .eq.  Virama Then True;
>
>      For All Characters:
>
>         If RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*\u200C
>
>            (Joining_Type:T)*(Joining_Type:{R,D})) Then True;
>      End For;
> =====
>
> Can someone verify whether or not that takes care of Simon's example?
>
> --Paul Hoffman

