bidi spec

Vint Cerf vint at google.com
Thu Feb 7 14:19:31 CET 2008


just a small suggestion: minimization of rules may conflict with  
clarity and understanding. Perhaps one could distinguish the "rules"  
from "an implementation of the rules" that does try to fold some  
tests together?

v

On Feb 7, 2008, at 2:01 AM, Mark Davis wrote:

> I don't think these are the minimal rules yet, since some *parts*  
> of these rules can be removed. Here are the ones you say to keep:
>
> Proposed rules (numbering for reference):
> 1. Only characters with the BIDI properties L, R, AL, EN, ES, BN, ON
>    and NSM are allowed.
> 2. ES and ON are not allowed in the first position.
> 3. ES and ON, followed by zero or more NSM, is not allowed in the last
>    position.
> 4. If an L is present, no R, AL or AN may be present.
> 5. If an EN is present, no AN may be present.
> 6. The first character may not be an NSM.
> 7. The first character may not be an EN (European Number) or an AN
>    (Arabic Number).
> Comments:
> - All of this applies only to Bidi labels: that is, those with BIDI
>   properties R, AL, or AN. Because of that, #4 can be changed to be
>   simply: No L.
> - According to #1, AN is not allowed at all. That would remove #5,
>   and remove AN from #4 and #7. However, I think that's a mistake --
>   as discussed before -- and that we need to include AN in #1.
> - Because the protocol limits to only [[:L:][:Mn:][:Mc:][:Nd:]] plus
>   a handful of exceptions, #1 is redundant because of the other
>   restrictions in the protocol document. If it is retained, we should
>   at least have a note about that.
> - The only characters allowed in NSM are [:Mn:] and [:Me:]. The
>   protocol forbids [:Me:] entirely, and forbids [:Mc:] in first
>   position. So #6 is redundant. If it is retained, we should at least
>   have a note about that.
> - #7 can be combined with #2; both are about the first character.
>
> Assuming the rest are required (although it might be worth trying
> the removal of individual properties to make sure all are), here is
> what a minimal list would look like, I think:
>
> For any label containing a character with BIDI property values R,
> AL, or AN:
> - No character can have the BIDI property value of L.
> - The first character cannot have BIDI property value ES, ON, EN, or AN.
> - The label cannot end with a sequence where the first character has
>   BIDI property values ES or ON, and the remainder is a sequence of
>   zero or more characters with BIDI property value NSM.
> - The label cannot contain two characters where one has BIDI property
>   value EN and the other has BIDI property value AN.
> This would be equivalent to saying in regex notation:
>
> If the label matches:
> .*[[:bc=R:][:bc=AL:][:bc=AN:]].*
> Then it MUST NOT also match:
> .*[:bc=L:].*
> | [[:bc=ES:][:bc=ON:][:bc=EN:][:bc=AN:]].*
> | .*[[:bc=ES:][:bc=ON:]][:bc=NSM:]*
> | .*[:bc=EN:].*[:bc=AN:].*
> | .*[:bc=AN:].*[:bc=EN:].*
> We would need to verify that this matches all the exclusions above, of
> course. You can do that with ICU's regex, if you are feeling
> adventurous! You might have to use Perl syntax for the properties,
> e.g. \p{bc=L} and so on.
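
As an illustration of the minimal list above, here is a rough Python sketch that checks a label directly rather than through a regex. It is not taken from the thread: the function name `passes_minimal_rules` is hypothetical, and it assumes that the bidi classes reported by Python's standard `unicodedata.bidirectional` are an acceptable stand-in for the `bc` property values used in the regex notation.

```python
import unicodedata

RTL_TRIGGERS = {"R", "AL", "AN"}          # classes that make a label a "Bidi label"
BAD_FIRST = {"ES", "ON", "EN", "AN"}      # classes not allowed in first position
BAD_LAST = {"ES", "ON"}                   # classes not allowed last (ignoring trailing NSM)


def passes_minimal_rules(label: str) -> bool:
    """Hypothetical checker for the four minimal rules quoted above."""
    classes = [unicodedata.bidirectional(ch) for ch in label]

    # The rules only apply to labels containing R, AL, or AN.
    if not any(c in RTL_TRIGGERS for c in classes):
        return True

    # No character may have class L.
    if "L" in classes:
        return False

    # The first character may not be ES, ON, EN, or AN.
    if classes[0] in BAD_FIRST:
        return False

    # Skip trailing NSMs, then make sure the label does not end in ES or ON.
    end = len(classes)
    while end > 0 and classes[end - 1] == "NSM":
        end -= 1
    if end > 0 and classes[end - 1] in BAD_LAST:
        return False

    # EN and AN may not appear in the same label.
    if "EN" in classes and "AN" in classes:
        return False

    return True
```

For example, a label of two Hebrew letters passes, while the same letters followed by a hyphen-minus (class ES) fail the last-position rule. Running such a checker against the regex form on a set of test labels would be one way to do the verification suggested above.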
>
> Mark
>
> On Feb 6, 2008 9:53 PM, Erik van der Poel <erikv at google.com> wrote:
> It turns out that ICU4C has an option called
> UBIDI_KEEP_BASE_COMBINING, and this did the trick (putting the base
> character and combining mark in the "correct" order). When I did that,
> I found that I no longer needed the new rule that I mentioned in my
> previous email. You're also right that "If an R, AL or AN is present,
> no L may be present" is simply redundant (same as the previous rule).
> So here are all the rules, with "Keep" and "Remove" indicated:
>
> Keep:
> Only characters with the BIDI properties L, R, AL, EN, ES, BN, ON
> and NSM are allowed.
>
> Keep:
> ES and ON are not allowed in the first position
>
> Keep:
> ES and ON, followed by zero or more NSM, is not allowed in the
> last position
>
> Keep:
> If an L is present, no R, AL or AN may be present.
>
> Remove (redundant):
> If an R, AL or AN is present, no L may be present.
>
> Keep:
> If an EN is present, no AN may be present
>
> Remove (redundant):
> If an AN is present, no EN may be present
>
> Remove:
> If an AN is present, at least one R or AL must be present
>
> Keep:
> The first character may not be an NSM
>
> Keep:
> The first character may not be an EN (European Number) or an AN
> (Arabic Number).
>
> Harald, are you also able to remove the rules that I marked "Remove"
> above, and still have the tests pass?
>
> Erik
>
> On Feb 6, 2008 7:52 PM, Mark Davis <mark.davis at icu-project.org> wrote:
> > Here is what is happening with that. The BIDI algorithm is designed for
> > display, and in display, the NSMs are designed to follow their base -- in
> > display order. That means an NSM following an R character will come after it
> > within its level (odd). So that is covered by the following rule:
> >
> >
> >
> >
> > L3. Combining marks applied to a right-to-left base character will at this
> > point precede their base character. If the rendering engine expects them to
> > follow the base characters in the final display process, then the ordering
> > of the marks and the base character must be reversed.
> >
> >
> >
> > What you are not seeing when you just look at the text is that what the bidi
> > algorithm actually produces is a series of levels associated with the text.
> > That level information is available at the time of L3 for the display engine
> > to use.
> >
> > So what you need to do is make your test do L3 (not done by ICU, since it is
> > targeted at display layout). So you need to make one more pass through each
> > segment of text that is at an odd level, and reverse any sequence matching
> > the following regex:
> >  /[:bc=NSM:]+ [:^bc=NSM:]/
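
The extra L3 pass described above could be sketched in Python roughly as follows. This is an illustration, not code from the thread: the function name `apply_l3` is hypothetical, the input is assumed to be a single odd-level segment already in visual order, and `unicodedata.bidirectional` is assumed to give the same NSM classification as `[:bc=NSM:]`.

```python
import unicodedata


def is_nsm(ch: str) -> bool:
    """True if the character has bidi class NSM (non-spacing mark)."""
    return unicodedata.bidirectional(ch) == "NSM"


def apply_l3(segment: str) -> str:
    """Reverse every run of NSMs followed by one non-NSM character.

    `segment` is assumed to be one odd-level segment in visual order,
    where combining marks precede their right-to-left base character.
    """
    chars = list(segment)
    out = []
    i = 0
    while i < len(chars):
        # Find the end of a (possibly empty) run of NSMs starting at i.
        j = i
        while j < len(chars) and is_nsm(chars[j]):
            j += 1
        if j > i and j < len(chars):
            # NSM+ followed by a non-NSM: reverse the whole sequence so
            # the base comes first and the marks follow in logical order.
            out.extend(reversed(chars[i:j + 1]))
            i = j + 1
        elif j > i:
            # NSMs at the very end with no base after them: leave as-is.
            out.extend(chars[i:j])
            i = j
        else:
            # Ordinary character with no preceding marks.
            out.append(chars[i])
            i += 1
    return "".join(out)
```

Given a visual-order segment such as a Hebrew point followed by its base letter, the function moves the base in front of the point, which is the order a test comparing display results would expect.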
> >
> > =================
> >
> > I haven't looked in detail at "If an R, AL or AN is present, no L may be
> > present.", and don't have the other rules handy here -- it may be redundant
> > with them. But just to restate the test case:
> >
> > For the collision test, your test should be checking two environments:
> >
> > a) RLM + test_string
> > b) LRM + test_string
> >
> > You'll test a series of strings that cover all the combinations. There is a
> > collision failure if string1 has the same bidi results in either (a) or (b)
> > as string2 in either (a) or (b).
> >
> > With that, if I have
> >
> > test_string1 = Ab
> > test_string2 = bA
> >
> > I will get:
> >
> > test_string1 in (a) => bA
> > test_string2 in (b) => bA
> >
> > thus a collision. I also get a collision with
> >
> > test_string1 in (b) => Ab
> > test_string2 in (a) => Ab
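
The collision check above can be tried out with a toy model in Python. This sketch makes strong simplifying assumptions that are not from the thread: uppercase ASCII letters stand for R characters and lowercase letters for L characters (as in the `Ab` / `bA` example), only those two classes are handled, a base level of 1 stands in for the RLM prefix (a) and a base level of 0 for the LRM prefix (b), and reordering is done by a hand-written version of the bidi algorithm's level-based reversal rather than by ICU. The names `visual_order` and `collides` are hypothetical.

```python
def toy_levels(text: str, base: int) -> list[int]:
    """Embedding levels for the toy alphabet: uppercase = R, lowercase = L."""
    levels = []
    for ch in text:
        if ch.isupper():
            levels.append(1)                      # R is level 1 under either base
        else:
            levels.append(0 if base == 0 else 2)  # L nests one level up in RTL
    return levels


def visual_order(text: str, base: int) -> str:
    """Reorder `text` for display given base level 0 (LTR) or 1 (RTL)."""
    chars = list(text)
    levels = toy_levels(text, base)
    if not chars:
        return ""
    # From the highest level down to level 1, reverse every contiguous
    # run of characters at that level or higher.
    for level in range(max(levels), 0, -1):
        i = 0
        while i < len(chars):
            if levels[i] >= level:
                j = i
                while j < len(chars) and levels[j] >= level:
                    j += 1
                chars[i:j] = reversed(chars[i:j])
                levels[i:j] = reversed(levels[i:j])
                i = j
            else:
                i += 1
    return "".join(chars)


def collides(s1: str, s2: str) -> bool:
    """True if two distinct strings can display identically in (a) or (b)."""
    shown1 = {visual_order(s1, 1), visual_order(s1, 0)}
    shown2 = {visual_order(s2, 1), visual_order(s2, 0)}
    return s1 != s2 and bool(shown1 & shown2)
```

Under these assumptions, `visual_order("Ab", 1)` and `visual_order("bA", 0)` both give `bA`, so `collides("Ab", "bA")` is true, matching the example. A real test would replace the toy reordering with an actual bidi implementation.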
> >
> > As I said, though, I don't have the other rules handy here -- "If an R, AL
> > or AN is present, no L may be present." may be just redundant.
> >
> > Mark
> >
>
>
>
> -- 
> Mark