bidi spec

Mark Davis mark.davis at icu-project.org
Thu Feb 7 08:01:20 CET 2008


I don't think these are the minimal rules yet, since some *parts* of these
rules can be removed. Here are the ones you say to keep:

Proposed rules (numbering for reference):

   1. Only characters with the BIDI properties L, R, AL, EN, ES, BN, ON
   and NSM are allowed.
   2. ES and ON are not allowed in the first position
   3. ES and ON, followed by zero or more NSM, is not allowed in the last
   position
   4. If an L is present, no R, AL or AN may be present.
   5. If an EN is present, no AN may be present
   6. The first character may not be an NSM
   7. The first character may not be an EN (European Number) or an AN
   (Arabic Number).

Comments

   1. All of this applies only to Bidi labels: that is, those with BIDI
   properties R, AL, or AN. Because of that, #4 can be changed to be simply: No
   L.
   2. According to #1, AN is not allowed at all. That would remove #5,
   and remove AN from #4 and #7. However, I think that's a mistake: as
   discussed before, we need to include AN in #1.
   3. Because the protocol limits to only [[:L:][:Mn:][:Mc:][:Nd:]] plus
   a handful of exceptions, #1 is redundant because of the other restrictions
   in the protocol document. If it is retained, we should at least have a note
   about that.
   4. The only characters allowed in NSM are [:Mn:] and [:Me:]. The
   protocol forbids [:Me:] entirely, and forbids [:Mc:] in first position. So
   #6 is redundant. If it is retained, we should at least have a note about
   that.
   5. #7 can be combined with #2; all about the first character.

Assuming the rest are required (although it might be worth trying the
removal of individual properties to make sure all are), here is what a
minimal list would look like, I think:

For any label containing a character with BIDI property values R, AL, or AN:

   1. No character can have the BIDI property value of L.
   2. The first character cannot have BIDI property value ES, ON, EN, or AN.
   3. The label cannot end with a sequence where the first character has
   BIDI property value ES or ON, and the remainder is a sequence of zero or
   more characters with BIDI property value NSM.
   4. The label cannot contain two characters where one has BIDI property
   value EN and the other has BIDI property value AN.

This would be equivalent to saying in regex notation:

If the label matches:

   - .*[[:bc=R:][:bc=AL:][:bc=AN:]].*

Then it MUST NOT also match:

   - .*[:bc=L:].*
   | [[:bc=ES:][:bc=ON:][:bc=EN:][:bc=AN:]].*
   | .*[[:bc=ES:][:bc=ON:]][:bc=NSM:]*
   | .*[:bc=EN:].*[:bc=AN:].*
   | .*[:bc=AN:].*[:bc=EN:].*

We would need to verify that this matches all the exclusions above, of
course. You can do that with ICU's regex, if you are feeling adventurous! You
might have to use Perl syntax for the properties, e.g. \p{bc=L} and so on.
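
As a rough cross-check of the minimal list (not the ICU regex itself), here
is a sketch in Python that looks up each character's BIDI class with the
standard unicodedata module and applies the four rules directly. The function
name violates_minimal_rules is made up for illustration, and the results
depend on the Unicode version shipped with the Python interpreter.

```python
import unicodedata


def violates_minimal_rules(label):
    """Return True if a Bidi label breaks one of the four minimal rules.

    Hypothetical helper for illustration only. Labels with no R, AL or AN
    character are not Bidi labels, so the rules do not apply to them.
    """
    bc = [unicodedata.bidirectional(ch) for ch in label]
    if not any(c in ("R", "AL", "AN") for c in bc):
        return False

    # Rule 1: no character may have BIDI class L.
    if "L" in bc:
        return True

    # Rule 2: the first character may not be ES, ON, EN or AN.
    if bc[0] in ("ES", "ON", "EN", "AN"):
        return True

    # Rule 3: the label may not end with ES or ON followed by zero or more NSM.
    i = len(bc) - 1
    while i >= 0 and bc[i] == "NSM":
        i -= 1
    if i >= 0 and bc[i] in ("ES", "ON"):
        return True

    # Rule 4: EN and AN may not both be present.
    if "EN" in bc and "AN" in bc:
        return True

    return False
```

For example, a Hebrew letter followed by a Latin letter trips rule 1, and a
Hebrew letter followed by HYPHEN-MINUS (class ES) trips rule 3, while a label
made only of Hebrew letters passes.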

Mark

On Feb 6, 2008 9:53 PM, Erik van der Poel <erikv at google.com> wrote:

> It turns out that ICU4C has an option called
> UBIDI_KEEP_BASE_COMBINING, and this did the trick (putting the base
> character and combining mark in the "correct" order). When I did that,
> I found that I no longer needed the new rule that I mentioned in my
> previous email. You're also right that "If an R, AL or AN is present,
> no L may be present" is simply redundant (same as the previous rule).
> So here are all the rules, with "Keep" and "Remove" indicated:
>
> Keep:
> Only characters with the BIDI properties L, R, AL, EN, ES, BN, ON
> and NSM are allowed.
>
> Keep:
> ES and ON are not allowed in the first position
>
> Keep:
> ES and ON, followed by zero or more NSM, is not allowed in the
> last position
>
> Keep:
> If an L is present, no R, AL or AN may be present.
>
> Remove (redundant):
> If an R, AL or AN is present, no L may be present.
>
> Keep:
> If an EN is present, no AN may be present
>
> Remove (redundant):
> If an AN is present, no EN may be present
>
> Remove:
> If an AN is present, at least one R or AL must be present
>
> Keep:
> The first character may not be an NSM
>
> Keep:
> The first character may not be an EN (European Number) or an AN
> (Arabic Number).
>
> Harald, are you also able to remove the rules that I marked "Remove"
> above, and still have the tests pass?
>
> Erik
>
> On Feb 6, 2008 7:52 PM, Mark Davis <mark.davis at icu-project.org> wrote:
> > Here is what is happening with that. The BIDI algorithm is designed for
> > display, and in display, the NSMs are designed to follow their base --
> > in display order. That means an NSM following an R character will come
> > after it within its level (odd). So that is covered by the following rule:
> >
> > L3. Combining marks applied to a right-to-left base character will at
> > this point precede their base character. If the rendering engine expects
> > them to follow the base characters in the final display process, then the
> > ordering of the marks and the base character must be reversed.
> >
> > What you are not seeing when you just look at the text is that what the
> > bidi algorithm actually produces is a series of levels associated with
> > the text. That level information is available at the time of L3 for the
> > display engine to use.
> >
> > So what you need to do is make your test do L3 (not done by ICU, since
> > it is targeted at display layout). So you need to make one more pass
> > through each segment of text that is at an odd level, and reverse any
> > sequence matching the following regex:
> >  /[:bc=NSM:]+ [:^bc=NSM:]/
> >
> > =================
> >
> > I haven't looked in detail at "If an R, AL or AN is present, no L may be
> > present.", and don't have the other rules handy here -- it may be
> > redundant with them. But just to restate the test case:
> >
> > For the collision test, your test should be checking two environments:
> >
> > a) RLM + test_string
> > b) LRM + test_string
> >
> > You'll test a series of strings that cover all the combinations. There
> > is a collision failure if string1 has the same bidi results in either (a)
> > or (b) as string2 in either (a) or (b).
> >
> > With that, if I have
> >
> > test_string1 = Ab
> > test_string2 = bA
> >
> > I will get:
> >
> > test_string1 in (a) => bA
> > test_string2 in (b) => bA
> >
> > thus a collision. I also get a collision with
> >
> > test_string1 in (b) => Ab
> > test_string2 in (a) => Ab
> >
> > As I said, though, I don't have the other rules handy here -- "If an R,
> > AL or AN is present, no L may be present." may be just redundant.
> >
> > Mark
> >
>
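
For reference, the extra L3 pass described in the quoted message could be
sketched roughly as follows. This is only an illustration in Python, not what
ICU does internally; apply_l3 and is_nsm are hypothetical names, and the
input is assumed to be a single odd-level run that has already been put in
visual order.

```python
import unicodedata


def is_nsm(ch):
    """True if the character has BIDI class NSM (hypothetical helper)."""
    return unicodedata.bidirectional(ch) == "NSM"


def apply_l3(run):
    """Reverse every 'NSM+ followed by one non-NSM' sequence in the run.

    After reordering, marks on a right-to-left base precede that base;
    reversing each such sequence puts the base first again.
    """
    out = []
    i = 0
    n = len(run)
    while i < n:
        j = i
        while j < n and is_nsm(run[j]):
            j += 1
        if i < j < n:
            # run[i:j] are marks and run[j] is the base they belong to.
            out.append(run[j])
            out.extend(reversed(run[i:j]))
            i = j + 1
        elif j > i:
            # Marks at the very end with no base after them: leave as-is.
            out.extend(run[i:j])
            i = j
        else:
            out.append(run[i])
            i += 1
    return "".join(out)
```

With a run holding ALEF, DAGESH, BET in visual order (that is, logical BET +
DAGESH, ALEF reversed), the pass yields ALEF, BET, DAGESH, so the mark again
follows its base.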



-- 
Mark