bidi spec

Erik van der Poel erikv at google.com
Thu Feb 7 06:53:32 CET 2008


It turns out that ICU4C has an option called
UBIDI_KEEP_BASE_COMBINING, and this did the trick (putting the base
character and combining mark in the "correct" order). When I did that,
I found that I no longer needed the new rule that I mentioned in my
previous email. You're also right that "If an R, AL or AN is present,
no L may be present" is simply redundant (same as the previous rule).
So here are all the rules, with "Keep" and "Remove" indicated:

Keep:
Only characters with the BIDI properties L, R, AL, EN, ES, BN, ON
and NSM are allowed.

Keep:
ES and ON are not allowed in the first position.

Keep:
ES and ON, followed by zero or more NSM, are not allowed in the
last position.

Keep:
If an L is present, no R, AL or AN may be present.

Remove (redundant):
If an R, AL or AN is present, no L may be present.

Keep:
If an EN is present, no AN may be present.

Remove (redundant):
If an AN is present, no EN may be present.

Remove:
If an AN is present, at least one R or AL must be present.

Keep:
The first character may not be an NSM.

Keep:
The first character may not be an EN (European Number) or an AN
(Arabic Number).
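
For readers who want to experiment, the "Keep" rules above can be
sketched as a small checker. This is only an illustration, not the test
harness discussed in this thread: the function name check_label and the
wording of the messages are made up here, and the bidi classes come from
Python's unicodedata module rather than ICU. The rules are implemented
literally as quoted, so AN is not in the allowed set.

```python
import unicodedata

# Bidi classes allowed by the first "Keep" rule, exactly as quoted above.
ALLOWED = {"L", "R", "AL", "EN", "ES", "BN", "ON", "NSM"}


def check_label(label):
    """Return a list of violated "Keep" rules (empty if none).

    check_label is a hypothetical helper written for this sketch.
    """
    classes = [unicodedata.bidirectional(ch) for ch in label]
    present = set(classes)
    problems = []

    if not present <= ALLOWED:
        problems.append("disallowed bidi class")
    if classes and classes[0] in ("ES", "ON"):
        problems.append("ES or ON in first position")

    # Ignore trailing NSMs, then look at what is left in last position.
    trimmed = list(classes)
    while trimmed and trimmed[-1] == "NSM":
        trimmed.pop()
    if trimmed and trimmed[-1] in ("ES", "ON"):
        problems.append("ES or ON in last position")

    if "L" in present and present & {"R", "AL", "AN"}:
        problems.append("L mixed with R, AL or AN")
    if "EN" in present and "AN" in present:
        problems.append("EN mixed with AN")
    if classes and classes[0] == "NSM":
        problems.append("NSM in first position")
    if classes and classes[0] in ("EN", "AN"):
        problems.append("EN or AN in first position")
    return problems
```

For example, check_label("abc") returns an empty list, while a label
that mixes a Latin letter with a Hebrew letter reports the L/R rule.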

Harald, are you also able to remove the rules that I marked "Remove"
above, and still have the tests pass?

Erik

On Feb 6, 2008 7:52 PM, Mark Davis <mark.davis at icu-project.org> wrote:
> Here is what is happening with that. The BIDI algorithm is designed for
> display, and in display, the NSMs are designed to follow their base -- in
> display order. That means an NSM following an R character will come after it
> within its level (odd). So that is covered by the following rule:
>
> L3. Combining marks applied to a right-to-left base character will at this
> point precede their base character. If the rendering engine expects them to
> follow the base characters in the final display process, then the ordering
> of the marks and the base character must be reversed.
>
> What you are not seeing when you just look at the text is that what the bidi
> algorithm actually produces is a series of levels associated with the text.
> That level information is available at the time of L3 for the display engine
> to use.
>
> So what you need to do is make your test do L3 (not done by ICU, since it is
> targeted at display layout). So you need to make one more pass through each
> segment of text that is at an odd level, and reverse any sequence matching
> the following regex:
>  /[:bc=NSM:]+ [:^bc=NSM:]/
>
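
The extra pass described above might look roughly like the following
sketch. It assumes the caller already has the text in display order
together with one resolved level per character (for instance from a
bidi implementation); the function names is_nsm and apply_l3 are
invented for this example, and a "segment" is taken here to mean a run
of characters sharing the same odd level.

```python
import unicodedata


def is_nsm(ch):
    """True if the character has bidi class NSM."""
    return unicodedata.bidirectional(ch) == "NSM"


def apply_l3(visual, levels):
    """Reverse each 'NSM+ non-NSM' sequence inside odd-level runs.

    visual: string in display order.
    levels: one embedding level per character, in the same order.
    apply_l3 is a hypothetical helper written for this sketch.
    """
    out = list(visual)
    n = len(out)
    i = 0
    while i < n:
        level = levels[i]
        if level % 2 == 1 and is_nsm(out[i]):
            # Collect the run of marks at this odd level.
            j = i
            while j < n and levels[j] == level and is_nsm(out[j]):
                j += 1
            # If a base character follows at the same level, flip
            # marks + base so the marks come after their base.
            if j < n and levels[j] == level:
                out[i:j + 1] = out[i:j + 1][::-1]
                i = j + 1
            else:
                i = j
        else:
            i += 1
    return "".join(out)
```

With a Hebrew point followed by its base letter at level 1, the pair is
swapped so the mark follows the base again; text at even levels is left
untouched.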
> =================
>
> I haven't looked in detail at "If an R, AL or AN is present, no L may be
> present.", and don't have the other rules handy here -- it may be redundant
> with them. But just to restate the test case:
>
> For the collision test, your test should be checking two environments:
>
> a) RLM + test_string
> b) LRM + test_string
>
> You'll test a series of strings that cover all the combinations. There is a
> collision failure if string1 has the same bidi results in either (a) or (b)
> as string2 in either (a) or (b).
>
> With that, if I have
>
> test_string1 = Ab
> test_string2 = bA
>
> I will get:
>
> test_string1 in (a) => bA
> test_string2 in (b) => bA
>
> thus a collision. I also get a collision with
>
> test_string1 in (b) => Ab
> test_string2 in (a) => Ab
>
> As I said, though, I don't have the other rules handy here -- "If an R, AL
> or AN is present, no L may be present." may be just redundant.
>
> Mark
>
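
The collision test Mark describes can be illustrated with a toy model.
This is a sketch under strong assumptions, not a real bidi
implementation: uppercase letters stand in for strong R characters and
everything else for strong L, the RLM/LRM prefix is modelled as a
paragraph direction flag, and the names toy_levels, toy_reorder and
collides are invented here. A real test would substitute an actual
bidi implementation for toy_reorder.

```python
def toy_levels(text, rtl_paragraph):
    """Toy levels: uppercase = R (level 1); others = L (level 0 or 2)."""
    l_level = 2 if rtl_paragraph else 0
    return [1 if ch.isupper() else l_level for ch in text]


def toy_reorder(text, rtl_paragraph):
    """Reorder to display order by reversing runs, highest level first."""
    chars = list(text)
    levels = toy_levels(text, rtl_paragraph)
    odd = [lv for lv in levels if lv % 2 == 1]
    if not odd:
        return text
    for level in range(max(levels), min(odd) - 1, -1):
        i = 0
        while i < len(chars):
            if levels[i] < level:
                i += 1
                continue
            # Find the contiguous run at this level or higher.
            j = i
            while j < len(chars) and levels[j] >= level:
                j += 1
            chars[i:j] = chars[i:j][::-1]
            levels[i:j] = levels[i:j][::-1]
            i = j
    return "".join(chars)


def collides(s1, s2, reorder=toy_reorder):
    """True if s1 in either environment displays like s2 in either."""
    shown1 = {reorder(s1, rtl) for rtl in (True, False)}
    shown2 = {reorder(s2, rtl) for rtl in (True, False)}
    return bool(shown1 & shown2)
```

Under this toy model, "Ab" and "bA" collide in the way shown in the
quoted example, while two all-lowercase strings such as "ab" and "ba"
do not.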

