Comments on IDNA Bidi

Mon Jan 14 22:17:10 CET 2008

Harald,

> Ken, thanks for being more precise!

You're welcome.

> I do think my original questions still remain unclear:
> - What characters should we test?
> - What property of the strings formed should we test?
> - What context should we test that property in?
> 
> The first 2 seem to be closing in on being clear enough to be testable.
> I still have problems with the third.

Although bidi is not my forte by a long shot, I'll make
another attempt at explaining this even *more* clearly.
I went back and took another careful pass on your bidi-02-txt,
to try to figure out *exactly* what you are looking to test
in section 4.1, Alternative approach, and why that led Mark
to quickly respond that if you really wanted a test, you would
just do {blah blah}, where, unfortunately, exactly what
that consisted of was not entirely clear either to you or
to me. :-)

> I could quibble with the list - in particular, the @ sign occurs
> frequently next to labels, and it's an ON - same bidi class as the
> middle dot. But this is a good starting point.

In my reformulation of the issue, it should be clearer why
this doesn't really matter. The short response is that the
bidi algorithm depends *only* on the bidi class of the characters,
so if you test all possible combinations based on the permissible
bidi classes, it doesn't matter *which* character you are using
specifically in the string. You'd get the same result with
@ sign as your bc=ON exemplar as you would using any other
bc=ON exemplar.
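
As a quick illustration of that point, here is a minimal Python sketch that looks up Bidi_Class with the standard library's unicodedata module. The helper name bidi_classes is made up for this example, and the values reported depend on the Unicode version bundled with the Python in use.

```python
import unicodedata

def bidi_classes(text):
    """Return the Bidi_Class of each character in text (hypothetical helper)."""
    return [unicodedata.bidirectional(ch) for ch in text]

# The at sign and the middle dot share one Bidi_Class, so the bidi
# algorithm cannot tell them apart.
for ch in ("@", "\u00B7"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: bc={unicodedata.bidirectional(ch)}")

# Two strings that differ only in which bc=ON exemplar they use
# present the same class sequence to the algorithm.
print(bidi_classes("a@b") == bidi_classes("a\u00B7b"))
```

Since the class sequences are identical, any implementation driven only by Bidi_Class has to produce the same levels and the same reordering for both strings.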

> But - what values are you assuming for "sor" and "eor"? 

I'm not assuming *any* values for them. All I am assuming is
that given an assumption of paragraph direction, "sor" and
"eor" for all runs will be unambiguously determined by the algorithm 
for any string being tested.

> My preliminary
> coding shows that there are cases that come out OK when these are
> consistent with the paragraph direction, but come out looking bad when
> they are different.

I think your test may be set up wrong then.

> That's what comes from not testing before I speak... the problematic one
> I tested is equivalent to ALEF 1 A 2 BET, which will display as 1 ALEF A
> 2 BET in a totally LTR context, and as BET 1 A 2 ALEF in a totally RTL
> context. Unless I have made a programming error...

O.k., without using explicit embeddings, the way you'd get
a "totally LTR context" or a "totally RTL context" would be to
insert these inside strong L or strong R characters which would
then define the paragraph embedding level by P2. So:

Totally LTR context:

C D ALEF 1  A 2  BET E F
0 0   0  0  0 0   0  0 0  --> sor=L, eor=L
L L   R  EN L EN  R  L L
L L   R  EN L L   R  L L  rule W7
0 0   1  2  0 0   1  0 0  rule I1
      ====

Then you apply the display reversals of rule L2 and get:

C D 1 ALEF A 2 BET E F
    xxxxxxxxxxxxxx

Totally RTL context:

GIMEL DALET ALEF 1  A 2  BET HE VAV
  1     1     1  1  1 1   1   1  1    --> sor=R, eor=R
  R     R     R  EN L EN  R   R  R
  R     R     R  EN L L   R   R  R  rule W7
  1     1     1  2  2 2   1   1  1  rule I2
                 ======

Then you apply the display reversals of rule L2 and get:

VAV HE BET 1 A 2 ALEF DALET GIMEL
       xxxxxxxxxxxxxx

That is consistent with your results.
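
The two derivations above can be replayed mechanically. What follows is only a sketch in Python, under strong assumptions: the input contains nothing but the classes L, R and EN, there are no explicit embeddings (so one run, with sor taken from the paragraph level), and only rules W7, I1/I2 and L2 are modelled. The names resolve_levels, reorder and BC are invented for this example; it is not a full implementation of the bidi algorithm.

```python
def resolve_levels(classes, para_level):
    """Resolve levels for a single run of L / R / EN classes (sketch only)."""
    sor = "R" if para_level % 2 else "L"
    types = list(classes)
    # W7: an EN becomes L when the nearest preceding strong type (or sor) is L.
    last_strong = sor
    for i, t in enumerate(types):
        if t in ("L", "R"):
            last_strong = t
        elif t == "EN" and last_strong == "L":
            types[i] = "L"
    # I1 applies at even levels, I2 at odd levels.
    if para_level % 2 == 0:
        bump = {"L": 0, "R": 1, "EN": 2}
    else:
        bump = {"L": 1, "R": 0, "EN": 1}
    return [para_level + bump[t] for t in types]

def reorder(items, levels):
    """L2: reverse runs from the highest level down to the lowest odd level."""
    items = list(items)
    odd = [lv for lv in levels if lv % 2]
    if not odd:
        return items
    for level in range(max(levels), min(odd) - 1, -1):
        i = 0
        while i < len(items):
            if levels[i] >= level:
                j = i
                while j < len(items) and levels[j] >= level:
                    j += 1
                items[i:j] = reversed(items[i:j])
                i = j
            else:
                i += 1
    return items

# Bidi classes of the placeholder tokens used in the tables above.
BC = {"C": "L", "D": "L", "A": "L", "E": "L", "F": "L", "1": "EN", "2": "EN",
      "ALEF": "R", "BET": "R", "GIMEL": "R", "DALET": "R", "HE": "R", "VAV": "R"}

ltr = "C D ALEF 1 A 2 BET E F".split()
rtl = "GIMEL DALET ALEF 1 A 2 BET HE VAV".split()
for tokens, para_level in ((ltr, 0), (rtl, 1)):
    levels = resolve_levels([BC[t] for t in tokens], para_level)
    print(levels, " ".join(reorder(tokens, levels)))
```

Under those assumptions the sketch yields the same levels and the same display orders as the two tables.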

Effectively, what the bidi algorithm is doing in these
cases is treating the input to be read as follows, where
you start at the asterisk:

[C D] [ALEF 1] [A 2] [BET] [E F]
*> >    <   <   > >    <    > >

[GIMEL DALET ALEF] [1 A 2] [BET HE VAV]
*  <     <     <    > > >    <   <  <

> this kind of oddness
> is why I was asking half a year ago for test cases for the BIDI
> algorithm. It's just too complex an algorithm for me to be sure I've
> implemented it correctly without test data.

Well, yes, but I think what Asmus and Mark would counter, for
the most part, is that any short selection of test data is
necessarily incomplete, which is why they talk in terms of
test scenarios that iterate through all combinations of
Bidi_Class for strings. You set up a test that does that for
your implementation, and then do the same against one of the
reference implementations of the bidi algorithm. If the results
are different, there is a problem in your implementation,
presumably.
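
A rough sketch of what such a scenario might look like in Python follows. Everything here is an assumption made for illustration: the class subset in CLASSES, the length bound, and the calling convention impl(classes, para_level) for the two implementations being compared. No particular reference implementation's API is implied.

```python
from itertools import product

# Assumed subset of Bidi_Class values; a real test would use every class
# that the label rules permit.
CLASSES = ("L", "R", "EN")

def all_class_strings(max_len, classes=CLASSES):
    """Yield every class sequence of length 1..max_len."""
    for n in range(1, max_len + 1):
        yield from product(classes, repeat=n)

def find_mismatches(impl_a, impl_b, max_len, para_levels=(0, 1)):
    """Collect each (sequence, paragraph level) where the two callables disagree."""
    mismatches = []
    for seq in all_class_strings(max_len):
        for level in para_levels:
            if impl_a(seq, level) != impl_b(seq, level):
                mismatches.append((seq, level))
    return mismatches
```

Here impl_a would wrap your own implementation and impl_b a reference one, each returning resolved levels or a display order for the sequence. An empty result means the two agree on every sequence up to the chosen length, which is evidence of agreement within that bound and nothing more.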

> If I read it correctly, you're saying that the string "A" in a LTR
> paragraph has sor and eor set to L, but in the string "A<RLE>B<PDF>C",
> the levels will be 0 1 0, and the sor and eor for the run containing "B"
> will be RTL by X10.
> 
> In the string "A<RLE>B<LRE>C<PDF><PDF>", the levels will be 0 1 2, and
> the sor of B will be RTL, and the eor of B will be LTR.

Yes, but as I indicated, all the explicit embedding stuff is
irrelevant. All those codes are categorically ruled out by
RFC 3454, and I don't think anything we are proposing allows
them back in.

> I implemented and tested the algorithm starting after X10, because a) I
> didn't think we'd want to have directional characters in labels,

Correct.

> and b)
> because I couldn't see a way to constrain the values of sor and eor
> unless we start forbidding ANY explicit directionality characters in
> paragraphs  containing labels - which seems kind of draconian.

In arbitrary text paragraphs that happen to contain *mentions* of
IDNA labels, sure. But I don't think that is what we would be
testing, in any case. We're just testing concatenations of
labels with dots or other label separators, and the "paragraph"
in that case is just the string being tested.

> 
> So again - which combinations of "sor" and "eor" values do you think we
> should test for in the test described above?

As above. They are defined by the test context.

<sor> C D 1 ALEF A 2 BET E F <eor>

<sor> GIMEL DALET ALEF 1  A 2  BET HE VAV <eor>

Those *are* the paragraphs. The paragraph embedding levels are
defined (according to P2) by the "C" in the first case and
the "GIMEL" in the second case. And you derive the Bidi_Class
of the <sor> and <eor> in each case directly from the levels
of the runs. And in each case there is only one run, because
we aren't allowing explicit embeddings and overrides.
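
To make that concrete, here is a minimal Python sketch of the derivation, assuming no explicit embeddings or overrides, so that the single run sits at the paragraph embedding level. The function names paragraph_level and sor_eor are invented for this example.

```python
def paragraph_level(classes):
    """P2/P3: the first strong class (L, R or AL) sets the paragraph level."""
    for bc in classes:
        if bc == "L":
            return 0
        if bc in ("R", "AL"):
            return 1
    return 0  # no strong character found: default to level 0

def sor_eor(classes):
    """With one run at the paragraph level, sor and eor both follow that level."""
    direction = "R" if paragraph_level(classes) % 2 else "L"
    return direction, direction

# Class sequences of the two test paragraphs, in logical order.
print(sor_eor(["L", "L", "R", "EN", "L", "EN", "R", "L", "L"]))
print(sor_eor(["R", "R", "R", "EN", "L", "EN", "R", "R", "R"]))
```

The first paragraph starts with a strong L, so sor and eor are both L; the second starts with a strong R, so both are R.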

--Ken