Comments on IDNA Bidi
harald at alvestrand.no
Sat Jan 12 08:07:47 CET 2008
Ken, thanks for being more precise!
I do think my original questions still remain unclear:
- What characters should we test?
- What property of the strings formed should we test?
- What context should we test that property in?
The first 2 seem to be closing in on being clear enough to be testable.
I still have problems with the third.
Kenneth Whistler skrev:
>>> If you really want a test, it would be something like the following:
>>> 1. At build time, produce a test set T of characters, one from each
>>> of the BIDI classes where a character can be in IDNA (eg
>>> excluding B, LRE/O, RLE/O, and PDF). That is, roughly 14 characters.
> Here are the 9 Bidi_Class values that would almost certainly
> be allowed in IDNA labels, together with exemplar characters.
> 0061; L # Ll LATIN SMALL LETTER A
> 05D0; R # Lo HEBREW LETTER ALEF
> 0621; AL # Lo ARABIC LETTER HAMZA
> 0030; EN # Nd DIGIT ZERO
> 0660; AN # Nd ARABIC-INDIC DIGIT ZERO
> 002D; ES # Pd HYPHEN-MINUS
> 00B7; ON # Po MIDDLE DOT
> 200D; BN # Cf ZERO WIDTH JOINER
> 0300; NSM # Mn COMBINING GRAVE ACCENT
> The following 3 Bidi_Class values would almost certainly
> be disallowed in labels, but including them in the test checks
> for possible contexts of labels.
> 0024; ET # Sc DOLLAR SIGN
> 002C; CS # Po COMMA
> 0020; WS # Zs SPACE
> The other Bidi_Class values (B, S, LRE, LRO, RLE, RLO, PDF)
> would be excluded categorically, I think, so you can exclude
> them from the test.
I could quibble with the list - in particular, the @ sign occurs
frequently next to labels, and it's an ON - same bidi class as the
middle dot. But this is a good starting point.
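
As a sanity check on the list above, here is a minimal Python sketch that looks up each exemplar's Bidi_Class. It assumes the standard-library unicodedata module agrees with the values Ken lists; the names TEST_SET and mismatches are mine, purely for illustration.

```python
import unicodedata

# Exemplar characters from the list above, keyed by the expected Bidi_Class.
TEST_SET = {
    "L": "\u0061",    # LATIN SMALL LETTER A
    "R": "\u05d0",    # HEBREW LETTER ALEF
    "AL": "\u0621",   # ARABIC LETTER HAMZA
    "EN": "\u0030",   # DIGIT ZERO
    "AN": "\u0660",   # ARABIC-INDIC DIGIT ZERO
    "ES": "\u002d",   # HYPHEN-MINUS
    "ON": "\u00b7",   # MIDDLE DOT
    "BN": "\u200d",   # ZERO WIDTH JOINER
    "NSM": "\u0300",  # COMBINING GRAVE ACCENT
    "ET": "\u0024",   # DOLLAR SIGN
    "CS": "\u002c",   # COMMA
    "WS": "\u0020",   # SPACE
}


def mismatches(test_set):
    """Return the entries whose actual Bidi_Class differs from the expected one."""
    return {
        cls: ch
        for cls, ch in test_set.items()
        if unicodedata.bidirectional(ch) != cls
    }


if __name__ == "__main__":
    print(mismatches(TEST_SET))            # empty dict if all classes agree
    print(unicodedata.bidirectional("@"))  # the @ sign shares class ON
```
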
>>> 2. To test a given prospective label L, perform the following over
>>> all possible 2-character strings X and Y from T. (That is, this
>>> would be 14^4 iterations.)
> With 12 class values represented, 12^4 = 20736 strings.
>>> 3. Create the string S formed from: X + L + Y
>>> 4. Apply the BIDI algorithm to S twice, once with a RTL and once
>>> with LTR paragraph
> I.e. apply the bidi algorithm to S first with the assumption that
> the paragraph direction is L (See UAX #9, BD5). Then apply it again
> with the assumption that the paragraph direction is R. That
> changes the current embedding level assigned at step X1 in the
> bidi algorithm.
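
Steps 2 to 4 can be sketched roughly as follows. This only enumerates the inputs (every context pair, each under both paragraph directions); it does not run the bidi algorithm itself. TEST_CHARS, context_strings and test_cases are hypothetical names, and the 12 characters are the exemplars listed earlier.

```python
from itertools import product

# The 12 exemplar characters, one per Bidi_Class under test.
TEST_CHARS = "\u0061\u05d0\u0621\u0030\u0660\u002d\u00b7\u200d\u0300\u0024\u002c\u0020"


def context_strings(chars):
    """All 2-character strings over chars (12^2 = 144 for 12 characters)."""
    return ["".join(pair) for pair in product(chars, repeat=2)]


def test_cases(label, chars=TEST_CHARS):
    """Yield (paragraph_level, label_start, s) for every X, Y and direction.

    paragraph_level is 0 for an LTR paragraph and 1 for an RTL paragraph;
    label_start is the index in s where the label begins.
    """
    contexts = context_strings(chars)
    for x, y in product(contexts, repeat=2):
        s = x + label + y
        for paragraph_level in (0, 1):
            yield paragraph_level, len(x), s


if __name__ == "__main__":
    # 144 * 144 = 20736 strings, each tested in two paragraph directions.
    print(sum(1 for _ in test_cases("a1")))
```
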
Thanks - this has enough details that it's possible to start implementing!
But - what values are you assuming for "sor" and "eor"? My preliminary
coding shows that there are cases that come out OK when these are
consistent with the paragraph direction, but come out looking bad when
they are different.
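
To illustrate one way "sor" can change the outcome, here is a small sketch of rule W7 alone, as I read it in UAX #9. It assumes W1 to W6 have already been applied (so AL has become R), and apply_w7 is my own name, not anything from the spec.

```python
def apply_w7(types, sor):
    """Apply rule W7 only: an EN becomes L when the nearest preceding
    strong type (L or R, falling back to sor) is L.

    types is a list of Bidi_Class names for one level run; sor is "L" or "R".
    """
    result = list(types)
    last_strong = sor
    for i, bidi_type in enumerate(result):
        if bidi_type in ("L", "R"):
            last_strong = bidi_type
        elif bidi_type == "EN" and last_strong == "L":
            result[i] = "L"
    return result


if __name__ == "__main__":
    run = ["EN", "R"]  # a digit followed by a Hebrew letter
    print(apply_w7(run, sor="L"))
    print(apply_w7(run, sor="R"))
```

With sor = L the leading digit resolves to L; with sor = R it stays EN, so the same run gets different resolved types depending only on sor.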
>>> 5. If in the result any of the characters in the label are
>>> separated by a character
>>> from X or Y, the test fails.
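
The pass/fail check in step 5 could look something like this. It is only a sketch: it assumes the bidi implementation hands back the visual order as a list of logical indices covering every character of S, and label_stays_contiguous is a hypothetical helper name.

```python
def label_stays_contiguous(visual_order, label_start, label_len):
    """Return True if the label's characters occupy adjacent display positions.

    visual_order lists the logical indices of S in display order;
    the label occupies S[label_start : label_start + label_len],
    with label_len assumed to be at least 1.
    """
    positions = sorted(
        visual_order.index(i)
        for i in range(label_start, label_start + label_len)
    )
    # Adjacent positions span exactly label_len slots, in any internal order.
    return positions[-1] - positions[0] == label_len - 1


if __name__ == "__main__":
    # Label at logical indices 1..3 of a 5-character string.
    print(label_stays_contiguous([4, 3, 2, 1, 0], 1, 3))  # whole string reversed
    print(label_stays_contiguous([1, 0, 2, 3, 4], 1, 3))  # context char splits it
```

Note that this only checks that no context character lands inside the label; it says nothing about whether the label's own characters keep a sensible order.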
>>> However, this should really not be proposed as something that users of
>>> IDNA should do. Instead, it should be used to test that Michel's
>>> formulation is correct.
>> Exactly - I want to test the algorithm before proposing one. However, I
>> don't understand what you wrote above:
>> - if taken as written, it would test the string "A1" by embedding it
>> between the strings "ALEPH BET" and "GIMEL DAV", which certainly would
>> cause the test to fail (the "1" would pick up its directionality from
>> the surrounding RTL characters,
> No, I don't think so. Mark can correct me, but in this case, the
> "1" resolves EN->L, because it is preceded by "A".
>> and the whole thing would likely display
>> in the order of "1 DAV GIMEL A BET ALEPH" - I don't have my direction
>> calculator with me).
> --> DAV GIMEL A 1 BET ALEPH
> which would pass the test.
That's what comes from not testing before I speak... the problematic one
I tested is equivalent to ALEF 1 A 2 BET, which will display as 1 ALEF A
2 BET in a totally LTR context, and as BET 1 A 2 ALEF in a totally RTL
context. Unless I have made a programming error... this kind of oddness
is why I was asking half a year ago for test cases for the BIDI
algorithm. It's just too complex an algorithm for me to be sure I've
implemented it correctly without test data.
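
For what it's worth, the reordering step for that example can be reproduced with a short sketch of rule L2. The embedding levels below are hand-derived and should be treated as assumptions: for ALEF 1 A 2 BET I get 1 2 0 0 1 in an LTR paragraph (W7 turns the "2" into L, then I1) and 1 2 2 2 1 in an RTL paragraph (I2). reorder_l2 is my own helper name.

```python
def reorder_l2(levels):
    """Return logical indices in visual order, following rule L2:
    from the highest level down to the lowest odd level, reverse every
    contiguous sequence of characters at that level or higher.
    """
    order = list(range(len(levels)))
    if not levels:
        return order
    highest = max(levels)
    lowest_odd = min((lv for lv in levels if lv % 2 == 1), default=highest + 1)
    for level in range(highest, lowest_odd - 1, -1):
        i = 0
        while i < len(order):
            if levels[order[i]] >= level:
                j = i
                while j < len(order) and levels[order[j]] >= level:
                    j += 1
                order[i:j] = reversed(order[i:j])
                i = j
            else:
                i += 1
    return order


if __name__ == "__main__":
    tokens = ["ALEF", "1", "A", "2", "BET"]
    for name, levels in (("LTR", [1, 2, 0, 0, 1]), ("RTL", [1, 2, 2, 2, 1])):
        print(name, " ".join(tokens[i] for i in reorder_l2(levels)))
```

With those assumed levels, the sketch gives the same two display orders as quoted above.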
>> So I'm assuming you're thinking of some separators
>> - which ones?
>> - what do you mean exactly by "with a RTL paragraph direction"? In
>> particular, which of the 3 direction parameters "sor", "eor" and
>> "embedding direction", which are input to the bidi algorithm, should be
>> RTL, and should they all be locked to the same value, or should we also
>> test mixtures of the 3?
> You start at step X1 by setting the current embedding level to the
> "paragraph embedding level" (= paragraph direction). That can
> either be L or R (see above). sor and eor are values calculated
> for each run after determining all embedding levels (X2..X10).
> For IDNA you can ignore most of that, because, unless I am
> mistaken, explicit embeddings and explicit overrides would be
> disallowed, anyway. So effectively you start at X1 and move
> right to X9 and X10, then start resolving the weak types.
If I read it correctly, you're saying that the string "A" in a LTR
paragraph has sor and eor set to L, but in the string "A<RLE>B<PDF>C",
the levels will be 0 1 0, and the sor and eor for the run containing "B"
will be RTL by X10.
In the string "A<RLE>B<LRE>C<PDF><PDF>", the levels will be 0 1 2, and
the sor of B will be RTL, and the eor of B will be LTR.
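
My reading of X10 can be written down as a small sketch: split the levels into runs, and take sor and eor from the higher of the two levels on each side of a run boundary, using the paragraph level at the ends of the text. It assumes the explicit codes have already been removed and the levels are given; level_runs is a hypothetical name.

```python
def level_runs(levels, paragraph_level):
    """Split embedding levels into runs and compute sor/eor for each.

    Returns a list of (start, end, sor, eor) with end exclusive.
    sor/eor take the direction of the higher level at each boundary;
    odd levels are R, even levels are L.
    """
    def direction(level):
        return "R" if level % 2 else "L"

    runs = []
    start = 0
    for i in range(1, len(levels) + 1):
        if i == len(levels) or levels[i] != levels[start]:
            before = levels[start - 1] if start > 0 else paragraph_level
            after = levels[i] if i < len(levels) else paragraph_level
            sor = direction(max(before, levels[start]))
            eor = direction(max(levels[start], after))
            runs.append((start, i, sor, eor))
            start = i
    return runs


if __name__ == "__main__":
    print(level_runs([0, 1, 0], 0))  # A <RLE> B <PDF> C
    print(level_runs([0, 1, 2], 0))  # A <RLE> B <LRE> C <PDF> <PDF>
```

For the run containing "B" this gives sor = R and eor = R in the first string, and sor = R but eor = L in the second, which is the mixed case I am asking about.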
I implemented and tested the algorithm starting after X10, because a) I
didn't think we'd want to have directional characters in labels, and b)
because I couldn't see a way to constrain the values of sor and eor
unless we start forbidding ANY explicit directionality characters in
paragraphs containing labels - which seems kind of draconian.
So again - which combinations of "sor" and "eor" values do you think we
should test for in the test described above?