Comments on IDNA Bidi
Kenneth Whistler
kenw at sybase.com
Sat Jan 12 01:24:25 CET 2008
> > Bidi-2a.
> >
> >
> > If you really want a test, it would be something like the following:
> >
> >
> > 1. At build time, produce a test set T of characters, one from each
> > of the BIDI classes where a character can be in IDNA (eg
> > excluding B, LRE/O, RLE/O, and PDF). That is, roughly 14 characters.
Here are the 9 Bidi_Class values that would almost certainly
be allowed in IDNA labels, together with exemplar characters.
0061; L # Ll LATIN SMALL LETTER A
05D0; R # Lo HEBREW LETTER ALEF
0621; AL # Lo ARABIC LETTER HAMZA
0030; EN # Nd DIGIT ZERO
0660; AN # Nd ARABIC-INDIC DIGIT ZERO
002D; ES # Pd HYPHEN-MINUS
00B7; ON # Po MIDDLE DOT
200D; BN # Cf ZERO WIDTH JOINER
0300; NSM # Mn COMBINING GRAVE ACCENT
The following 3 Bidi_Class values would almost certainly
be disallowed in labels, but including them in the test checks
for possible contexts of labels.
0024; ET # Sc DOLLAR SIGN
002C; CS # Po COMMA
0020; WS # Zs SPACE
The other Bidi_Class values (B, S, LRE, LRO, RLE, RLO, PDF)
would be excluded categorically, I think, so you can exclude
them from the test.
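
As a sanity check on the exemplars above, here is a minimal Python sketch that looks up each character's Bidi_Class with the standard unicodedata module. The names TEST_SET and mismatches are only illustrative, and the lookup reflects whatever Unicode version your Python build ships.

```python
import unicodedata

# One exemplar code point per Bidi_Class in the test set T.
# TEST_SET is an illustrative name, not something defined by IDNA or UAX #9.
TEST_SET = {
    "L": 0x0061,    # LATIN SMALL LETTER A
    "R": 0x05D0,    # HEBREW LETTER ALEF
    "AL": 0x0621,   # ARABIC LETTER HAMZA
    "EN": 0x0030,   # DIGIT ZERO
    "AN": 0x0660,   # ARABIC-INDIC DIGIT ZERO
    "ES": 0x002D,   # HYPHEN-MINUS
    "ON": 0x00B7,   # MIDDLE DOT
    "BN": 0x200D,   # ZERO WIDTH JOINER
    "NSM": 0x0300,  # COMBINING GRAVE ACCENT
    "ET": 0x0024,   # DOLLAR SIGN (context only)
    "CS": 0x002C,   # COMMA (context only)
    "WS": 0x0020,   # SPACE (context only)
}


def mismatches(test_set):
    """Return {expected_class: actual_class} for exemplars whose class differs."""
    found = {}
    for expected, code_point in test_set.items():
        actual = unicodedata.bidirectional(chr(code_point))
        if actual != expected:
            found[expected] = actual
    return found


print(mismatches(TEST_SET))
```

An empty dictionary means every exemplar has the class listed above.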
> > 2. To test a given prospective label L, perform the following over
> > all possible 2-character strings X and Y from T. (That is, this
> > would be 14^4 iterations.)
With 12 class values represented, 12^4 = 20736 strings.
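
To make the count concrete, here is a small Python sketch that enumerates every (X, Y) pair of 2-character strings over the 12 exemplars. The names T, two_char_strings and contexts are illustrative only.

```python
from itertools import product

# The 12 exemplar characters, one per Bidi_Class in the test set.
T = [chr(cp) for cp in (
    0x0061, 0x05D0, 0x0621, 0x0030, 0x0660, 0x002D,
    0x00B7, 0x200D, 0x0300, 0x0024, 0x002C, 0x0020,
)]


def two_char_strings(chars):
    """All 2-character strings over chars (repetition allowed)."""
    return ["".join(pair) for pair in product(chars, repeat=2)]


def contexts(chars):
    """Iterate over every (X, Y) pair of 2-character strings over chars."""
    strings = two_char_strings(chars)
    return product(strings, strings)


count = sum(1 for _ in contexts(T))
print(len(two_char_strings(T)), count)
```

That is 144 choices for X times 144 choices for Y, so the exhaustive test stays cheap for any single label.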
> > 3. Create the string S formed from: X + L + Y
> > 4. Apply the BIDI algorithm to S twice, once with a RTL and once
> > with LTR paragraph
> > directions.
I.e. apply the bidi algorithm to S first with the assumption that
the paragraph direction is L (See UAX #9, BD5). Then apply it again
with the assumption that the paragraph direction is R. That
changes the current embedding level assigned at step X1 in the
bidi algorithm.
> > 5. If in the result any of the characters in the label are
> > separated by a character from X or Y, the test fails.
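
The pass/fail condition in step 5 can be sketched as a contiguity check. This is only a sketch under assumptions: Python's standard library has no bidi implementation, so visual_order is taken to be a list of logical indices into S in display order, produced by whatever UAX #9 implementation you have at hand. The function names are hypothetical.

```python
def build_s(x, label, y):
    """Step 3: S = X + L + Y. Returns S and the label's start index in S."""
    return x + label + y, len(x)


def label_stays_contiguous(visual_order, label_start, label_len):
    """Step 5: True if the label's characters occupy adjacent display positions.

    visual_order is assumed to be a permutation of range(len(S)) giving the
    logical index shown at each display position.
    """
    label = set(range(label_start, label_start + label_len))
    positions = [pos for pos, idx in enumerate(visual_order) if idx in label]
    if not positions:
        return label_len == 0
    # All label characters must be present and span exactly label_len positions.
    return (len(positions) == label_len
            and positions[-1] - positions[0] == label_len - 1)


s, start = build_s("xy", "ab", "zw")
# Identity order: nothing moved, so the label is still in one piece.
print(label_stays_contiguous(list(range(len(s))), start, 2))
# A context character (index 4) displayed between the label characters.
print(label_stays_contiguous([0, 1, 2, 4, 3, 5], start, 2))
```

The check deliberately ignores the order of the label's characters among themselves, since the test as described only asks whether something from X or Y ends up between them.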
> > However, this should really not be proposed as something that users of
> > IDNA should do. Instead, it should be used to test that Michel's
> > formulation is correct.
> Exactly - I want to test the algorithm before proposing one. However, I
> don't understand what you wrote above:
>
> - if taken as written, it would test the string "A1" by embedding it
> between the strings "ALEPH BET" and "GIMEL DAV", which certainly would
> cause the test to fail (the "1" would pick up its directionality from
> the surrounding RTL characters,
No, I don't think so. Mark can correct me, but in this case, the
"1" resolves EN->L (rule W7), because the nearest preceding strong
character is the "A".
> and the whole thing would likely display
> in the order of "1 DAV GIMEL A BET ALEPH" - I don't have my direction
> calculator with me).
--> DAV GIMEL A 1 BET ALEPH
which would pass the test.
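
For anyone who wants to check this by hand, here is a deliberately tiny Python sketch of the relevant rules. It is not a bidi implementation: it assumes the text contains only L, R and EN characters, no explicit embeddings or overrides, and no neutrals, so only W7, I1/I2 and L2 ever apply. All function names are illustrative.

```python
def resolve_weak(types, paragraph_level):
    """Rule W7 only: EN becomes L if the nearest preceding strong type
    (or sor, which here is the paragraph direction) is L."""
    last_strong = "L" if paragraph_level % 2 == 0 else "R"
    resolved = []
    for t in types:
        if t in ("L", "R"):
            last_strong = t
        elif t == "EN" and last_strong == "L":
            t = "L"
        resolved.append(t)
    return resolved


def implicit_levels(types, paragraph_level):
    """Rules I1 (even level) and I2 (odd level) for L, R and EN."""
    if paragraph_level % 2 == 0:
        bump = {"L": 0, "R": 1, "EN": 2}
    else:
        bump = {"L": 1, "R": 0, "EN": 1}
    return [paragraph_level + bump[t] for t in types]


def visual_order(levels):
    """Rule L2: from the highest level down to the lowest odd level,
    reverse each contiguous run of characters at that level or higher."""
    order = list(range(len(levels)))
    if not levels:
        return order
    highest = max(levels)
    lowest_odd = min((lv for lv in levels if lv % 2), default=highest + 1)
    for level in range(highest, lowest_odd - 1, -1):
        i = 0
        while i < len(order):
            if levels[order[i]] < level:
                i += 1
                continue
            j = i
            while j < len(order) and levels[order[j]] >= level:
                j += 1
            order[i:j] = order[i:j][::-1]
            i = j
    return order


def display(names, types, paragraph_level):
    """Character names in left-to-right display order."""
    resolved = resolve_weak(types, paragraph_level)
    levels = implicit_levels(resolved, paragraph_level)
    return " ".join(names[i] for i in visual_order(levels))


NAMES = ["ALEPH", "BET", "A", "1", "GIMEL", "DAV"]
TYPES = ["R", "R", "L", "EN", "R", "R"]
print(display(NAMES, TYPES, 1))  # paragraph direction R
print(display(NAMES, TYPES, 0))  # paragraph direction L
```

Under these assumptions the ordering given above corresponds to paragraph direction R. With paragraph direction L the two Hebrew pairs are each reversed in place, and "A 1" again stays together.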
> So I'm assuming you're thinking of some separators
> - which ones?
>
> - what do you mean exactly by "with a RTL paragraph direction"? In
> particular, which of the 3 direction parameters "sor", "eor" and
> "embedding direction", which are input to the bidi algorithm, should be
> RTL, and should they all be locked to the same value, or should we also
> test mixtures of the 3?
No.
You start at step X1 by setting the current embedding level to the
"paragraph embedding level" (= paragraph direction). That can
either be L or R (see above). sor and eor are values calculated
for each run after determining all embedding levels (X2..X10).
For IDNA you can ignore most of that, because, unless I am
mistaken, explicit embeddings and explicit overrides would be
disallowed, anyway. So effectively you start at X1 and move
right to X9 and X10, then start resolving the weak types.
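
To illustrate why sor and eor are not independent inputs, here is a short Python sketch of the X10 calculation as I read it: each level run gets sor and eor from the higher of the levels on either side of its boundary, with the paragraph embedding level standing in at the start and end of the text. The function names are illustrative.

```python
from itertools import groupby


def direction(level):
    """Even levels are left-to-right, odd levels are right-to-left."""
    return "L" if level % 2 == 0 else "R"


def sor_eor(levels, paragraph_level):
    """Return (level, run_length, sor, eor) for each level run."""
    runs = [(level, len(list(group))) for level, group in groupby(levels)]
    result = []
    for i, (level, length) in enumerate(runs):
        before = runs[i - 1][0] if i > 0 else paragraph_level
        after = runs[i + 1][0] if i + 1 < len(runs) else paragraph_level
        result.append((level, length,
                       direction(max(level, before)),
                       direction(max(level, after))))
    return result


# Without explicit embeddings or overrides, every character sits at the
# paragraph embedding level when X10 is applied, so there is a single run.
print(sor_eor([1] * 6, 1))
print(sor_eor([0] * 6, 0))
```

With no explicit codes there is one run, and sor and eor both simply equal the paragraph direction, so there are no mixtures of the three to test.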
>
> More details, please...
Mark may have more clarifications to offer.
--Ken