Comments on IDNA Bidi

Kenneth Whistler kenw at sybase.com
Sat Jan 12 01:24:25 CET 2008


> > Bidi-2a.
> >
> >
> > If you really want a test, it would be something like the following:
> >
> >
> >    1. At build time, produce a test set T of characters, one from each
> >       of the BIDI classes where a character can be in IDNA (eg
> >       excluding B, LRE/O, RLE/O, and PDF). That is, roughly 14 characters.

Here are the 9 Bidi_Class values that would almost certainly
be allowed in IDNA labels, together with exemplar characters.

0061; L   # Ll  LATIN SMALL LETTER A
05D0; R   # Lo  HEBREW LETTER ALEF
0621; AL  # Lo  ARABIC LETTER HAMZA
0030; EN  # Nd  DIGIT ZERO
0660; AN  # Nd  ARABIC-INDIC DIGIT ZERO
002D; ES  # Pd  HYPHEN-MINUS
00B7; ON  # Po  MIDDLE DOT
200D; BN  # Cf  ZERO WIDTH JOINER
0300; NSM # Mn  COMBINING GRAVE ACCENT

The following 3 Bidi_Class values would almost certainly
be disallowed in labels, but including them in the test checks
for possible contexts of labels.

0024; ET # Sc   DOLLAR SIGN
002C; CS # Po   COMMA
0020; WS # Zs   SPACE

The other Bidi_Class values (B, S, LRE, LRO, RLE, RLO, PDF)
would be excluded categorically, I think, so you can exclude
them from the test.
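
As a quick sanity check, the exemplar lists above can be compared
against whatever Unicode Character Database a given runtime ships.
Here is a minimal sketch in Python, assuming the standard unicodedata
module. The names EXEMPLARS and mismatches are just for illustration,
and the answers depend on the UCD version bundled with the interpreter.

```python
import unicodedata

# Exemplar code points and the Bidi_Class each is expected to have,
# copied from the two lists above (9 likely-allowed + 3 context-only).
EXEMPLARS = {
    0x0061: "L",    # LATIN SMALL LETTER A
    0x05D0: "R",    # HEBREW LETTER ALEF
    0x0621: "AL",   # ARABIC LETTER HAMZA
    0x0030: "EN",   # DIGIT ZERO
    0x0660: "AN",   # ARABIC-INDIC DIGIT ZERO
    0x002D: "ES",   # HYPHEN-MINUS
    0x00B7: "ON",   # MIDDLE DOT
    0x200D: "BN",   # ZERO WIDTH JOINER
    0x0300: "NSM",  # COMBINING GRAVE ACCENT
    0x0024: "ET",   # DOLLAR SIGN
    0x002C: "CS",   # COMMA
    0x0020: "WS",   # SPACE
}

def mismatches():
    """Return (code point, expected, actual) for every disagreement."""
    result = []
    for cp, expected in EXEMPLARS.items():
        actual = unicodedata.bidirectional(chr(cp))
        if actual != expected:
            result.append((cp, expected, actual))
    return result

for cp, expected, actual in mismatches():
    print(f"U+{cp:04X}: expected {expected}, runtime says {actual}")
```

If the loop prints nothing, the runtime agrees with the table.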

> >    2. To test a given prospective label L, perform the following over
> >       all possible 2-character strings X and Y from T. (That is, this
> >       would be 14^4 iterations.)

With 12 class values represented, 12^4 = 20736 strings.

> >    3. Create the string S formed from: X + L + Y
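
Steps 2 and 3 can be sketched as a generator. This is only an
illustration, assuming Python and a 12-character test set T with one
exemplar per Bidi_Class; the names T and contexts are hypothetical.

```python
from itertools import product

# One exemplar per Bidi_Class: L, R, AL, EN, AN, ES, ON, BN, NSM, ET, CS, WS.
T = [
    "\u0061", "\u05D0", "\u0621", "\u0030", "\u0660", "\u002D",
    "\u00B7", "\u200D", "\u0300", "\u0024", "\u002C", "\u0020",
]

def contexts(label):
    """Yield every string S = X + label + Y, where X and Y each range
    over all 2-character strings drawn from T."""
    for x in product(T, repeat=2):
        for y in product(T, repeat=2):
            yield "".join(x) + label + "".join(y)

# 12 choices for each of the 4 surrounding positions.
count = sum(1 for _ in contexts("a1"))
print(count)  # 20736
```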

> >    4. Apply the BIDI algorithm to S twice, once with a RTL and once
> >       with LTR paragraph
> >       directions.

I.e., apply the bidi algorithm to S first with the assumption that
the paragraph direction is L (see UAX #9, BD5). Then apply it again
with the assumption that the paragraph direction is R. That
changes the current embedding level assigned at step X1 in the
bidi algorithm.

> >    5. If in the result any of the characters in the label are
> >       separated by a character
> >       from X or Y, the test fails.
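
The check in step 5 amounts to asking whether the label's characters
occupy one contiguous run of display positions. Here is a rough sketch,
assuming the bidi implementation under test hands back the visual
order as a list of logical indices; that interface and the function
name are assumptions for illustration, not any particular library's API.

```python
def label_is_contiguous(visual, label_start, label_len):
    """visual[k] is the logical index of the character displayed at
    position k. The label occupies logical indices
    label_start .. label_start + label_len - 1.
    Returns True if no context character is displayed between two
    label characters."""
    positions = [visual.index(i)
                 for i in range(label_start, label_start + label_len)]
    # Distinct positions form one run exactly when max - min == len - 1.
    return max(positions) - min(positions) == label_len - 1

# Logical order: ALEPH(0) BET(1) A(2) 1(3) GIMEL(4) DAV(5); label = "A1".
print(label_is_contiguous([5, 4, 2, 3, 1, 0], 2, 2))  # True
print(label_is_contiguous([3, 5, 4, 2, 1, 0], 2, 2))  # False
```

The second call models a display order in which the "1" is pulled
away from the "A", which is the kind of result the test should reject.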

> > However, this should really not be proposed as something that users of
> > IDNA should do. Instead, it should be used to test that Michel's
> > formulation is correct.
> Exactly - I want to test the algorithm before proposing one. However, I
> don't understand what you wrote above:
> 
> - if taken as written, it would test the string "A1" by embedding it
> between the strings "ALEPH BET" and "GIMEL DAV", which certainly would
> cause the test to fail (the "1" would pick up its directionality from
> the surrounding RTL characters, 

No, I don't think so. Mark can correct me, but in this case, the
"1" resolves EN->L, because it is preceded by "A".

> and the whole thing would likely display
> in the order of "1 DAV GIMEL A BET ALEPH" - I don't have my direction
> calculator with me).

--> DAV GIMEL A 1 BET ALEPH

which would pass the test.
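
To make that resolution concrete, here is a deliberately tiny sketch
in Python. It is not a conformant UAX #9 implementation: it assumes
only the types L, R and EN occur, that there are no explicit
embeddings or overrides, and it applies just rule W7, the implicit
level rules I1/I2, and the reversal rule L2. All names in it are
hypothetical.

```python
NAMES = ["ALEPH", "BET", "A", "1", "GIMEL", "DAV"]   # logical order
TYPES = ["R", "R", "L", "EN", "R", "R"]

def resolve_w7(types, para_level):
    """W7: an EN becomes L if the nearest preceding strong type is L.
    With no embeddings, sor is just the paragraph direction."""
    last_strong = "L" if para_level % 2 == 0 else "R"
    resolved = []
    for t in types:
        if t == "EN" and last_strong == "L":
            t = "L"
        if t in ("L", "R"):
            last_strong = t
        resolved.append(t)
    return resolved

def implicit_levels(types, para_level):
    """I1 (even level): R +1, EN +2.  I2 (odd level): L and EN +1."""
    if para_level % 2 == 0:
        bump = {"L": 0, "R": 1, "EN": 2}
    else:
        bump = {"L": 1, "R": 0, "EN": 1}
    return [para_level + bump[t] for t in types]

def visual_order(levels):
    """L2: from the highest level down to the lowest odd level,
    reverse each contiguous run of characters at that level or higher."""
    items = list(enumerate(levels))          # (logical index, level)
    odd = [lv for lv in levels if lv % 2 == 1]
    if not odd:
        return [i for i, _ in items]
    for lvl in range(max(levels), min(odd) - 1, -1):
        start = None
        for pos in range(len(items) + 1):
            inside = pos < len(items) and items[pos][1] >= lvl
            if inside and start is None:
                start = pos
            elif not inside and start is not None:
                items[start:pos] = items[start:pos][::-1]
                start = None
    return [i for i, _ in items]

def display(para_level):
    levels = implicit_levels(resolve_w7(TYPES, para_level), para_level)
    return " ".join(NAMES[i] for i in visual_order(levels))

print(display(1))  # DAV GIMEL A 1 BET ALEPH
print(display(0))  # BET ALEPH A 1 DAV GIMEL
```

Under those assumptions "A" and "1" stay adjacent for either
paragraph direction.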

> So I'm assuming you're thinking of some separators
> - which ones?
> 
> - what do you mean exactly by "with a RTL paragraph direction"? In
> particular, which of the 3 direction parameters "sor", "eor" and
> "embedding direction", which are input to the bidi algorithm, should be
> RTL, and should they all be locked to the same value, or should we also
> test mixtures of the 3?

No.

You start at step X1 by setting the current embedding level to the
"paragraph embedding level" (= paragraph direction). That can
either be L or R (see above). sor and eor are values calculated
for each run after determining all embedding levels (X2..X10).
For IDNA you can ignore most of that, because, unless I am
mistaken, explicit embeddings and explicit overrides would be
disallowed, anyway. So effectively you start at X1 and move
right to X9 and X10, then start resolving the weak types.

> 
> More details, please...

Mark may have more clarifications to offer.

--Ken