Comments on IDNA Bidi

Tue Jan 15 01:18:51 CET 2008

Harald,

Now focusing on section 4.1, "Alternative approach",
in bidi-02.txt, I'll try to explain what Mark's reaction
to the test was.

Section 4.1 now states:

========================================================

  Conceptually, to verify suitability as a domain name
  label, one constructs the string consisting of the label
  preceded and followed by a full stop (U+002E), and
  executes the Unicode bidirectional algorithm twice,
  once with <sor> (start of run) and <eor> (end of run)
  having direction L, and once with them having
  direction R. ...

  The following conditions MUST be true in both
  resulting strings for the string to be acceptable:

  * The leftmost and rightmost character of the resulting
    string in display order must be a full stop (U+002E)

  * No non-spacing mark (NSM) can occur in the second
    position of the string (leftmost in L order,
    rightmost in R order); that is, no mark can be
    allowed to attach to the delimiting characters.

  * The direction of the leftmost and rightmost characters
    in the string (the periods) must be either L or R

  Note that there is no requirement that the character
  sequence be the same in the two cases.

=========================================================

First, I need to quibble with some of the details in
that text.

1. The last note should be that there is no requirement
that the *display order* of the label be the same in
the two cases. Of course there is a requirement that
the *character sequence* be the same -- that is exactly
what is being tested in the first place, and the bidi
algorithm doesn't change character sequence per se.

2. The 3rd bullet condition is irrelevant. The first
bullet is requiring that the dots stay at the beginning
and end of the label -- which *is* relevant. But the
resolved bidi type of the bc=CS periods (U+002E) is
irrelevant if the display order requirement was met.

3. The second bullet is not properly construed. The
bidi algorithm does not move combining marks around
their bases in logical order. And they always
acquire the bidi type of their base character. So
that means for any base+combining mark in the string,
the resolved display order is still going to be
base followed by combining mark.
The condition that the second bullet is trying to guard
against is a constraint on the *input* character
sequence, rather than the display order resulting
from the application of the bidi algorithm. And
it can be stated simply as no input string can *start*
with an NSM.

4. Finally, indirectly what the 3rd bullet seems to be
requiring is also better stated as a condition on
the string itself -- namely that if it begins with
an LCat character, it must end with an LCat character
(or an NSMCat character), and if it begins with
an RCat character, it must end with an RCat character
(or an NSMCat character). That is easy to test even
prior to application of the bidi algorithm.

So here is how I would recast what section 4.1 is trying
to accomplish:

========================================================

  Conceptually, to verify suitability as a domain name
  label, one constructs the string consisting of the label
  preceded and followed by a full stop (U+002E), and
  executes the Unicode bidirectional algorithm twice,
  once with paragraph embedding direction L and
  once with paragraph embedding direction R.

  First, the following conditions MUST be true for
  the label itself:

  * It must not start with an NSMCat character.

  * If it starts with an LCat character, it must
    end with an LCat or NSMCat character.

  * If it starts with an RCat character, it must
    end with an RCat or NSMCat character.

  Furthermore, the following condition MUST be true for
  the display orders that result from the two
  applications of the Unicode bidirectional
  algorithm, for the string to be acceptable as
  a domain name label:

  * The leftmost and rightmost character of the resulting
    string in display order must be a full stop (U+002E)

  Note that there is no requirement that the display
  orders of the label itself be the same in the two cases.

=========================================================
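
The three label-only conditions above can be checked without
running the bidi algorithm at all. A minimal sketch in Python,
assuming that LCat maps to Bidi_Class L, RCat to Bidi_Class R
or AL, and NSMCat to Bidi_Class NSM (that mapping, the function
name, and the rejection of empty labels are assumptions made
for illustration, not taken from the draft):

```python
import unicodedata

# Assumed mapping of the draft's categories onto Bidi_Class values.
L_CAT = {"L"}
R_CAT = {"R", "AL"}
NSM_CAT = {"NSM"}


def label_passes_static_checks(label: str) -> bool:
    """Check the three label-only conditions; no bidi reordering is run."""
    if not label:
        # Assumption: an empty label is rejected outright.
        return False
    first = unicodedata.bidirectional(label[0])
    last = unicodedata.bidirectional(label[-1])
    # Condition 1: the label must not start with an NSMCat character.
    if first in NSM_CAT:
        return False
    # Condition 2: an LCat start requires an LCat or NSMCat end.
    if first in L_CAT and last not in (L_CAT | NSM_CAT):
        return False
    # Condition 3: an RCat start requires an RCat or NSMCat end.
    if first in R_CAT and last not in (R_CAT | NSM_CAT):
        return False
    return True
```

The display-order condition on the full stops still needs an
actual run of the bidi algorithm; this sketch covers only the
part that can be tested on the input string.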

Michel's formulation goes further, in that it forbids
mixing of RCat and LCat characters in the label. If
you do that, then you automatically get well-behaved
labels between periods, and you don't have to specify
the need for applying the bidi algorithm and checking
for the display order of the periods. So that is another
of the reasons why I favor it.
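
Only the "no mixing" part of that formulation is described
here, so the following is a rough sketch of that part alone,
again assuming LCat means Bidi_Class L and RCat means
Bidi_Class R or AL. The function name is made up for the
example.

```python
import unicodedata


def mixes_directions(label: str) -> bool:
    """Return True if the label contains both LCat and RCat characters."""
    classes = {unicodedata.bidirectional(ch) for ch in label}
    has_l = "L" in classes
    has_r = bool(classes & {"R", "AL"})
    return has_l and has_r
```

A label for which this returns True would be rejected under
the no-mixing rule; digits and other neutral or weak characters
are not considered by this sketch.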

*******************************************************

Now on to the issue of Mark's formulation of the
test.

What you are proposing currently in section 4.1 of bidi-02.txt
amounts to the following:

Construct the string: ".LABEL."

Run the bidi algorithm on ".LABEL." with paragraph embedding
direction = L, resulting in output display D1.

Run the bidi algorithm on ".LABEL." with paragraph embedding
direction = R, resulting in output display D2.

Check D1 and D2. As long as both follow the pattern {.S.},
where "S" is any permutation of the display order for
the characters in "LABEL" resulting from the bidi
algorithm, but with the two "."s staying firmly at both
ends, then the "LABEL" is allowed.
(I know Section 4.1 as currently worded requires somewhat
more, but see the above analysis about that.)
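
As a sketch, the D1/D2 check can be written against a
pluggable bidi implementation. The `reorder` callable here is
hypothetical: it is assumed to take the logical-order text and
a paragraph embedding direction ("L" or "R") and to return the
logical indices of the characters in left-to-right display
order. Python's standard library does not provide one, so it
has to be supplied by the caller.

```python
from typing import Callable, List

# Hypothetical interface: (text, paragraph direction) -> logical
# indices of the characters, listed in left-to-right display order.
Reorder = Callable[[str, str], List[int]]


def dots_stay_at_ends(label: str, reorder: Reorder) -> bool:
    """Check that both displays of ".LABEL." follow the pattern {.S.}."""
    text = "." + label + "."
    dot_indices = {0, len(text) - 1}
    for direction in ("L", "R"):
        display = reorder(text, direction)
        # The two display ends must be occupied by the two full stops.
        if {display[0], display[-1]} != dot_indices:
            return False
    return True
```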

What Mark did was just instantly generalize the test.
First generalize the context around the dots:

Construct the string: "x.LABEL.y"

Then iterate the bidi algorithm through x's and y's with
all possible relevant bidi class values (i.e., you
can exclude the embedding and override controls as
irrelevant to this test). x's with
Bidi_Class=R or AL and Bidi_Class=L will force the paragraph
embedding direction to R and L, respectively, so those
cases are tested explicitly. x's with other Bidi_Class
values require having an externally determined
paragraph embedding direction, so for generality, you simply
run the bidi algorithm twice, once with each possible
paragraph embedding direction, to check all results.

Now generalize the test again by generalizing the dots
themselves to encompass all other possible delimiters
(and even cases that wouldn't ordinarily be considered
label delimiters):

Construct the string "xdLABELcy"

Then iterate the bidi algorithm through x's and y's with
all possible relevant bidi class values, and through
d's and c's with all possible relevant bidi class values.
*Among* those d's and c's will be the bc=CS relevant
to testing explicitly with U+002E FULL STOP, bc=ON
relevant to testing explicitly with U+0040 COMMERCIAL AT,
and so on for any possible label delimiter you might
be concerned about for testing.

And the test outcome you are looking for is that, for
any given LABEL string, for all bidi class values of
x's and y's and the label delimiters d's and c's, and
for both paragraph embedding directions, no part of
the "LABEL" transgresses outside the delimiters "d"
and "c" in the reordered display output.

In other words, the allowable output patterns you are
looking for would be of the type: "xdScy" or "ycSdx",
where the x's and y's stay strictly outside the c's
and d's, and the S stands for the content of the "LABEL",
and stays strictly inside the c's and d's (but of course
may have its own display order rearranged).
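
The generalized test can be sketched the same way. This is an
illustration under assumptions, not a definitive test harness:
the `reorder` callable is a hypothetical stand-in for a real
bidi implementation (text and paragraph direction in, logical
indices in left-to-right display order out), and the character
lists are just one representative per bidi class picked for
the example, not an exhaustive set.

```python
from itertools import product
from typing import Callable, List

# Hypothetical interface: (text, paragraph direction) -> logical
# indices of the characters, listed in left-to-right display order.
Reorder = Callable[[str, str], List[int]]

# Illustrative representatives for x and y: L, R, AL, EN, AN, WS, ON.
CONTEXT_CHARS = ["a", "\u05d0", "\u0627", "1", "\u0661", " ", "!"]
# Illustrative representatives for d and c: CS, ON, CS, ES, CS.
DELIMITER_CHARS = [".", "@", "/", "-", ":"]


def label_stays_inside(label: str, reorder: Reorder) -> bool:
    """Check that every display matches "xdScy" or "ycSdx"."""
    n = len(label)
    for x, d, c, y in product(CONTEXT_CHARS, DELIMITER_CHARS,
                              DELIMITER_CHARS, CONTEXT_CHARS):
        text = x + d + label + c + y
        # Logical indices: x=0, d=1, label=2..n+1, c=n+2, y=n+3.
        for direction in ("L", "R"):
            display = reorder(text, direction)
            forward = (display[:2] == [0, 1]
                       and display[-2:] == [n + 2, n + 3])
            backward = (display[:2] == [n + 3, n + 2]
                        and display[-2:] == [1, 0])
            # With the four outer characters pinned, the middle can
            # only be a rearrangement of the label itself.
            if not (forward or backward):
                return False
    return True
```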

There, is that clear now?

--Ken