Comments on IDNA Bidi

Thu Jan 17 02:54:16 CET 2008

Harald,

> As far as I can tell, nobody tested the interaction between AN (Arabic
> Numeral) and domain name delimiters. They break badly - and in labels
> without any AL characters around, the rules in IDNA2003 simply don't
> come into play.
> 
> Neither do I believe that anyone seriously looked at the interactions
> with ON before suggesting that "middle dot" be allowed in domain names,
> or even ES (European Separator), of which the hyphen-minus (-) is the
> prime example of a character used in domain names.
> 
> It took me, a complete neophyte to the art of BIDI programming,
> approximately 4 days of work to come up with code to generate what I
> believe are the problematic cases. If those who have worked with BIDI
> for years had reviewed the 2003 specification with an eye to finding the
> corner cases that might cause problems, we might not have had a current
> problem.
> 
> Some concrete examples:
> 
> A label consisting of "L ET" breaks apart in an RTL context when
> embedded as L CS L ET CS.

Or, to make this visually more apparent, start with:

R.A.B#.r

where "R" and "r" stand for bc=R characters (hence strong types) and
"A" and "B" for bc=L characters, and set the paragraph embedding
direction to RTL. If you run the bidi algorithm on that,
the middle full stop (bc=CS) will resolve to L, because it is
between two L characters, but the # and the other two full stops
will resolve to R, because they are between an R on one side and
an L on the other, and they then take the paragraph embedding
direction, i.e. R.

So then "A.B" is taken as the LLL run that needs reversal, and
you end up with:

first reversal  --> R.B.A#.r
second reversal --> r.#A.B.R

And Harald's point is well taken: the "#" has hopped over the
labels, and is now with the "A" instead of with the "B".
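
For anyone who wants to replay this by hand, here is a minimal
sketch in Python of just the pieces of the bidi algorithm used
above: neutral resolution and run reversal. It is not a full
implementation. It assumes the ASCII stand-in convention of this
example ("R" and "r" are bc=R, other letters are bc=L), and it
lumps CS, ET and ES together as neutrals, which only matches the
real algorithm when no digits are present. The function names
are made up for illustration.

```python
def classify(ch):
    # Stand-in convention (an assumption of this sketch): "R"/"r" are
    # bc=R, other letters are bc=L, everything else is neutral.
    if ch in "Rr":
        return "R"
    return "L" if ch.isalpha() else "N"


def resolve_neutrals(types, base):
    # A run of neutrals takes the direction of its neighbours when they
    # agree, and the paragraph direction otherwise. Start and end of
    # text count as the paragraph direction.
    out = list(types)
    i, n = 0, len(types)
    while i < n:
        if types[i] != "N":
            i += 1
            continue
        j = i
        while j < n and types[j] == "N":
            j += 1
        before = types[i - 1] if i > 0 else base
        after = types[j] if j < n else base
        out[i:j] = [before if before == after else base] * (j - i)
        i = j
    return out


def reverse_runs(pairs):
    # Reverse each maximal run at or above a level, starting with the
    # highest level and working down to level 1.
    top = max((lvl for lvl, _ in pairs), default=0)
    for level in range(top, 0, -1):
        i, n = 0, len(pairs)
        while i < n:
            if pairs[i][0] < level:
                i += 1
                continue
            j = i
            while j < n and pairs[j][0] >= level:
                j += 1
            pairs[i:j] = pairs[i:j][::-1]
            i = j
    return "".join(ch for _, ch in pairs)


def display(text, base="R"):
    # base is the paragraph embedding direction, "R" or "L".
    dirs = resolve_neutrals([classify(c) for c in text], base)
    if base == "R":
        levels = [2 if d == "L" else 1 for d in dirs]
    else:
        levels = [1 if d == "R" else 0 for d in dirs]
    return reverse_runs(list(zip(levels, text)))


print(display("R.A.B#.r"))  # r.#A.B.R
```

The same call with base="L" leaves the string in its logical
order, because every neutral then resolves to L.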

Further things to note:

1. This isn't a problem if the left-to-right labels are
ASCII labels, because bc=ET characters (most notably, "#", "%",
and "$") aren't allowed in domain names, anyway.

2. This *is* a problem with left-to-right IDNA2003 labels,
because IDNA2003 didn't sufficiently constrain what was allowed
in labels, and so you get these bizarre cases.

3. This would *not* be a problem with IDNAbis labels, if
the inclusion table is properly defined, because it would
not allow any bc=ET character in a label, anyway.

bc=ES is another issue, because as Harald noted, that
is the bidi class of "-", which is certainly allowed in
labels in all 3 cases. In the above example, it would
behave just like the "#", so from:

R.A.B-.r

you would get: r.-A.B.R

with the same problem. Note, however, that this should only
be a problem with labels that terminate with these characters,
which isn't where people would normally put a "-". If you
had, instead:

R.A.B-C.r

the "-" resolves to L, instead of R, because it is between
two bc=L characters. So for that you get:

first reversal  --> R.C-B.A.r
second reversal --> r.A.B-C.R

which is fine.

> A label consisting of "EN AN" breaks apart in an LTR context when
> embedded as CS EN AN CS R.

A.4o.R  (where I am using "o" here as the ASCII stand-in for Arabic 5)

In this case, the first full stop and the "4" end up L, because
the "4" (bc=EN) gets its direction by looking back to the "A".
But the "o" (bc=AN) stays AN and acts as an R for determining
the direction of the second full stop. So you end up with
"o.R" as the RRR string needing reversal. So you get:

first reversal --> A.4R.o

And the "R" from the third label and the Arabic number "o" from
the 2nd label have hopped across between labels.
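
The numeric case can be replayed with a similar rough sketch in
Python. It handles only an LTR paragraph and only the steps
described above: a European digit taking L from a preceding
strong L, any remaining number acting as R when neutrals are
resolved, and numbers sitting one level deeper than R text for
reversal. The other weak-type rules are left out. The stand-ins
are assumptions of the sketch: "R" is bc=R, "o" is bc=AN, ASCII
digits are bc=EN, other letters are bc=L. The function name is
hypothetical.

```python
def classify(ch):
    # Stand-ins (assumptions of this sketch, following the example).
    if ch == "o":
        return "AN"  # stand-in for an Arabic-Indic digit
    if ch.isdigit():
        return "EN"
    if ch == "R":
        return "R"
    return "L" if ch.isalpha() else "N"


def display_ltr(text):
    types = [classify(c) for c in text]
    n = len(types)

    # A European number whose nearest preceding strong type is L
    # (or the start of the LTR paragraph) becomes L.
    last_strong = "L"
    for i, t in enumerate(types):
        if t in ("L", "R"):
            last_strong = t
        elif t == "EN" and last_strong == "L":
            types[i] = "L"

    # Neutrals: numbers that are still EN or AN count as R context.
    def side(t):
        return "R" if t in ("R", "AN", "EN") else "L"

    resolved = list(types)
    i = 0
    while i < n:
        if types[i] != "N":
            i += 1
            continue
        j = i
        while j < n and types[j] == "N":
            j += 1
        before = side(types[i - 1]) if i > 0 else "L"
        after = side(types[j]) if j < n else "L"
        resolved[i:j] = [before if before == after else "L"] * (j - i)
        i = j

    # Levels in an LTR paragraph: L stays at 0, R goes to 1,
    # numbers go to 2.
    level_of = {"L": 0, "R": 1, "AN": 2, "EN": 2}
    pairs = [(level_of[t], c) for t, c in zip(resolved, text)]

    # Reverse runs from the highest level down to level 1.
    top = max((lvl for lvl, _ in pairs), default=0)
    for level in range(top, 0, -1):
        i = 0
        while i < n:
            if pairs[i][0] < level:
                i += 1
                continue
            j = i
            while j < n and pairs[j][0] >= level:
                j += 1
            pairs[i:j] = pairs[i:j][::-1]
            i = j
    return "".join(c for _, c in pairs)


print(display_ltr("A.4o.R"))  # A.4R.o
```

Dropping the "o" (that is, "A.4.R") leaves the display order
unchanged in this sketch, which is the contrast that matters
here.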

> I found eleven examples of 2-letter strings that would be permitted
> under the rules you proposed, but break apart when displayed either in
> an LTR context, an RTL context, or both. For 3-letter strings, I have
> 125 examples.

This could be constrained further for IDNAbis by noting the
interaction with the inclusion table.

Of course that wouldn't fix the bidi interoperability problem
for IDNA2003, but the point here is to fix it for the
future for IDNAbis, right?

So prohibition of bc=ET and bc=CS should be absolute in labels, I think.
That pares the numbers down.

That still leaves the unavoidable bc=ES (which should, however,
be simply the single character "-" and no others), bc=ON
(for MIDDLE DOT, if nothing else, but perhaps also some other
permissible modifier letters -- a small list, but still
a problem), and the numbers.

Michel's constraints pare things down further. By forcing any
RTL label to start with RCat and terminate with RCat or
NSMCat, the RTL labels themselves in IDNAbis should be
well-behaved.

What it doesn't catch are problems with LTR labels in RTL
contexts, where an initial or a terminal character without
a strong L direction ends up getting resolved to R across
the neutral (bc=CS, ".") delimiter, because of a resolution
rule favoring R in a context where it is between an R on
one side and an L on the other.

And it doesn't catch the problems with labels starting or
ending with digits, either.

One possible way to further sharply pare down the problem
with numerals, however, would be to do what seems to be
obvious anyway, and add a prohibition of bc=AN Arabic
numerals in LCat labels. bc=AN isn't a strong direction
itself in the bidi algorithm, but when it gets to resolution
of neutral types, any remaining bc=AN acts as an R context,
anyway.

And if we aren't allowing mixing of strong R and strong L
types in labels for IDNAbis, it doesn't make sense to
be mixing Arabic numbers into LTR labels, anyway.

So I'd suggest trying those steps:

1. Prohibit bc=ET and bc=CS totally in labels.
2. Prohibit bc=AN in LCat labels, and only allow it in
   RCat labels, where it would then be further constrained,
   because bc=AN characters could not start or terminate
   those labels.

Then see how many specific patterns remain in your test,
given those constraints.
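
As a starting point for that kind of test, the two steps plus
Michel's start/end constraint could be sketched as a label
filter like the one below. This is only an illustration, not a
proposed definition. It assumes RCat means bc=R or bc=AL, that
NSMCat means bc=NSM, and that any label with no R or AL
character counts as an LCat label. It does not consult the
inclusion table at all. check_label is a made-up name; the bidi
classes come from Python's unicodedata module, so they reflect
whatever Unicode version that Python build ships with.

```python
import unicodedata


def check_label(label):
    """Return a list of violations; an empty list means none found."""
    classes = [unicodedata.bidirectional(c) for c in label]
    problems = []

    # Step 1: no bc=ET or bc=CS anywhere in a label.
    if any(bc in ("ET", "CS") for bc in classes):
        problems.append("contains bc=ET or bc=CS")

    # Assumption: a label is RTL if it has any R or AL character.
    is_rtl = any(bc in ("R", "AL") for bc in classes)
    if is_rtl:
        # Michel's constraint: start with RCat, end with RCat or NSMCat.
        # This also keeps bc=AN from starting or ending the label.
        if classes[0] not in ("R", "AL"):
            problems.append("RTL label does not start with R/AL")
        if classes[-1] not in ("R", "AL", "NSM"):
            problems.append("RTL label does not end with R/AL/NSM")
    elif "AN" in classes:
        # Step 2: no bc=AN in an LCat label.
        problems.append("bc=AN in a label with no R/AL character")

    return problems


print(check_label("ab#"))      # "#" is bc=ET
print(check_label("a\u0665"))  # U+0665 is bc=AN
print(check_label("a-b"))      # "-" is bc=ES, so nothing is reported
```

Running candidate strings through a filter like this before the
display test should show how many problem patterns survive.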

--Ken