Comments on IDNA Bidi
Harald Alvestrand
harald at alvestrand.no
Tue Jan 15 07:50:37 CET 2008
Kenneth Whistler wrote:
> Harald,
>
> Now focussing on the section 4.1, Alternative approach
> in bidi-02.txt, I'll try to explain what Mark's reaction
> about the test was.
Thanks - this is almost exactly the generative test that I wrote after
reading your earlier message.
The text currently in my draft copy is below.
Note that I don't think we can include ET in the delimiter set; it makes
the string "A1" break up in an RTL context.
Harald
3. An expanded justification for the bidi rule
One issue with RFC 3454 was that it did not give an explicit
justification for the bidi rule, so it was hard to tell whether a
modified rule would continue to fulfil the purpose for which the RFC
3454 rule was written.
This document proposes an explicit justification, by stating a set of
requirements for which we think it is possible to test whether or not
the modified rule fulfils the requirement.
The justification proposed is this:
o No two labels, when presented in visual order, should have the
same sequence of characters without also having the same sequence
of characters in network order. (This is the criterion that is
explicit in RFC 3454).
o In a visual presentation of a string of labels, the characters of
each label should remain grouped between the characters delimiting
the label components.
o These properties should hold true both when the string is embedded
in a paragraph with LTR direction and when it's embedded in a
paragraph with RTL direction, as long as explicit directional
controls are not used within the same paragraph.
Several stronger statements were considered and rejected, because
they seem to be impossible to fulfil within the constraints of the
Unicode bidirectional algorithm. These include:
o The appearance of a label should be unaffected by its embedding
context. This proved impossible even for ASCII labels; the label
"123-456" will have a different display order in an RTL context
than in an LTR context.
o The sequence of labels should be consistent with network order.
This proved impossible - a domain name consisting of the labels
(in network order) L1.R1.R2.L2 will be displayed as L1.R2.R1.L2 in
an LTR context.
o The "remain grouped" property should remain true when directional
controls (LRE, RLE, RLO, LRO, PDF) are used in the same paragraph
(outside of the labels). Because these controls affect
presentation order in non-obvious ways, by affecting the "sor" and
"eor" properties of the Unicode BIDI algorithm, the conditions
above would be very hard to satisfy for a useful set of strings
if this were required.
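The label-order example in the second rejected statement above
(L1.R1.R2.L2) can be reproduced with a small sketch. This is not an
implementation of the Unicode bidirectional algorithm; it is a
label-level model that assumes every label is purely LTR or purely RTL
(no digits) and that the separating dots simply resolve as neutrals.
The function name and the (name, direction) input format are made up
for illustration.

```python
from itertools import groupby


def visual_label_order(labels, paragraph_dir):
    """Return label names in left-to-right display order.

    labels: list of (name, direction) pairs in network order, where
            direction is "L" or "R" (assumption: each label has one
            strong direction and contains no digits).
    paragraph_dir: "L" or "R", the direction of the embedding paragraph.
    """
    # In an RTL paragraph the sequence as a whole runs right to left.
    ordered = list(labels) if paragraph_dir == "L" else list(reversed(labels))
    result = []
    # Maximal runs of labels that go against the paragraph direction
    # are reversed relative to the sequence around them.
    for direction, run in groupby(ordered, key=lambda item: item[1]):
        run = list(run)
        if direction != paragraph_dir:
            run.reverse()
        result.extend(name for name, _ in run)
    return result


name = [("L1", "L"), ("R1", "R"), ("R2", "R"), ("L2", "L")]
print(".".join(visual_label_order(name, "L")))
print(".".join(visual_label_order(name, "R")))
```

Under these assumptions the LTR case gives L1.R2.R1.L2, matching the
example in the text; the RTL case gives a third ordering, which
illustrates why consistency with network order was dropped as a
requirement.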
The "remain grouped" property can be more formally stated as:
o Let "Delimiterchars" be the set of characters with the Unicode
BIDI properties ET, CS, WS, EN (note that EN may also be present
in a label; both HYPHEN-MINUS and the @ sign have this bidi
property)
o Let "Position" be the position of a character in a string (in
network order)
o Let "Bidi position" be the position computed by the Unicode Bidi
algorithm
In the paragraph containing a string formed from the substrings A B L
C D, where A and D are (possibly zero-length) legal labels, and B and
C are single "Delimiterchars", the label L is a legal label if, for
all A, B, C and D, the bidi position of all characters in L is within
the range of positions for the characters of L in the string, for
both the LTR and RTL paragraph direction.
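One way to read the test above as code: given the display position of
every character (as computed by some bidi implementation, which is not
shown here), check that the characters of L occupy a contiguous block
lying between the two delimiters B and C. This is a sketch of that
reading only; the function name and the "visual_pos" list are
hypothetical, and the check would have to be repeated for both
paragraph directions and for all A, B, C and D.

```python
def remains_grouped(visual_pos, label_start, label_end):
    """Check the "remain grouped" property for one label.

    visual_pos[i] is the display position of the character at network
    position i (assumed to come from a bidi implementation).
    The label occupies network positions [label_start, label_end); the
    delimiters B and C sit at label_start - 1 and label_end.
    """
    label_positions = sorted(visual_pos[label_start:label_end])
    first, last = label_positions[0], label_positions[-1]
    # The label's characters must form one unbroken block on display.
    contiguous = label_positions == list(range(first, last + 1))
    # Both delimiters must stay outside that block, one on each side.
    low, high = sorted((visual_pos[label_start - 1], visual_pos[label_end]))
    return contiguous and low < first and last < high


# "ab.cd.ef" displayed unchanged: the label "cd" stays between its dots.
print(remains_grouped([0, 1, 2, 3, 4, 5, 6, 7], 3, 5))
# A display order where the trailing dot lands before "cd".
print(remains_grouped([0, 1, 2, 5, 4, 3, 6, 7], 3, 5))
```

The first call reports the property as holding; the second does not,
because a delimiter has moved to the wrong side of the label.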
The "No two labels" property can be formally stated as:
If two labels L and L', embedded as for the test above, are
rearranged into the same sequence of codepoints, neither L nor L' is
a legal label.
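The "No two labels" test can be sketched in the same spirit. The
"render" argument below stands in for whatever produces the display
sequence of an embedded label; it is a hypothetical callable, and the
lookup table in the example is hand-made data for illustration, not
output of the bidi algorithm.

```python
from collections import defaultdict


def colliding_labels(labels, render):
    """Return the set of labels that share a display sequence.

    render(label) is assumed to return the sequence of code points as
    displayed; any two labels with the same rendering are both rejected.
    """
    by_display = defaultdict(set)
    for label in labels:
        by_display[render(label)].add(label)
    rejected = set()
    for group in by_display.values():
        if len(group) > 1:
            rejected |= group
    return rejected


# Hand-made stand-in for a renderer: two labels end up looking alike.
toy_render = {"ab1": "ab1", "a1b": "ab1", "xyz": "xyz"}.get
print(sorted(colliding_labels(["ab1", "a1b", "xyz"], toy_render)))
```

In a real test the check would be run once per paragraph direction,
since the property is meant to hold in both.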