Comments on IDNA Bidi

Tue Jan 15 07:50:37 CET 2008

Kenneth Whistler wrote:
> Harald,
>
> Now focussing on the section 4.1, Alternative approach
> in bidi-02.txt, I'll try to explain what Mark's reaction
> about the test was.
Thanks - this is almost exactly the generative test that I wrote after
reading your earlier message.

The text currently in my draft copy is below.

Note that I don't think we can include ET in the delimiter set; it makes
the string "A1" break up in an RTL context.

                           Harald

3.  An expanded justification for the bidi rule

   One issue with RFC 3454 was that it did not give an explicit
   justification for the bidi rule, so it was hard to tell whether a
   modified rule would continue to fulfil the purpose for which the RFC
   3454 rule was written.

   This document proposes an explicit justification by stating a set of
   requirements, each of which we believe can be tested against a
   modified rule.

   The justification proposed is this:

   o  No two labels, when presented in visual order, should have the
      same sequence of characters without also having the same sequence
      of characters in network order.  (This is the criterion that is
      explicit in RFC 3454.)

   o  In a visual presentation of a string of labels, the characters of
      each label should remain grouped between the characters delimiting
      the label components.

   o  These properties should hold true both when the string is embedded
      in a paragraph with LTR direction and when it's embedded in a
      paragraph with RTL direction, as long as explicit directional
      controls are not used within the same paragraph.

   Several stronger statements were considered and rejected, because
   they seem to be impossible to fulfil within the constraints of the
   Unicode bidirectional algorithm.  These include:

   o  The appearance of a label should be unaffected by its embedding
      context.  This proved impossible even for ASCII labels; the label
      "123-456" will have a different display order in an RTL context
      than in an LTR context.

   o  The sequence of labels should be consistent with network order.
      This proved impossible - a domain name consisting of the labels
      (in network order) L1.R1.R2.L2 will be displayed as L1.R2.R1.L2 in
      an LTR context.

   o  The "remain grouped" property should remain true when directional
      controls (LRE, RLE, RLO, LRO, PDF) are used in the same paragraph
      (outside of the labels).  Because these controls affect
      presentation order in non-obvious ways, by affecting the "sor" and
      "eor" properties of the Unicode BIDI algorithm, the conditions
      above would be very hard to satisfy for a useful set of strings
      if this were required.
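
   The label-order example in the second item above can be illustrated
   with a deliberately simplified model.  The sketch below is not the
   Unicode bidirectional algorithm: it treats each label as an
   indivisible unit whose direction is already known, assumes each
   separator takes the direction of its neighbours or of the paragraph,
   and reproduces only the reordering of whole labels.  The function
   name display_label_order and the "L"/"R" direction tags are
   assumptions made for this illustration.

```python
def display_label_order(labels, paragraph_dir):
    """Toy model of label reordering (NOT the full Unicode bidi algorithm).

    labels: list of (name, direction) pairs in network order, where
            direction is "L" or "R" for the label as a whole.
    paragraph_dir: "L" or "R".
    Returns the label names in left-to-right display order.
    """
    if paragraph_dir not in ("L", "R"):
        raise ValueError("paragraph_dir must be 'L' or 'R'")

    ordered = []   # labels counted from the paragraph's starting edge
    run = []       # current run of labels opposing the paragraph direction
    for name, direction in labels:
        if direction == paragraph_dir:
            # A run of opposite-direction labels is displayed reversed.
            ordered.extend(reversed(run))
            run = []
            ordered.append(name)
        else:
            run.append(name)
    ordered.extend(reversed(run))

    # In an RTL paragraph the starting edge is on the right, so flip the
    # list to report it left to right.
    if paragraph_dir == "R":
        ordered.reverse()
    return ordered


name = [("L1", "L"), ("R1", "R"), ("R2", "R"), ("L2", "L")]
print(".".join(display_label_order(name, "L")))
```

   Under these assumptions the LTR case gives the L1.R2.R1.L2 order
   quoted above.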

   The "remain grouped" property can be more formally stated as:

   o  Let "Delimiterchars" be the set of characters with the Unicode
      BIDI properties ET, CS, WS, EN (note that EN may also be present
      in a label; both HYPHEN-MINUS and the @ sign have this bidi
      property)

   o  Let "Position" be the position of a character in a string (in
      network order)

   o  Let "Bidi position" be the position computed by the Unicode Bidi
      algorithm
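
   Whether a given character falls into "Delimiterchars" depends on its
   bidi class in the Unicode Character Database.  As a minimal sketch,
   the classes can be inspected with Python's standard unicodedata
   module; the results reflect the Unicode version bundled with the
   interpreter, which may differ from the version a draft targets, and
   the helper name bidi_classes is made up for this illustration.

```python
import unicodedata


def bidi_classes(text):
    """Return (character, bidi class) pairs for each character in text."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


# FULL STOP, HYPHEN-MINUS, COMMERCIAL AT, a digit, a Latin letter,
# SPACE, and HEBREW LETTER ALEF.
for ch, cls in bidi_classes(".-@1A \u05d0"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {cls}")
```

   With recent Unicode data this reports ES for HYPHEN-MINUS and ON for
   COMMERCIAL AT, so the parenthetical in the first item above is worth
   re-checking against the Unicode version in use.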

   In the paragraph containing a string formed from the substrings A B L
   C D, where A and D are (possibly zero-length) legal labels, and B and
   C are single "Delimiterchars", the label L is a legal label if, for
   all A, B, C and D, the bidi position of all characters in L is within
   the range of positions for the characters of L in the string, for
   both the LTR and RTL paragraph direction.
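
   The test above can be sketched in code for a single choice of A, B,
   C and D.  This is one possible reading, not a definitive
   implementation: it assumes bidi positions are counted from the
   paragraph's starting edge so that they are comparable with network
   positions, and it does not implement the Unicode Bidi algorithm.
   The caller must supply it as bidi_positions, a hypothetical callable
   taking the string and the paragraph direction and returning the bidi
   position of each character.

```python
def remains_grouped(bidi_pos, start, end):
    """Check that every character of L stays inside L's own range.

    bidi_pos[i] is the bidi position of the character at network
    position i; L occupies network positions start .. end - 1.
    """
    return all(start <= bidi_pos[i] < end for i in range(start, end))


def label_passes(a, b, label, c, d, bidi_positions):
    """Apply the test to the string A B L C D in both paragraph directions.

    bidi_positions(text, paragraph_dir) is an assumed, caller-supplied
    implementation of the Unicode Bidi algorithm.
    """
    text = a + b + label + c + d
    start = len(a) + len(b)
    end = start + len(label)
    return all(
        remains_grouped(bidi_positions(text, paragraph_dir), start, end)
        for paragraph_dir in ("L", "R")
    )
```

   The quantifier "for all A, B, C and D" is not covered here; a caller
   would have to loop over the neighbouring labels and delimiters it
   wants to test.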

   The "No two labels" property can be formally stated as:

   If two distinct labels L and L', embedded as for the test above, are
   rearranged into the same sequence of codepoints, neither L nor L' is
   a legal label.
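
   A check for this property over a finite set of candidate labels
   might look like the sketch below.  It assumes a caller-supplied
   function visual_form (a hypothetical name) that returns the
   displayed sequence of codepoints for a label embedded as in the test
   above; to cover both paragraph directions, that function could
   return a pair of sequences.

```python
from collections import defaultdict


def colliding_labels(labels, visual_form):
    """Return every label that shares its displayed form with another label.

    visual_form(label) must return a hashable value (for example a str
    or a tuple) describing how the label is displayed.
    """
    groups = defaultdict(set)
    for label in labels:
        groups[visual_form(label)].add(label)
    return {
        label
        for group in groups.values()
        if len(group) > 1
        for label in group
    }
```

   All labels in the returned set would be rejected, which matches the
   "neither L nor L'" wording above.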