Comments on IDNA Bidi

Harald Alvestrand harald at alvestrand.no
Thu Jan 10 08:38:55 CET 2008


Mark Davis wrote:
> I've collected together comments on the four documents, and tried to
> organize them for reference. Here is the first set.
Apologies for not responding to this earlier.... I'm afraid I lost track
just before Christmas, and didn't find it on the return.

Deleting comments where I have no comments....
>
> Bidi-1.
>
>    Note that Unicode 5.0 is the current version of Unicode.  This fix
>    refers to Unicode 3.2 only, to maintain consistency with the rest of
>    RFC 3454.  Nothing here should affect the relationship between
>    Unicode versions and IDNA.
>
> But making it specific to U3.2 *does* tie it to a particular version.
> Is the intention for this to modify IDNA2003 before IDNAbis comes out?
> That doesn't seem to be the case for the rest of the documents. Better
> would be for it to refer to the version of Unicode used by IDNA
> (whatever version it is).
>
When it was originally written, it wasn't clear that IDNAbis would come
out, or when. Now that seems clearer, so the reference needs updating (and
-protocol and -issues need to be added to the reference list). Thanks!
>
>
> In the same vein, tying the comment to RFC 3454 is limiting, as the
> solution that the document is proposing is in the context of IDNA-bis,
> which does away with stringprep/nameprep. Overall the document should
> take a more generic view of the solution, not just a stringprep
> (RFC 3454) specific one.
>
>
> Bidi-2. 
>    The following conditions MUST be true in both resulting strings for
>    the string to be acceptable:
>    o  The leftmost and rightmost character of the resulting string in
>       display order must be a full stop (U+002E)
>   
>    o  No non-spacing mark (NSM) can occur in the second position of the
>       string (leftmost in L order, rightmost in R order); that is, no
>       mark can be allowed to attach to the delimiting characters.
>   
>    o  The direction of the leftmost and rightmost characters in the
>       string (the periods) must be either L or R
>
> The NSM condition should be part of the main IDNA conditions, not here.
>
Agreed. It was placed here because it was another 3454 oversight that
needs correcting.
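
A minimal sketch of the NSM condition quoted above, assuming it reduces to
"the first character of a label in network order must not have bidi class
NSM". The function name is made up for illustration, and the bidi classes
come from whatever Unicode version the Python runtime ships, not
specifically 3.2 or 5.0.

```python
import unicodedata


def starts_with_nsm(label):
    """Return True if the first character (network order) is a non-spacing mark.

    Such a mark would attach to the delimiting full stop in front of the label.
    """
    return bool(label) and unicodedata.bidirectional(label[0]) == "NSM"


# U+05B7 HEBREW POINT PATAH is an NSM; U+05D0 HEBREW LETTER ALEF is not.
print(starts_with_nsm("\u05B7\u05D0"))  # True: the mark comes first
print(starts_with_nsm("\u05D0\u05B7"))  # False: the mark follows a letter
```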
>
>
> Bidi-2a.
>
>
> If you really want a test, it would be something like the following:
>
>
>    1. At build time, produce a test set T of characters, one from each
>       of the BIDI classes where a character can be in IDNA (e.g.
>       excluding B, LRE/O, RLE/O, and PDF). That is, roughly 14 characters.
>    2. To test a given prospective label L, perform the following over
>       all possible 2-character strings X and Y from T. (That is, this
>       would be 14^4 iterations.)
>    3. Create the string S formed from: X + L + Y
>    4. Apply the BIDI algorithm to S twice, once with a RTL and once
>       with a LTR paragraph direction.
>    5. If in the result any of the characters in the label are
>       separated by a character from X or Y, the test fails.
>
>
> However, this should really not be proposed as something that users of
> IDNA should do. Instead, it should be used to test that Michel's
> formulation is correct.
Exactly - I want to test the algorithm before proposing one. However, I
don't understand what you wrote above:

- if taken as written, it would test the string "A1" by embedding it
between the strings "ALEPH BET" and "GIMEL DALET", which certainly would
cause the test to fail (the "1" would pick up its directionality from
the surrounding RTL characters, and the whole thing would likely display
in the order of "1 DALET GIMEL A BET ALEPH"; I don't have my direction
calculator with me). So I'm assuming you're thinking of some separators
- which ones?

- what do you mean exactly by "with a RTL paragraph direction"? In
particular, which of the 3 direction parameters "sor", "eor" and
"embedding direction", which are input to the bidi algorithm, should be
RTL, and should they all be locked to the same value, or should we also
test mixtures of the 3?

More details, please...
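
For concreteness, here is a rough sketch of the loop in steps 1-5 as quoted
above. It is only one reading of the procedure, with several assumptions:
the test set is built from the bidi classes known to the Python runtime
(not Unicode 3.2 or 5.0), the excluded classes also cover the newer isolate
controls, and `reorder` is a hypothetical callback standing in for a real
bidi implementation. It takes a single "L"/"R" paragraph direction, so it
does not answer the sor/eor/embedding question, and it adds no separators
between X, the label and Y.

```python
import itertools
import unicodedata

# Assumption: classes that cannot occur in a label (explicit formatting
# controls and paragraph separators); the isolates postdate Unicode 5.0.
EXCLUDED = {"B", "LRE", "LRO", "RLE", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}


def build_test_set(limit=0x3000):
    """Step 1: pick one representative character per remaining bidi class."""
    chosen = {}
    for cp in range(limit):
        cls = unicodedata.bidirectional(chr(cp))
        if cls and cls not in EXCLUDED and cls not in chosen:
            chosen[cls] = chr(cp)
    return chosen


def label_survives(label, reorder, test_chars):
    """Steps 2-5: True if no context X + label + Y splits the label.

    `reorder(text, base)` is hypothetical: it must return the logical
    indices of `text` in visual order for paragraph direction `base`.
    """
    if not label:
        return True
    # All 2-character strings over the test set; X and Y range over these.
    pairs = ["".join(p) for p in itertools.product(test_chars, repeat=2)]
    for x, y in itertools.product(pairs, repeat=2):
        text = x + label + y
        inside = range(len(x), len(x) + len(label))
        for base in ("L", "R"):
            visual = reorder(text, base)
            # Visual positions occupied by the label's characters.
            slots = [pos for pos, idx in enumerate(visual) if idx in inside]
            if slots[-1] - slots[0] + 1 != len(label):
                return False  # a character from X or Y sits inside the label
    return True
```

A real run would pass `build_test_set().values()` as `test_chars` and a
conforming bidi implementation as `reorder`; with 13 or 14 test characters
that is on the order of 14^4 contexts per label, times two directions.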
>
> Bidi-3.
>
>    We believe that there is a clear likelihood of similar issues
>    existing with other scripts and languages that are not currently used
>    extensively with IDNs.  Careful consideration of all the languages
>    written in a given script, in consultation with all of the
>    corresponding speech communities, is therefore needed before we can
>    say with any degree of certainty that using that script for IDNs is
>    unproblematic.
>   
>
> This is not a bidi issue, and should be in a different document. (See
> other comments about "speech communities")
>
It's certainly a bidi issue too; as you know, one of the driving forces
for the clarification here is the problem of Yiddish written in the
Hebrew script. But now that this text is safely embedded in "issues",
and the decision is made to link this document to "issues", the need for
this text here is much lessened.
>
>
> Bidi-4.
>
>    Another set of issues concerns the proper display of IDNs with a
>    mixture of LTR and RTL labels, or only RTL labels; it is not clear to
>    these authors what the proper display order of the components of a
>    domain name is if the direction of the components (in network
>    order) is, for instance, FirstRTL.SecondRTL.LTR - is it
>    LTRtsriF.LTRdnoceS.LTR or LTRdnoceS.LTRtsriF.LTR?  Again, this memo
>    does not attempt to suggest a solution to this problem.
>   
>
> If the question is: what does the BIDI algorithm do in such cases, the
> answer is easy to determine. If the question is whether a user agent
> should display a URL in a different order than the BIDI algorithm, I
> think that's beyond the scope of this document. Note that any attempt
> to have it display differently requires all text processors to
> recognize URLs and handle them specially, with problems of
> interoperability and confusion when, inevitably, most of them fail. So
> recommending a non-standard display will probably do more harm than good.
>
I personally think that recommending a non-standard display is a
non-starter. We probably need to reformulate this paragraph as "the
result of the Unicode BIDI algorithm is LTRtsriF.LTRdnoceS.LTR, people
may be surprised by that, but we can't fix it". I'll have to test that
this is true in all cases before saying it in the document, though...
>
> Bidi-5.
>    One particular example of the last case is if a program chooses to
>    examine the last character (in network order) of a string in order to
>    determine its directionality, rather than its first; if it finds an
>    NSM character and tries to display the string as if it was a left-to-
>    right string, the resulting display may be interesting, but not
>    useful.
>
> I don't understand this paragraph. When and why would this happen with
> IDNA-conformant programs?
>
I think the text is clear enough - if you get a label "ALEF BET <some
NSM character>", an IDNA2003 program can look at the last character in
the string and say "this is not a RTL string", and treat it as if it was
LTR. In IDNA2003, that will be a safe assumption. In IDNAx, it will not
be a safe assumption.
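
To illustrate the case, a small sketch assuming a program that classifies a
label by the bidi class of its last character only. The function names are
invented for this example, and U+05B7 HEBREW POINT PATAH is just one
possible NSM to put at the end.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}


def last_char_says_rtl(label):
    """The shortcut: look only at the final character in network order."""
    return unicodedata.bidirectional(label[-1]) in RTL_CLASSES


def any_char_says_rtl(label):
    """A more careful check: does any character have an RTL class?"""
    return any(unicodedata.bidirectional(c) in RTL_CLASSES for c in label)


label = "\u05D0\u05D1\u05B7"  # ALEF, BET, then an NSM (HEBREW POINT PATAH)
print(last_char_says_rtl(label))  # False: the shortcut sees only the NSM
print(any_char_says_rtl(label))   # True: the label does contain RTL letters
```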

Suggestions for a clearer way to state it?

                         Harald


