Comments on bidi-04

Mon Mar 3 20:45:19 CET 2008

--On 3 March 2008 09:18 -0800 Mark Davis <mark.davis at icu-project.org> wrote:

> Comments below.

and responses inline....

>
> On Sat, Mar 1, 2008 at 5:10 PM, Mark Davis <mark.davis at icu-project.org>
> wrote:
>>
>> 1.2. Background and history
>
> The history is useful, but shouldn't be right up at the front. The
> majority of people reading this document won't care why it is the way it
> is, they will care what the spec says and how to use it.
>
> Sections 1.2, and all of 2, should be moved either to the Rationale
> document with the rest of the rationale for all the other parts, or at
> least moved to an appendix. Given the structure of the documents,
> Rationale would be better.

Moving it out of the bidi document destroys the self-containedness of the 
document.
My personal opinion is that anyone who wants to implement bidi SHOULD have 
to scan at least this much information at least once.

So unless there's a strong push for this move, I'll not make this change.

>
>>
>> The IDNA specification "Stringprep", [RFC3454] makes the following
>> statement in its section 6 on the bidi algorithm:
>>
> ...
>
>> The justification proposed is this:
>>
>>
>> o No two labels, when presented in display order, should have the
>> same sequence of characters without also having the same sequence
>> of characters in network order. (This is the criterion that is
>> explicit in RFC 3454).
>
> The above needs to be qualified, by adding something like "in the same
> bidi context". That is, as pointed out below, if you change the embedding
> context, 123-456 in one context may look like 456-123 in another; same
> for abc.ABC and ABC.abc.

Erik, did you verify this within one context, or between contexts?

>>
>> o In a display of a string of labels, the characters of each label
>> should remain grouped between the characters delimiting the
>> labels.
>
> You need to name both clauses, since you refer to them below. Something
> like:
>
> o Label Uniqueness Condition. No two labels, when presented in display
> order, should have the
> ...
> o Label Grouping Condition. In a display of a string of labels, the
> characters of each label

Either that, or "Condition A" and "Condition B".
> ...
>
>>
>> o These properties should hold true both when the string is embedded
>> in a paragraph with LTR direction and when it's embedded in a
>> paragraph with RTL direction, as long as explicit directional
>> controls are not used within the same paragraph.
>>
>> Several stronger statements were considered and rejected, because
>> they seem to be impossible to fulfil within the constraints of the
>> Unicode bidirectional algorithm. These include:
>>
>> o The appearance of a label should be unaffected by its embedding
>> context. This proved impossible even for ASCII labels; the label
>> "123-456" will have a different display order in an RTL context
>> than in an LTR context.
>>
>>
>>
>>
>> Alvestrand & Karp Expires August 17, 2008 [Page 7]
>>
>> Internet-Draft IDNA RTL fix Feb 2008
>>
>>
>>
>> o The sequence of labels should be consistent with network order.
>> This proved impossible - a domain name consisting of the labels
>> (in network order) L1.R1.R2.L2 will be displayed as L1.R2.R1.L2 in
>> an LTR context.
>>
>> o The "remain grouped" property should remain true when directional
>> controls (LRE, RLE, RLO, LRO, PDF) are used in the same paragraph
>> (outside of the labels). Because these controls affect
>> presentation order in non-obvious ways, by affecting the "sor" and
>> "eor" properties of the Unicode BIDI algorithm, the conditions
>> above would be very hard to satisfy for a useful set of strings
>> if this were true. As long as these controls have no influence
>> over the display of the domain name, no problem will be caused,
>> but the exact criterion for "will not influence" is hard to
>> codify.
>
> The above is too strong. We didn't actually attempt to see what
> difference these would make. Certainly they are a different bidi context.

I did the tests. I'll stand behind the statement.
If you have tests that say other things, present them, and we'll discuss.

> However, using overrides forces all characters to have the same
> directionality, so it is actually *easy* to meet the above criteria in
> that case, because having the same order will guarantee -- within *that*
> context -- both disambiguation and contiguity.

You did not read the statement the way I intended it. Having an LRO on one 
side of the label and a PDF on the other side would indeed force a single 
context, but would also offer a certain way of creating confusable labels - 
I would regard that as being used "on" the labels.

>
> So I'd suggest just rewording to say that that wasn't a goal.

Nope. It was an actual found problem; I started out thinking it should be a 
goal, and found it impossible to attain.

>> o The "no two labels display the same" should hold true between LTR
>> paragraphs and RTL paragraphs. This was shown to be unsound.
>
> The word "unsound" both here and a few lines down, is the wrong word. You
> should either use "untenable" or "impossible" (as you had above).

"Untenable" is a good word.
>
>>
>> o No two domain names should be displayed the same, even under
>> differing directionality. This was shown to be unsound, since the
>> domain name (network) ABC.abc will have display order CBA.abc in
>> an LTR context and abc.CBA in an RTL context, while the domain
>> name (network) abc.ABC will display as abc.CBA in an LTR context
>> and as CBA.abc in an RTL context.
>>
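
To make the ABC.abc example easy to replay, here is a minimal sketch in Python. It is NOT the Unicode bidi algorithm: it is a toy model that assumes uppercase letters are R, lowercase letters are L, and everything else is a neutral delimiter. Numbers, weak types and explicit controls are ignored. The function name display_order is made up for this mail.

```python
def display_order(text, rtl_paragraph):
    """Toy bidi reordering: uppercase = R, lowercase = L, other chars neutral.

    A simplification for illustration only; it ignores numbers, weak types
    and explicit directional controls.
    """
    base_dir = "R" if rtl_paragraph else "L"
    n = len(text)
    types = [("R" if ch.isupper() else "L") if ch.isalpha() else None
             for ch in text]

    # Resolve neutrals: take the surrounding direction when both sides agree,
    # otherwise the paragraph direction (paragraph edges count as base_dir).
    resolved = list(types)
    i = 0
    while i < n:
        if types[i] is not None:
            i += 1
            continue
        j = i
        while j < n and types[j] is None:
            j += 1
        before = types[i - 1] if i > 0 else base_dir
        after = types[j] if j < n else base_dir
        fill = before if before == after else base_dir
        for k in range(i, j):
            resolved[k] = fill
        i = j

    # Embedding levels: LTR paragraph gives L=0, R=1; RTL gives R=1, L=2.
    if rtl_paragraph:
        levels = [1 if d == "R" else 2 for d in resolved]
    else:
        levels = [0 if d == "L" else 1 for d in resolved]

    # Reverse every maximal run at each level or higher, highest level first.
    chars = list(text)
    for level in range(max(levels, default=0), 0, -1):
        i = 0
        while i < n:
            if levels[i] < level:
                i += 1
                continue
            j = i
            while j < n and levels[j] >= level:
                j += 1
            chars[i:j] = reversed(chars[i:j])
            i = j
    return "".join(chars)


for name in ("ABC.abc", "abc.ABC"):
    print(name, "LTR:", display_order(name, False),
          "RTL:", display_order(name, True))
# ABC.abc LTR: CBA.abc RTL: abc.CBA
# abc.ABC LTR: abc.CBA RTL: CBA.abc
```

Under these assumptions, ABC.abc in an LTR paragraph and abc.ABC in an RTL paragraph come out as the same display string, which is the collision the bullet describes.
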
>
> ...
>
>> The "remain grouped" property can be more formally stated as:
>>
>> o Let "Delimiter chars" be a set of characters with the Unicode BIDI
>> properties CS, WS, ON. (These are commonly used to delimit labels
>> - both the FULL STOP and the space are included.)
>>
>> * ET, though it commonly occurs next to domain names in practice,
>> is problematic: the context R CS L EN ET (for instance A.a1%)
>> makes the label L EN grow unstable.
>
> "grow unstable" should be "become unstable". However, you only define
> what "unstable" means below, so you would need to move the definition
> up. I'd actually prefer being more explicit, since "unstable" can have
> multiple meanings, and just use explicit phrasing like:
>
> makes the label L EN break the Label Grouping condition.

That makes sense.
>
>>
>> * ES commonly occurs in labels as HYPHEN-MINUS, but could also be
>> used as a delimiter (for instance, the plus sign). It is left
>> out here.
>>
>>
>> o Let "Position" be the position of a character in a string (in
>> network order)
>>
>> o Let "Bidi position" be the position computed by the Unicode Bidi
>> algorithm
>>
>> In a paragraph with an embedded string formed from the substrings A B
>> L C D, where A and D are (possibly zero-length) legal labels, and B
>> and C are single "Delimiter chars", the label L is a legal label if,
>> for all A, B, C and D, the bidi position of all characters in L is
>> within the range of positions for the characters of L in the string,
>> for both the LTR and RTL paragraph direction.
>
> This doesn't make sense to me. "be within the range of positions for the
> characters?"

I didn't find a more readable way of saying "they all stick together".

>
> Moreover, you can't say that that makes a label *legal*, since it might
> not be legal because of the other conditions in IDNA. The definition is
> also circular, since you can only define L to be legal if you know what
> makes A and D legal. And the discussion of a "paragraph" may be confusing
> to people -- what does a paragraph have to do with a label condition?
>
> I suggest something like:
>
> A label L satisfies the Label Grouping Condition when for any Delimiter
> Characters D1 and D2 and any other strings S1 and S2 (possibly of length
> zero):
>
>
> If the string formed by concatenating S1, D1, L, D2, S2 is subject to
> bidi reordering,
> then all of the characters of L in the reordered string are between D1
> and D2.
>
>   - The bidi reordering of S1, D1, L, D2, S2 may result in D2 coming
> before D1
>   - Because S1 is any string, the bidi algorithm may set the paragraph
> direction for the string to either Right-To-Left or Left-To-Right; thus
> the reordering condition has to work in both bidi contexts.

that indeed sounds better.

S1 isn't allowed to be "any string". It's constrained to be a valid label.
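
As a concrete reading of that wording, here is a minimal Python sketch of the "between D1 and D2" test. It assumes the reordering is already available as a list giving, for each display position, the logical (network-order) index of the character shown there. The function name and the sample permutations are made up for illustration; the third one is purely hypothetical and not the output of any real bidi run.

```python
def satisfies_grouping(display_to_logical, label_start, label_end):
    """Check that the label's characters are displayed between D1 and D2.

    display_to_logical[p] is the logical index of the character at display
    position p. The label occupies logical indices label_start..label_end-1;
    D1 sits at label_start-1 and D2 at label_end.
    """
    shown_at = {logical: pos for pos, logical in enumerate(display_to_logical)}
    d1 = shown_at[label_start - 1]
    d2 = shown_at[label_end]
    lo, hi = min(d1, d2), max(d1, d2)  # D2 may be displayed before D1
    return all(lo < shown_at[i] < hi for i in range(label_start, label_end))


# "ab.CD.ef" in an LTR paragraph shows as "ab.DC.ef": label CD is logical 3..4.
print(satisfies_grouping([0, 1, 2, 4, 3, 5, 6, 7], 3, 5))  # True

# "AB.cd.EF" in an RTL paragraph shows as "FE.cd.BA": D2 comes before D1.
print(satisfies_grouping([7, 6, 5, 3, 4, 2, 1, 0], 3, 5))  # True

# Hypothetical permutation where one label character escapes past D1.
print(satisfies_grouping([0, 1, 3, 2, 4, 5, 6, 7], 3, 5))  # False
```

The check follows the quoted wording literally. Whether S1 and S2 range over arbitrary strings or only over valid labels decides which permutations have to be tested, not how a single one is tested.
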

>
>>
>> (The "zero-length" case represents the case where a domain name is
>> next to something that isn't a domain name, separated by a delimiter
>> character).
>>
>> The "No two labels" property can be formally stated as:
>>
>> If two labels L and L', embedded as for the test above, displayed in
>> a paragraph with the same directionality, are rearranged into the
>
> rearranged => bidi reordered
>>
>> same sequence of codepoints, neither L nor L' is a legal label.
>>
>>
>>
>> 4. A replacement for the RFC 3454 criterion
>>
>> A set of rules that satisfies the tests above is as follows. The
>> main bullets give the rule, subordinate bullets (if any) give
>> justifications or examples of things that break if this rule is not
>> present. The term "unstable" means that it fails to satisfy the
>> "remain grouped" property defined above.
>
> remove this, and change all instances to "fail the Label Grouping
> condition".
>
>>
>> Exhaustive testing has verified that strings that satisfy this
>> criterion satisfy both the requirements above at least for all
>> strings up to 6 characters.
>
> [[The above doesn't make it clear that these are in fact the bidi
> requirements that are referred to by the IDNA protocol document. I
> suggest the following rewording:]]
>
> Based on exhaustive testing of strings, the following conditions on
> characters have been developed to meet the Label Grouping and Label
> Uniqueness conditions.
>
> Bidi Label Requirement
>
> A series of labels where any one of them has a character of type R, AL or
> AN must meet all of the following conditions:
>
> [[then number each one for reference]]
>
>>
>> o Only characters with the BIDI properties L, R, AL, AN, EN, ES, BN,
>> ON and NSM are allowed.
>>
>> * B, S and WS are excluded because they are separators or spaces.
>>
>>
>> * LRE, LRO, RLE, RLO, PDF are excluded because they are bidi
>> controls.
>>
>> * ET is excluded because the string L ET is unstable.
>>
>> * CS is excluded because the string L CS is unstable.
>>
>>
>> o ES and ON are not allowed in the first position
>>
>> * ES R and ON R are both unstable.
>>
>> o ES and ON, followed by zero or more NSM, is not allowed in the
>> last position
>>
>> * L ON and L ES are both unstable.
>>
>>
>> o If an L is present, no R, AL or AN may be present, and vice versa.
>>
>> o If an EN is present, no AN may be present, and vice versa.
>>
>> o The first character may not be an NSM
>>
>> o The first character may not be an EN (European Number) or an AN
>> (Arabic Number).
>>
>> * If the character on both sides of a CS is an EN or an AN, the
>> labels turn unstable.
>>
>> * Some domain names where some of the labels use leading EN and
>> AN may be problem-free, but there's no way of verifying this
>> while looking at a single label in isolation.
>>
>> * NOTE: This is a restriction on ASCII labels when used together
>> with IDNA labels. This is a change from the existing rules for
>> ASCII labels.
>>
>>
>> * We could achieve stability by barring numbers at the end of
>> labels, but this may be more disruptive in practice.
>>
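
For anyone who wants to try the rule list against real labels, here is a minimal sketch in Python, assuming the rules exactly as listed in -04 (not the reworded "Bidi Label Requirement" suggested above). It uses the standard unicodedata.bidirectional() lookup; the function name and the violation tags are invented for this mail.

```python
import unicodedata

ALLOWED = {"L", "R", "AL", "AN", "EN", "ES", "BN", "ON", "NSM"}


def bidi_04_violations(label):
    """Return a list of tags for the -04 rules that the label breaks."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    problems = []
    if not classes:
        return problems

    # Only L, R, AL, AN, EN, ES, BN, ON and NSM are allowed.
    if any(c not in ALLOWED for c in classes):
        problems.append("disallowed-class")

    # Restrictions on the first character.
    first = classes[0]
    if first in {"ES", "ON"}:
        problems.append("first-es-on")
    if first == "NSM":
        problems.append("first-nsm")
    if first in {"EN", "AN"}:
        problems.append("first-number")

    # ES or ON, followed by zero or more NSM, may not end the label.
    tail = list(classes)
    while tail and tail[-1] == "NSM":
        tail.pop()
    if tail and tail[-1] in {"ES", "ON"}:
        problems.append("last-es-on")

    # L may not be mixed with R, AL or AN; EN may not be mixed with AN.
    present = set(classes)
    if "L" in present and present & {"R", "AL", "AN"}:
        problems.append("mixed-direction")
    if "EN" in present and "AN" in present:
        problems.append("mixed-numbers")
    return problems


for sample in ("abc", "abc-", "1abc", "a\u05d0", "\u05d0\u05d1"):
    print(repr(sample), bidi_04_violations(sample))
```

The result depends on the Unicode data shipped with the Python build, so classes may differ slightly from the Unicode version the draft was tested against.
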
>>
>> 5. Other issues in need of resolution
>>
>> This document concerns itself only with the rules that are needed
>> when dealing with domain names with characters that have differing
>> Bidi properties, and considers characters only in terms of their Bidi
>> properties. All other issues with these scripts have to be
>> considered in other contexts.
>>
>>
>> Another set of issues concerns the proper display of IDNs with a
>> mixture of LTR and RTL labels, or only RTL labels.
>>
>> It is unrealistic to expect that domain names will be written using
>> embedded formatting codes between their labels; thus, the display
>
> According to IDNA2003, it is not only unrealistic but also illegal. Or
> maybe you mean the pre-preprocessed form? This needs fixing, depending on
> what you mean.

Some people have suggested that one should display domain names as 
<label><lrm>.<label>... or similar "interesting" constructs. I think that's 
unrealistic. I'm not sure it's illegal.

>>
>> order will be determined by the bidirectional algorithm. Thus, a
>> sequence (in network order) of R1.R2.ltr will be displayed in the
>> order 2R.1R.ltr in an LTR context, which might surprise someone
>> expecting to see labels displayed in hierarchical order. Again, this
>> memo does not attempt to suggest a solution to this problem.
>>
>>
>> 6. Compatibility considerations
>>
>> 6.1. Backwards compatibility considerations
>>
>> As with any change to an existing standard, it is important to
>> consider what happens with existing implementations when the change
>> is introduced. The following troublesome cases have been noted:
>>
>> o Old program used to input the newly allowed string. If the old
>> program checks the input against RFC 3454, the string will not be
>> allowed, and that domain name will remain inaccessible.
>>
>> o Old program is asked to display the newly allowed string, and
>> checks it against RFC 3454 before displaying. The program will
>> perform some kind of fallback, most likely displaying the Punycode
>> form of the string.
>>
>> o Old program tries to display the newly allowed string. If the old
>> program has code for displaying the last character of a string
>> that is different from the code used to display the characters in
>> the middle of the string, display may be inconsistent and cause
>> confusion.
>
> I don't understand this. Why would anyone want to do this? Why would
> they pick the last character, or the first, or the 3rd from the last, or
> any particular other character and assume they can tell the
> directionality? Why call this out in particular?

I don't know why anyone would want to do this. What I know is that I can't
guarantee that nobody's done this in existing code; the old programs (if
any) are already out there, and neither you nor I can get at them to fix
them.

It's been raised as an issue, so I'll include it. (Same answer as last time 
you asked.)

Thanks for the comments. I'll incorporate some of them into -05.

             Harald