Comments on bidi-04

Mon Mar 3 18:18:49 CET 2008

Comments below.

On Sat, Mar 1, 2008 at 5:10 PM, Mark Davis <mark.davis at icu-project.org>
wrote:
>
> 1.2. Background and history

The history is useful, but it shouldn't be right up at the front. Most
people reading this document won't care why it is the way it is; they
will care what the spec says and how to use it.

Section 1.2 and all of Section 2 should be moved, either to the Rationale
document with the rationale for all the other parts, or at least to an
appendix. Given the structure of the documents, Rationale would be
better.

>
> The IDNA specification "Stringprep", [RFC3454] makes the following
> statement in its section 6 on the bidi algorithm, :
>
...

> The justification proposed is this:
>
>
> o No two labels, when presented in display order, should have the
> same sequence of characters without also having the same sequence
> of characters in network order. (This is the criterion that is
> explicit in RFC 3454).

The above needs to be qualified, by adding something like "in the same bidi
context". That is, as pointed out below, if you change the embedding
context, 123-456 in one context may look like 456-123 in another; same for
abc.ABC and ABC.abc.

>
> o In a display of a string of labels, the characters of each label
> should remain grouped between the characters delimiting the
> labels.

You need to name both clauses, since you refer to them below. Something
like:

o *Label Uniqueness Condition.* No two labels, when presented in display
order, should have the
...
o *Label Grouping Condition.* In a display of a string of labels, the
characters of each label
...

>
> o These properties should hold true both when the string is embedded
>
> in a paragraph with LTR direction and when it's embedded in a
> paragraph with RTL direction, as long as explicit directional
> controls are not used within the same paragraph.
>
> Several stronger statements were considered and rejected, because
>
> they seem to be impossible to fulfil within the constraints of the
> Unicode bidirectional algorithm. These include:
>
> o The appearance of a label should be unaffected by its embedding
> context. This proved impossible even for ASCII labels; the label
>
> "123-456" will have a different display order in an RTL context
> than in a LTR context.
>
>
>
>
> Alvestrand & Karp Expires August 17, 2008 [Page 7]
>
> Internet-Draft IDNA RTL fix Feb 2008
>
>
>
> o The sequence of labels should be consistent with network order.
> This proved impossible - a domain name consisting of the labels
> (in network order) L1.R1.R2.L2 will be displayed as L1.R2.R1.L2 in
>
> an LTR context.
>
> o The "remain grouped" property should remain true when directional
> controls (LRE, RLE, RLO, LRO, PDF) are used in the same paragraph
> (outside of the labels). Because these controls affect
> presentation order in non-obvious ways, by affecting the "sor" and
> "eor" properties of the Unicode BIDI algorithm, the conditions
> above would be very hard to satisfy for an useful set of strings
> if this was true. As long as these controls have no influence
> over the display of the domain name, no problem will be caused,
> but the exact criterion for "will not influence" is hard to
> codify.

The above is too strong. We didn't actually attempt to see what difference
these would make. Certainly they are a different bidi context. However,
using overrides forces all characters to have the same directionality, so it
is actually *easy* to meet the above criteria in that case, because having
the same order will guarantee -- within *that* context -- both
disambiguation and contiguity.

So I'd suggest just rewording to say that that wasn't a goal.

>
> o The "no two labels display the same" should hold true between LTR
> paragraphs and RTL paragraphs. This was shown to be unsound.

The word "unsound" both here and a few lines down, is the wrong word. You
should either use "untenable" or "impossible" (as you had above).

>
> o No two domain names should be displayed the same, even under
>
> differing directionality. This was shown to be unsound, since the
> domain name (network) ABC.abc will have display order CBA.abc in
> an LTR context and abc.CBA in an RTL context, while the domain
>
> name (network) abc.ABC will display as abc.CBA in an LTR context
> and as CBA.abc in an RTL context.
>
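To make the ABC.abc / abc.ABC example easy to re-check, here is a minimal Python sketch of the reordering. It is not a full UAX #9 implementation: it assumes the draft's notation (uppercase stands for R, lowercase for L), treats every other character as a neutral, ignores digits and explicit directional controls, and the function name display_order is made up for this illustration.

```python
def display_order(text: str, rtl_paragraph: bool) -> str:
    """Reorder text containing only L (lowercase), R (uppercase) and neutrals."""
    base_dir = "R" if rtl_paragraph else "L"

    # Assumption: lowercase = L, uppercase = R, anything else is neutral.
    types = ["L" if c.islower() else "R" if c.isupper() else "N" for c in text]

    # Resolve neutrals: a run of neutrals takes the direction of its
    # neighbours when they agree, otherwise the paragraph direction.
    # With no embeddings, the string boundaries act like the paragraph direction.
    resolved = list(types)
    i = 0
    while i < len(types):
        if types[i] != "N":
            i += 1
            continue
        j = i
        while j < len(types) and types[j] == "N":
            j += 1
        before = types[i - 1] if i > 0 else base_dir
        after = types[j] if j < len(types) else base_dir
        resolved[i:j] = [before if before == after else base_dir] * (j - i)
        i = j

    # Implicit levels: in an LTR paragraph R goes to level 1;
    # in an RTL paragraph R stays at level 1 and L goes to level 2.
    if rtl_paragraph:
        levels = [2 if t == "L" else 1 for t in resolved]
    else:
        levels = [1 if t == "R" else 0 for t in resolved]

    # Reordering: from the highest level down to level 1, reverse every
    # maximal run of characters at that level or higher.
    chars = list(text)
    for level in range(max(levels, default=0), 0, -1):
        i = 0
        while i < len(chars):
            if levels[i] < level:
                i += 1
                continue
            j = i
            while j < len(chars) and levels[j] >= level:
                j += 1
            chars[i:j] = reversed(chars[i:j])
            levels[i:j] = reversed(levels[i:j])
            i = j
    return "".join(chars)


print(display_order("ABC.abc", False))  # CBA.abc
print(display_order("ABC.abc", True))   # abc.CBA
print(display_order("abc.ABC", False))  # abc.CBA
print(display_order("abc.ABC", True))   # CBA.abc
```

Under these assumptions the two different names display identically once the paragraph direction differs, which is why uniqueness can only be required within a single bidi context.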

...

> The "remain grouped" property can be more formally stated as:
>
> o Let "Delimiter chars" be a set of characters with the Unicode BIDI
>
> properties CS, WS, ON. (These are commonly used to delimit labels
> - both the FULL STOP and the space are included.)
>
> * ET, though it commonly occurs next to domain names in practice,
> is problematic: the context R CS L EN ET (for instance A.a1%)
>
> makes the label L EN grow unstable.

"grow unstable" should be "become unstable". However, "unstable" is only
defined further down, so you would need to move the definition up. I'd
actually prefer being more explicit, since "unstable" can have multiple
meanings, and just use explicit phrasing like:

makes the label L EN break the Label Grouping condition.

>
> * ES commonly occurs in labels as HYPHEN-MINUS, but could also be
> used as a delimiter (for instance, the plus sign). It is left
> out here.
>
>
> o Let "Position" be the position of a character in a string (in
> network order)
>
> o Let "Bidi position" be the position computed by the Unicode Bidi
> algorithm
>
> In a paragraph with an embedded string formed from the substrings A B
>
> L C D, where A and D are (possibly zero-length) legal labels, and B
> and C are single "Delimiter chars", the label L is a legal label if,
> for all A, B, C and D, the bidi position of all characters in L is
>
> within the range of positions for the characters of L in the string,
> for both the LTR and RTL paragraph direction.

This doesn't make sense to me. What does "within the range of positions for
the characters of L in the string" mean?

Moreover, you can't say that this makes a label *legal*, since it might still
be illegal under the other conditions in IDNA. The definition is also
circular, since you can only define L to be legal if you know what makes A
and D legal. And the discussion of a "paragraph" may be confusing to people:
what does a paragraph have to do with a label condition?

I suggest something like:

A label L satisfies the Label Grouping Condition when for any Delimiter
Characters D1 and D2 and any other strings S1 and S2 (possibly of length
zero):

If the string formed by concatenating S1, D1, L, D2, S2 is subject to bidi
reordering,
then all of the characters of L in the reordered string are between D1 and
D2.

  - The bidi reordering of S1, D1, L, D2, S2 may result in D2 coming before
D1.
  - Because S1 is any string, the bidi algorithm may set the paragraph
direction for the string to either Right-To-Left or Left-To-Right; thus the
reordering condition has to work in both bidi contexts.
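
The condition as worded above can be checked mechanically once a reordering is known. The following is only a sketch in Python: it assumes the caller already has the display permutation from some bidi implementation, and the name satisfies_grouping and its parameters are invented for this illustration.

```python
def satisfies_grouping(visual: list[int], d1: int, d2: int) -> bool:
    """Check the Label Grouping Condition for one reordered string.

    visual[k] is the logical (network-order) index of the character shown
    at display slot k. D1 sits at logical index d1, D2 at logical index d2,
    and the label L occupies the logical indices strictly between them.
    """
    # Map each logical index to the display slot it ended up in.
    slot = {logical: k for k, logical in enumerate(visual)}
    # D2 may come before D1 after reordering, so sort the two slots.
    lo, hi = sorted((slot[d1], slot[d2]))
    return all(lo < slot[i] < hi for i in range(d1 + 1, d2))


# Hand-made permutations for an 8-character string with D1 at index 2,
# L at indices 3 and 4, and D2 at index 5.
grouped = [0, 1, 2, 4, 3, 5, 6, 7]    # L reversed in place between D1 and D2
escaped = [0, 1, 3, 2, 4, 5, 6, 7]    # one character of L moved before D1
print(satisfies_grouping(grouped, 2, 5))
print(satisfies_grouping(escaped, 2, 5))
```

A real test would run this over both paragraph directions and over many choices of S1, D1, D2 and S2, since the condition has to hold for all of them.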

>
> (The "zero-length" case represents the case where a domain name is
> next to something that isn't a domain name, separated by a delimiter
>
> character).
>
> The "No two labels" property can be formally stated as:
>
> If two labels L and L', embedded as for the test above, displayed in
> a paragraph with the same directionality, are rearranged into the

rearranged => bidi reordered
>
> same sequence of codepoints, neither L nor L' is a legal label.
>
>
>
> 4. A replacement for the RFC 3454 criterion
>
> A set of rules that satisfies the tests above is as follows. The
> main bullets give the rule, subordinate bullets (if any) give
> justifications or examples of things that break if this rule is not
>
> present. The term "unstable" means that it fails to satisfy the
> "remain grouped" property defined above.

Remove this sentence, and change all instances of "unstable" to "fails the
Label Grouping Condition".

>
> Exhaustive testing has verified that strings that satisfy this
> criterion satisfy both the requirements above at least for all
>
> strings up to 6 characters.

[[The above doesn't make it clear that these are in fact the bidi
requirements that are referred to by the IDNA protocol document. I suggest
the following rewording:]]

Based on exhaustive testing of strings, the following conditions on
characters have been developed to meet the Label Grouping and Label
Uniqueness conditions.

Bidi Label Requirement

A series of labels where any one of them has a character of type R, AL or AN
must meet all of the following conditions:

[[then number each one for reference]]

>
> o Only characters with the BIDI properties L, R, AL, AN, EN, ES, BN,
> ON and NSM are allowed.
>
> * B, S and WS are excluded because they are separators or spaces.
>
>
> * LRE, LRO, RLE, RLO, PDF are excluded because they are bidi
> controls.
>
> * ET is excluded because the string L ET is unstable.
>
> * CS is excluded because the string L CS is unstable.
>
>
> o ES and ON are not allowed in the first position
>
> * ES R and ON R are both unstable.
>
> o ES and ON, followed by zero or more NSM, is not allowed in the
> last position
>
> * L ON and L ES are both unstable.
>
>
> o If an L is present, no R, AL or AN may be present, and vice versa.
>
> o If an EN is present, no AN may be present, and vice versa.
>
> o The first character may not be an NSM
>
> o The first character may not be an EN (European Number) or an AN
>
> (Arabic Number).
>
> * If the character on both sides of a CS is an EN or an AN, the
> labels turn unstable.
>
> * Some domain names where some of the labels use leading EN and
> AN may be problem-free, but there's no way of verifying this
>
> while looking at a single label in isolation.
>
> * NOTE: This is a restriction on ASCII labels when used together
> with IDNA labels. This is a change from the existing rules for
> ASCII labels.
>
>
> * We could achieve stability by barring numbers at the end of
> labels, but this may be more disruptive in practice.
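
For reference while numbering the conditions, here is a rough Python sketch of the rules as listed in the quoted -04 text, using the bidi classes reported by the standard unicodedata module. The function names and the failure messages are made up for this illustration, and the result depends on the Unicode version bundled with the Python in use.

```python
import unicodedata

ALLOWED = {"L", "R", "AL", "AN", "EN", "ES", "BN", "ON", "NSM"}
RTL = {"R", "AL", "AN"}


def bidi_rule_failures(label: str) -> list[str]:
    """Return a description of each quoted rule that the label fails."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    if not classes:
        return ["empty label"]
    present = set(classes)
    failures = []
    if not present <= ALLOWED:
        bad = ", ".join(sorted(present - ALLOWED))
        failures.append("disallowed bidi class: " + bad)
    if classes[0] in ("ES", "ON"):
        failures.append("ES or ON in first position")
    # Skip trailing NSMs to find the last non-NSM character.
    tail = list(classes)
    while tail and tail[-1] == "NSM":
        tail.pop()
    if tail and tail[-1] in ("ES", "ON"):
        failures.append("ES or ON in last position")
    if "L" in present and present & RTL:
        failures.append("L mixed with R, AL or AN")
    if "EN" in present and "AN" in present:
        failures.append("EN mixed with AN")
    if classes[0] == "NSM":
        failures.append("NSM in first position")
    if classes[0] in ("EN", "AN"):
        failures.append("EN or AN in first position")
    return failures


def check_labels(labels: list[str]) -> dict[str, list[str]]:
    """Apply the rules only when some label contains R, AL or AN."""
    seen = {unicodedata.bidirectional(ch) for lab in labels for ch in lab}
    if not seen & RTL:
        return {}
    report = {}
    for lab in labels:
        failures = bidi_rule_failures(lab)
        if failures:
            report[lab] = failures
    return report


print(bidi_rule_failures("abc-"))
print(check_labels(["1abc", "\u05d0\u05d1"]))  # second label is Hebrew alef, bet
```

The check_labels wrapper follows the suggested framing above, where the rules apply to a series of labels only when one of them contains a character of type R, AL or AN.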
>
>
> 5. Other issues in need of resolution
>
> This document concerns itself only with the rules that are needed
>
> when dealing with domain names with characters that have differing
> Bidi properties, and considers characters only in terms of their Bidi
> properties. All other issues with these scripts have to be
> considered in other contexts.
>
>
> Another set of issues concerns the proper display of IDNs with a
> mixture of LTR and RTL labels, or only RTL labels.
>
> It is unrealistic to expect that domain names will be written using
> embedded formatting codes between their labels; thus, the display

According to IDNA2003, it is not only unrealistic but also illegal. Or maybe
you mean the pre-preprocessed form? This needs fixing, depending on what you
mean.

>
> order will be determined by the bidirectional algorithm. Thus, a
> sequence (in network order) of R1.R2.ltr will be displayed in the
> order 2R.1R.ltr in a LTR context, which might surprise someone
> expecting to see labels displayed in hierarchical order. Again, this
>
> memo does not attempt to suggest a solution to this problem.
>
>
> 6. Compatibility considerations
>
> 6.1. Backwards compatibility considerations
>
> As with any change to an existing standard, it is important to
>
> consider what happens with existing implementations when the change
> is introduced. The following troublesome cases have been noted:
>
> o Old program used to input the newly allowed string. If the old
>
> program checks the input against RFC 3454, the string will not be
> allowed, and that domain name will remain inaccessible.
>
> o Old program is asked to display the newly allowed string, and
> checks it against RFC 3454 before displaying. The program will
>
> perform some kind of fallback, most likely displaying the Punycode
> form of the string.
>
> o Old program tries to display the newly allowed string. If the old
> program has code for displaying the last character of a string
> that is different from the code used to display the characters in
>
> the middle of the string, display may be inconsistent and cause
> confusion.

I don't understand this. Why would anyone want to do this? Why would they
pick the last character, or the first, or the third from the last, or any
other particular character, and assume they can tell the directionality
from it? Why call this out in particular?

>
> One particular example of the last case is if a program chooses to
> examine the last character (in network order) of a string in order to
>
> determine its directionality, rather than its first; if it finds an
> NSM character and tries to display the string as if it was a left-to-
> right string, the resulting display may be interesting, but not
> useful.
>
>
> The editors believe that these cases will have less harmful impact in
> practice than continuing to deny the use of words from the languages
> for which these strings are necessary as IDN labels.
>
> This specification forbids using leading European numbers in ASCII-
>
> only labels; this is in conflict with a large installed base of such
> labels. The harm resulting from violating this rule is seen when a
> label at the next level down in the hierarchy ends with a number
> (Arabic or European). Zone managers, both registries and private
>
> zone managers, can check for this particular condition before they
> allow registration of any string with right-to-left characters in it;
> generally it is best to not allow registration of any right-to-left
>
> strings in a zone where the label at the level above begins with a
> digit.
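
The zone-manager check described in the paragraph above could look roughly like the following Python sketch. It assumes "begins with a digit" means a first character of bidi class EN or AN, and the names registration_is_risky and zone_labels are hypothetical; a registry's actual policy would need more than this.

```python
import unicodedata

RTL_CLASSES = {"R", "AL", "AN"}


def has_rtl(label: str) -> bool:
    """True if any character of the label has bidi class R, AL or AN."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in label)


def registration_is_risky(new_label: str, zone_labels: list[str]) -> bool:
    """Flag a right-to-left label registered directly under a digit-led label.

    zone_labels lists the labels of the zone name, nearest parent first,
    for example ["3com", "example"] for the zone 3com.example.
    """
    if not zone_labels or not zone_labels[0] or not has_rtl(new_label):
        return False
    first_class = unicodedata.bidirectional(zone_labels[0][0])
    return first_class in ("EN", "AN")


hebrew = "\u05d0\u05d1"  # alef, bet
print(registration_is_risky(hebrew, ["3com", "example"]))
print(registration_is_risky(hebrew, ["shop", "example"]))
```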
>
> 6.2. Forward compatibility considerations
>
> This text is, intentionally, specified strictly in terms of the
> Unicode BIDI properties. The determination that the condition is
>
> sufficient to fulfil the criteria depends on the Unicode BIDI
> algorithm; it is unlikely that drastic changes will be made to this
> algorithm.
>
> However, the determination of validity for any string depends on the
>
> Unicode BIDI property values, which are not declared immutable by the
> Unicode Consortium. Furthermore, the behaviour of the algorithm for
> any given character is likely to be linguistically and culturally
>
> sensitive, so that it's not unlikely that later versions of the
> Unicode standard may change the bidi properties assigned to certain
> Unicode characters.
>
> This memo does not propose a solution for this problem.
>
> 7. IANA Considerations
>
>
> This document makes no request of IANA.
>
> Note to RFC Editor: this section may be removed on publication as an
> RFC.
>
>
> 8. Security Considerations
>
> This modification will allow some strings to be used in Stringprep
>
> contexts that are not allowed today. It is possible that differences
> in the interpretation of the specification between old and new
> implementations could pose a security risk, but it is difficult to
> envision any specific instantiation of this.
>
>
> Any rational attempt to compute, for instance, a hash over an
> identifier processed by Stringprep would use network order for its
> computation, and thus be unaffected by the changes proposed here.
>
>
> While it is not believed to pose a problem, if display routines had
> been written with specific knowledge of the RFC 3454 Stringprep
> prohibitions, it is possible that the potential problems noted under
> "backwards compatibility" could cause new kinds of confusion.
>
>
> The rule about leading numbers, which is more restrictive than
> current practice for domain names, has a peculiar interaction with
> the DNAME record; a DNAME record can point to a zone where right-to-
>
> left labels are registered without the knowledge or consent of the
> zone owner; if the name of the DNAME begins with a number, this can
> cause display of the right-to-left labels in the zone to be
> confusing. It is recommended that DNAMEs pointing to zones allowing
>
> right-to-left labels should not start with a digit, but a pointed-to
> zone owner has no way of enforcing this.
>
>
> 9. Acknowledgements
>
> While the listed editors held the pen, this document represents the
>
> joint work and conclusions of an ad hoc design team. In addition to
> the editors this consisted of, in alphabetic order, Tina Dam, Patrik
> Faltstrom, and John Klensin. Many further specific contributions and
>
> helpful comments were received from the people listed below, and
> others who have contributed to the development and use of the IDNA
> protocols.
>
> The team wishes in particular to thank Roozbeh Pournader for calling
>
> its attention to the issue with the Thaana script, Paul Hoffmann for
> pointing out the need to be explicit about backwards compatibility
> considerations, Ken Whistler for suggesting the basis of the
> formalized "remain grouped" requirement, and Erik van der Poel for
>
> careful review, comments and verification of the rulesets.
>
>
> Appendix A. Change log
>
> This appendix is intended to be removed when this document is
> published as an RFC.
>
> A.1. Changes from -00 to -01
>
>
> Suggested a possible new algorithm.
>
> Multiple smaller changes.
>
> A.2. Changes from -01 to -02
>
> Date of publication updated.
>
> Change log added.
>
> A.3. Changes from -02 to -03
>
>
> Intro changed to reflect addressing the deeper issues with the Bidi
> algorithm.
>
> Gave formalized criteria for "valid strings", and documented the new
> set of requirements for strings that satisfy the criteria.
>
>
> Removed most of section 5, "Other problems", and noted that this memo
> focuses ONLY on issues that can be evaluated by looking at the bidi
> properties of characters.
>
> A.4. Changes from -03 to -04
>
>
> Added back AN to the list of allowed characters; it had been left out
> by accident in -03.
>
> Removed some rules that were redundant.
>
> Added some considerations for backwards compatibility and interaction
>
> with ASCII labels that start with a number.
>
> Mentioned the issue with DNAME pointing to a zone containing RTL
> labels in the security considerations section.
>
> Wording updates in multiple places, including some spelling errors.
>
> Rewrote the introduction section.
>
> Split references into "normative" and "informative".
>
>
>
> 10. References
>
> 10.1. Normative references
>
> [I-D.klensin-idnabis-issues]
> Klensin, J., "Internationalizing Domain Names for
> Applications (IDNA): Issues, Explanation, and Rationale",
>
> draft-klensin-idnabis-issues-07 (work in progress),
> February 2008.
>
> [UAX9] Davis, M., "Unicode Standard Annex #9: The Bidirectional
> Algorithm, revision 15", 03 2005.
>
>
> 10.2. Informative references
>
> [RFC3454] Hoffman, P. and M. Blanchet, "Preparation of
> Internationalized Strings ("stringprep")", RFC 3454,
> December 2002.
>
>
>
> Authors' Addresses
>
> Harald Tveit Alvestrand (editor)
> Google
> Beddingen 10
> Trondheim, 7014
> Norway
>
> Email: harald at alvestrand.no
>
> Cary Karp (editor)
> Swedish Museum of Natural History
> Frescativ. 40
> Stockholm, 10405
> Sweden
>
> Phone: +46 8 5195 4055
> Fax:
> Email: ck at nrm.museum
>
> URI:
>
> Full Copyright Statement
>
> Copyright (C) The IETF Trust (2008).
>
> This document is subject to the rights, licenses and restrictions
>
> contained in BCP 78, and except as set forth therein, the authors
> retain all their rights.
>
> This document and the information contained herein are provided on an
> "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
>
> OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
> THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
> OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
>
> THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
> WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
>
>
> Intellectual Property
>
> The IETF takes no position regarding the validity or scope of any
>
> Intellectual Property Rights or other rights that might be claimed to
> pertain to the implementation or use of the technology described in
> this document or the extent to which any license under such rights
>
> might or might not be available; nor does it represent that it has
> made any independent effort to identify any such rights. Information
> on the procedures with respect to rights in RFC documents can be
> found in BCP 78 and BCP 79.
>
>
> Copies of IPR disclosures made to the IETF Secretariat and any
> assurances of licenses to be made available, or the result of an
> attempt made to obtain a general license or permission for the use of
> such proprietary rights by implementers or users of this
>
> specification can be obtained from the IETF on-line IPR repository at
> http://www.ietf.org/ipr.
>
> The IETF invites any interested party to bring to its attention any
>
> copyrights, patents or patent applications, or other proprietary
> rights that may cover technology that may be required to implement
> this standard. Please address the information to the IETF at
> ietf-ipr at ietf.org.
>
>
>
> Acknowledgment
>
> Funding for the RFC Editor function is provided by the IETF
> Administrative Support Activity (IASA).
>

-- 
Mark