comments on draft-ietf-idnabis-bidi

Thu Jul 23 08:00:57 CEST 2009

In my previous suggestions, I did not take into consideration that the rules 
are also meant to cover labels which do not contain any RTL characters. 
Having understood that, here is an updated version of my suggestions:

Definitions:

1. Bidi domain names are domain names which include at least one RTL 
label.

2. An RTL label is a label which contains at least one character of type R 
or AL or AN.
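
The two definitions above can be sketched in a few lines of Python. This is 
only an illustration: the function names are hypothetical, the Bidi property 
of each character is taken from the standard unicodedata module, and the 
domain name is assumed to be a Unicode string split on U+002E only.

```python
import unicodedata

# Bidi properties that make a label an RTL label (definition 2).
RTL_CLASSES = {"R", "AL", "AN"}


def is_rtl_label(label: str) -> bool:
    """True if the label contains at least one R, AL or AN character."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in label)


def is_bidi_domain_name(name: str) -> bool:
    """True if at least one label of the name is an RTL label (definition 1)."""
    # Assumption: labels are separated by U+002E FULL STOP only.
    return any(is_rtl_label(label) for label in name.split("."))
```

For instance, is_bidi_domain_name("example.\u05d0\u05d1") is true because 
the second label consists of Hebrew letters, whose Bidi property is R.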

Rules for RTL labels in Bidi domain names:

   1.  Only characters with the BIDI properties R, AL, AN, EN, ES,
       CS, ET, ON, BN and NSM are allowed in RTL labels.

   2.  The first position must be a character with Bidi property R or AL.

   3.  The last position must be a character with Bidi property R, AL, EN
       or AN, followed by zero or more NSM.

   4.  If an EN is present, no AN may be present, and vice versa.
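
As a rough sketch, the four rules above could be checked as follows. The 
function name check_rtl_label is hypothetical, Bidi properties come from 
Python's unicodedata module, and rejecting the empty label is my own 
assumption rather than part of the rules.

```python
import unicodedata

# Rule 1: Bidi properties allowed in an RTL label.
ALLOWED_IN_RTL = {"R", "AL", "AN", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}


def check_rtl_label(label: str) -> bool:
    """Return True if the label passes rules 1-4 for RTL labels."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    if not classes:
        return False  # assumption: an empty label never passes
    # Rule 1: only allowed Bidi properties.
    if any(c not in ALLOWED_IN_RTL for c in classes):
        return False
    # Rule 2: first position is R or AL.
    if classes[0] not in {"R", "AL"}:
        return False
    # Rule 3: skip trailing NSM, then require R, AL, EN or AN.
    end = len(classes)
    while end > 0 and classes[end - 1] == "NSM":
        end -= 1
    if end == 0 or classes[end - 1] not in {"R", "AL", "EN", "AN"}:
        return False
    # Rule 4: EN and AN must not both be present.
    if "EN" in classes and "AN" in classes:
        return False
    return True
```

With this sketch, a Hebrew label ending in an ASCII digit passes, while a 
label that starts with a digit, ends with a hyphen, or contains an L 
character is rejected.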

Rules for non-RTL labels in Bidi domain names:

   1.  Only characters with the BIDI properties L, EN, ES,
       CS, ET, ON and NSM are allowed in non-RTL labels.

   2.  The first position must be a character with Bidi property L.

   3.  The last position must be a character with Bidi property L or EN,
       followed by zero or more NSM, or the two last positions must be 
       EN followed by ET.
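
The rules for non-RTL labels can be sketched the same way. Again the 
function name is hypothetical, Bidi properties come from Python's 
unicodedata module, and rejecting the empty label is an assumption.

```python
import unicodedata

# Rule 1: Bidi properties allowed in a non-RTL label.
ALLOWED_IN_NON_RTL = {"L", "EN", "ES", "CS", "ET", "ON", "NSM"}


def check_non_rtl_label(label: str) -> bool:
    """Return True if the label passes rules 1-3 for non-RTL labels."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    if not classes:
        return False  # assumption: an empty label never passes
    # Rule 1: only allowed Bidi properties.
    if any(c not in ALLOWED_IN_NON_RTL for c in classes):
        return False
    # Rule 2: first position is L.
    if classes[0] != "L":
        return False
    # Rule 3, second alternative: the two last positions are EN then ET.
    if classes[-2:] == ["EN", "ET"]:
        return True
    # Rule 3, first alternative: L or EN followed by zero or more NSM.
    end = len(classes)
    while end > 0 and classes[end - 1] == "NSM":
        end -= 1
    return end > 0 and classes[end - 1] in {"L", "EN"}
```

Under these rules "abc1" and "a1%" pass (the percent sign has Bidi 
property ET), while "1abc", "ab-" and "a%" do not.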

Shalom (Regards),  Mati
           Bidi Architect
           Globalization Center Of Competency - Bidirectional Scripts
           IBM Israel
           Phone: +972 2 5888802    Fax: +972 2 5870333    Mobile: +972 52 2554160

From: Harald Alvestrand <harald at alvestrand.no>
Date: 20/07/2009 21:01
To: Matitiahu Allouche/Israel/IBM at IBMIL
cc: idna-update at alvestrand.no
Subject: Re: comments on draft-ietf-idnabis-bidi

Thank you for your input, and apologies for losing track of this for 
several months.

Matitiahu Allouche wrote:
>
> My attention was recently drawn to the subject document (version 03) 
> and I have a number of comments.  Some of them are very minor (typos, 
> editorial) and reflect my pedantic mind, but I thought that I could as 
> well help improve the form of the document.  Other comments go more 
> to the essence, and I would appreciate your considering them seriously.
>
> 1) In section 2, first paragraph, "satisifes" should be "satisfies".
Thanks!
>
> 2) Section 2, rule 1 mentions the "Character Grouping requirement" for 
> the first time in the document.  Either there should be a forward 
> reference to section 3 where it will be explained, or (better, in my 
> opinion), the content of the current section 3 should precede the 
> content of the current section 2.
These sections were swapped earlier at the request of other 
participants. A forward pointer has been added.
>
> 3) In the sentence "ET is excluded because the string L ET does not 
> satisfy the Character Grouping requirement.", "L" seems to represent a 
> label, but can easily be confused with the L Bidi property (all the 
> more since it is adjacent to ET which surely represents a character 
> with the ET Bidi property).
L is intended to represent the L bidi property. See section 3 of the 
terminology section.
>
> 4) In the sentence "CS is excluded because the string L CS does not 
> satisfy the Character Grouping requirement.", "L" seems to represent a 
> label, but can easily be confused with the L Bidi property (all the 
> more since it is adjacent to CS which surely represents a character 
> with the CS Bidi property).
Same comment as above.
>
> 5) I see no reason why CS is excluded while ES is allowed.  Both can 
> be the source of the same kind of violation of the Character Grouping 
> requirement.  ES characters are excluded from the first and last 
> positions by rules 2 and 3.  With the same restrictions (exclusion 
> from the first and last positions), CS and ET characters can be 
> allowed and will not violate the Character Grouping requirement any 
> more than ES characters.
ES includes the hyphen character. Disallowing hyphens in labels seemed 
like too radical a step to take, so we decided to allow ES.
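
To see which ASCII punctuation falls into ES and which into CS, one can 
query the Unicode database. The following is only a small sketch using 
Python's unicodedata module; the helper name is hypothetical.

```python
import string
import unicodedata


def ascii_punctuation_by_class(bidi_class: str) -> list:
    """List the ASCII punctuation characters that have the given Bidi property."""
    return [ch for ch in string.punctuation
            if unicodedata.bidirectional(ch) == bidi_class]


print("ES:", ascii_punctuation_by_class("ES"))
print("CS:", ascii_punctuation_by_class("CS"))
```

The hyphen shows up in the ES list, while the full stop shows up in the 
CS list.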

> 6) In section 1.1, there appears the following statement: "This 
> specification is not intended to place any requirements on domain 
> names that do not contain right-to-left characters."
> Also the title of section 2 is "A replacement for the RFC 3454 BIDI 
> rule" which implies that the text only deals with "Bidi" labels.
> If that means that the specification applies only to labels which 
> contain at least one character with Bidi property R, AL or AN, and we 
> combine that with rule 4 "If an R, AL or AN is present, no L may be 
> present.", then an L character can never be part of a Bidi label, and 
> the L should be removed from the list of allowed Bidi properties in 
> rule 1.
It applies to domain names where some of the labels contain 
right-to-left characters. Other labels in the same domain name can 
contain L characters.
>
> 7) In [UAX9], rule X9 says that BN characters must be removed from the 
> displayed text.  Any such invisible character violates the Label 
> Uniqueness requirement.  BN characters must not be allowed by rule 1.
BN includes ZERO WIDTH NON-JOINER and ZERO WIDTH JOINER, which we have 
been most explicitly told need to be included for certain labels in 
Arabic script.
Therefore, we cannot disallow this class.
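
The Bidi property of the two joiner characters can be confirmed with a 
short check. This is a sketch using Python's unicodedata module; the 
function name is hypothetical.

```python
import unicodedata


def joiner_classes() -> dict:
    """Map the names of U+200C and U+200D to their Bidi property."""
    return {unicodedata.name(chr(cp)): unicodedata.bidirectional(chr(cp))
            for cp in (0x200C, 0x200D)}


for name, bidi_class in joiner_classes().items():
    print(name, bidi_class)
```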
>
> 8) From rules 1, 2, 4, 6 and 7, plus our comments 6 and 7 above, it 
> results that the first character of a Bidi label can only be of type R 
> or AL.  Such a statement can advantageously replace rules 2, 6 and 7.
Unfortunately this is not true. L, R and AL are all allowed.

I do agree that rules 2, 6 and 7 can be more succinctly formulated as 
such a statement - the advantage of the present formulation is that it 
gives a convenient hook on which to hang the long explanation of the 
issue with numbers - but that can be dealt with in other places.
>
> 9) Rule 5 includes no justification.  While a mixture of AN and EN 
> characters in the same label seems odd and not required in real life 
> situations, it is not clear what requirement would be violated by such 
> a combination.
When running tests and removing this rule, labels that broke the 
requirements were produced.
>
> 10) The rules allow AN or EN digits to appear in the last position of 
> a label (contrary to RFC 3454).  Let us consider the following 
> examples (where lower case letters represent L characters and upper 
> case letters represent R or AL characters):
>
>    a. network order       = "ABC123.456xyz"
>       display order (LTR) = "123.456CBAxyz"
>       display order (RTL) = "123.456xyzCBA"
>
>    b. network order       = "ABC.456-xyz"
>       display order (LTR) = "456.CBA-xyz"
>       display order (RTL) = "xyz-456.CBA"
>
>    c. network order       = "ABC123.456.xyz"
>       display order (LTR) = "123.456CBA.xyz"
>       display order (RTL) = "xyz.123.456CBA"
>
>    d. network order       = "ABC.456.xyz"
>       display order (LTR) = "456.CBA.xyz"
>       display order (RTL) = "xyz.456.CBA"
>
> Examples a, b and c show very ugly violations of the Character 
> Grouping requirement.  Since the document does not place requirements 
> on non-Bidi labels, any non-Bidi label starting with digits following 
> a Bidi label will cause a Character Grouping violation.
This is true, but only if you allow LTR labels in a BIDI domain name 
that do not follow the BIDI rule.

This has been the subject of much discussion, and the last paragraphs of 
section 2 ("The following guarantees can be made..") are the position 
reached on this subject. I am very unwilling to reopen this.
>  If Bidi labels are restricted from ending with digits (optionally 
> followed by NSMs), then non-Bidi labels which contain only digits 
> (example d) following a Bidi label will not cause a Character Grouping 
> violation.
Agreed, this is discussed extensively in section 5.
> Whether this modest benefit justifies imposing such a restriction is 
> subject to discussion.
>
> 11) Towards the end of section 2, there appears the following 
> sentence: "In a domain name consisting of only labels that pass the 
> test, the requirements of Section 3 are satisfied."
> This is not true for domain names like in the examples above, unless 
> non-Bidi labels are excluded, which is a very hard constraint.
Those domain names have labels that do not satisfy the criterion.
>
> 12) The next sentence says: "In a domain name consisting of only 
> LDH-labels and labels that pass the test, the requirements of Section 
> 3 are satisfied as long as a label that starts with an ASCII digit 
> does not come after a right-to-left label that ends in a digit."
> This is not true.  See example b above.
You are right. This needs to be documented; I did not test this case.
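
For reference, the condition named in the quoted sentence (a label that 
starts with an ASCII digit coming after a right-to-left label that ends 
in a digit) can be sketched as below. The function names are hypothetical, 
Hebrew letters stand in for the upper-case letters of the examples, and 
the code only tests that one condition; it does not compute display order.

```python
import unicodedata

ASCII_DIGITS = set("0123456789")


def rtl_label_ends_in_digit(label: str) -> bool:
    """True if the label has an R, AL or AN character and ends in EN or AN."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    if not any(c in {"R", "AL", "AN"} for c in classes):
        return False
    # Ignore trailing NSM characters when looking at the last position.
    while classes and classes[-1] == "NSM":
        classes.pop()
    return bool(classes) and classes[-1] in {"EN", "AN"}


def has_digit_adjacency(name: str) -> bool:
    """True if a label starting with an ASCII digit follows such an RTL label."""
    labels = name.split(".")
    return any(rtl_label_ends_in_digit(prev) and nxt[:1] in ASCII_DIGITS
               for prev, nxt in zip(labels, labels[1:]))


ABC = "\u05d0\u05d1\u05d2"  # stand-in for "ABC" in the examples
for example in (ABC + "123.456xyz", ABC + ".456-xyz",
                ABC + "123.456.xyz", ABC + ".456.xyz"):
    print(has_digit_adjacency(example))
```

Examples a and c are flagged by this condition, but b and d are not. 
Example b is the case raised in comment 12: the condition is not met, yet 
the display orders quoted for it still break Character Grouping.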
>
> 13) In section 3, there appears the sentence: "the label "123-456" 
> will have a different display order in an RTL context than in a LTR 
> context."
> This is not true, IMHO.  If the last letter before the label is not an 
> Arabic Letter, it will be displayed as "123-456" both in LTR and RTL 
> context.  If it is an Arabic Letter, it will be displayed as "456-123".
I will have to test this. Thanks for pointing it out.
>
> 14) In section 3, there appears the sentence: "The Label Uniqueness 
> property should hold true between LTR paragraphs and RTL paragraphs. 
>  This was shown to be unsound."
> In fact, in all cases where Character Grouping and Label Uniqueness 
> are satisfied for each paragraph direction separately, there will be 
> Label Uniqueness between LTR and RTL paragraphs.
I will have to test this. I think a fairly common case was found (ALEPH 
1 / 1 ALEPH comes to mind, but 1 ALEPH is disallowed). Since this was 
ruled out of context early on, I don't think either my code or Erik's 
code checks for this at the moment.
>
> 15) In section 3, since an "unproblematic label" can be a label which 
> satisfies the requirements, the clause "any label S1 and S2 that is 
> either a label satisfying the requirements or an unproblematic label" 
> can be shortened to "any label S1 and S2 that is an unproblematic 
> label".
Good simplification. Thanks!
>
> 16) In the formal statement of the Label Uniqueness requirement, there 
> is no provision (or exclusion) for the case where L and L' are 
> identical.
Thanks - I'll make this "two non-identical labels".
>
> 17) In summary I suggest that the rules in section 2 should be 
> reformulated as below.
>
>    1.  Only characters with the BIDI properties R, AL, AN, EN, ES,
>       CS, ET, ON and NSM are allowed in RTL labels.
>
>   2.  The first position must be a character with Bidi property R or AL.
>
>   3.  The last position must be a character with Bidi property R or AL,
>        followed by zero or more NSM.
>
>   3 variant.  The last position must be a character with Bidi property R,
>      AL, EN or AN, followed by zero or more NSM.
>
>   4 (debatable).  If an EN is present, no AN may be present, and vice
>       versa.
>
> It can be seen that this formulation is quite close to that in RFC 
> 3454, while solving all the problems that the subject document aims to 
> solve.
I will have to review this proposal more carefully given the comments 
above, and update my code to verify it. Will not do so this week, 
unfortunately.

Again, thanks for your comments, and apologies for losing track of them.

                     Harald