Mixing of AN and EN (Re: Protocol-08 (and status of Defs-04 and Rationale-06))

Fri Dec 19 21:04:15 CET 2008

Hi Erik,

 > Hi Alireza,
 >
 > On Tue, Dec 16, 2008 at 12:59 AM, Alireza Saleh <saleh at nic.ir> wrote:
 >  
 >> Would you please test the following examples :
 >>
 >> 1) <U+062C>a<U+0664>-<U+0665>
 >> 2) <U+062C>a<U+06F4>-<U+06F5>
 >>    
 >
 > There are no dots in these examples, so we cannot test whether any
 > characters jump over a dot. Also, these labels have different
 > characters, so they will never be the same, no matter how you re-order
 > them. See "Label Uniqueness" and "Character Grouping" in:
 >
 > http://tools.ietf.org/html/draft-ietf-idnabis-bidi-03#section-3
 >
 >  

I wrote those test cases to be evaluated against Harald's implementation
of TR9.
As I understand Harald's answer to Mark, he implemented TR9 and ran
Mark's test cases against it, so the examples do not need to contain
a dot.

 >> I'm a little bit confused about what -bidi is aiming at  ?  This is the RFC
 >> that going to be used to display the labels, or the labels should be tested
 >> against it during the registration ?
 >>    
 >
 > For registration, the IDNA2008 bidi rules are a MUST, but for lookup,
 > they are a SHOULD:
 >
 > http://tools.ietf.org/html/draft-ietf-idnabis-protocol-08
 >
 > The aim of the IDNA2008 bidi draft is to prevent registration of
 > labels that would cause confusion in the context of a full domain name
 > (more than one label). As you know, there was a long discussion about
 > whether or not we can require testing across multiple labels at
 > registration time, so we now have the wretched compromise of only
 > requiring testing within a single label.
 >
 > The IDNA2008 bidi draft is not the only one that is trying to avoid
 > confusion. I believe the Rationale draft also says that IDNA2008 is
 > trying to avoid general confusion by restricting labels to letters,
 > digits and the hyphen:
 >
 > http://tools.ietf.org/html/draft-ietf-idnabis-rationale-06#section-1.4
 >
 >
 >  

As you say, the protocol is trying to prevent registration of labels
that would cause visual confusion. As you know, the term 'Registration'
has a special meaning in the protocol documents. If confusion is the
concern, why not give Registries the responsibility for preventing
visual confusion (all of it, not just part of it) during registration?
This is especially true because the protocol neither claims that these
rules prevent confusion nor claims that every case matching the rules
is a confusable label.

In the protocol we have text that says AN and EN cannot be mixed
within a label, and that a label cannot start with EN or AN. As the
reason for having these rules, the protocol says:

Some domain names where some of the labels use leading EN and
AN may be problem-free, but there's no way of verifying this
while looking at a single label in isolation

I think the protocol should take a more positive view. Domains that fall
under this condition may be requested by companies that have registered
those names as their trademarks. Instead of having these rules because
some domains MAY be problematic, we can encourage and help registries,
through documents such as Rationale, to adopt good policies and a good
definition of the languages they intend to support. With this rule,
something like IDN reverse delegation becomes impossible.

I have also found the following given as a reason for blocking labels
that start with a digit:

If the character on both sides of a CS is an EN or an AN, the
labels fail the Character Grouping requirement.

Is this statement true in all cases? As far as I know, there are many
examples that pass the grouping requirements but will be blocked because
of this rule. For example: ۱۲۳.۱۲۳.com
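
As an illustration only (not text from any draft), the condition quoted
above can be checked mechanically with Python's standard unicodedata
module. The helper names leading_bidi_class and dots_between_digits are
hypothetical, and the sketch tests only the literal "EN or AN on both
sides of a CS" condition, not the full Character Grouping requirement.

```python
import unicodedata

# Bidi classes treated as "digits" by the quoted rule.
DIGIT_CLASSES = {"EN", "AN"}

def leading_bidi_class(label):
    """Return the Unicode bidi class of the first character of a label."""
    return unicodedata.bidirectional(label[0])

def dots_between_digits(name):
    """Return the indexes of dots that have an EN or AN on both sides."""
    hits = []
    for i in range(1, len(name) - 1):
        if name[i] != ".":
            continue
        left = unicodedata.bidirectional(name[i - 1])
        right = unicodedata.bidirectional(name[i + 1])
        if left in DIGIT_CLASSES and right in DIGIT_CLASSES:
            hits.append(i)
    return hits

# The example from this message: two labels of Extended Arabic-Indic
# digits (U+06F1..U+06F3) followed by "com".
name = "\u06F1\u06F2\u06F3.\u06F1\u06F2\u06F3.com"

print([leading_bidi_class(label) for label in name.split(".")])
print(dots_between_digits(name))
```

Both digit labels start with a character of class EN, and only the first
dot has a digit on each side, so this name is caught by the quoted
condition regardless of how it actually displays.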

I also don't agree with the following answer by Harald:

' What is your reason to believe that domains are an LTR context? The
idea that domain names may occur in free text has been a basic
assumption behind the bidi work. If they didn't, the document would be a
lot shorter. '

I think every text area has a default direction, and some of them may
change it after detecting the input text.
If I recall Harald's answer at the IETF Dublin meeting correctly,
he said '-bidi is designed to prevent confusion in any text editor in
which you write the label, not only in URL parser applications'. As far
as I know, in the West no one types in RTL areas, but the document surely
will become shorter if we assume the default will be RTL. [do you mean LTR]

I also disagree with preventing the mixing of AN and EN in the protocol.
First, because the two digit sets have different shapes for 4, 5 and 6.
Second, this rule does not prevent confusion in other cases. Consider
ج۱۲۳.ب١٢٣.com as an example. Can a user who looks at this name tell
which digit set has been used in which label? And can the user tell which
label was entered first? If the user can answer these questions by
looking at this domain, s/he can also tell whether I have mixed the digits.
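
For readers who want to see which digit set belongs to which bidi class,
here is a minimal sketch using Python's standard unicodedata module. The
function names digit_classes and mixes_an_en are hypothetical, and the
check is only an approximation of the "no mixing of AN and EN" rule, not
an implementation of the bidi draft. Note that ASCII digits are also
class EN.

```python
import unicodedata

def digit_classes(label):
    """Return the set of digit bidi classes (AN and/or EN) used in a label."""
    classes = {unicodedata.bidirectional(ch) for ch in label}
    return classes & {"AN", "EN"}

def mixes_an_en(label):
    """True if the label contains both AN and EN characters."""
    return digit_classes(label) == {"AN", "EN"}

# First label of the example: JEEM + Extended Arabic-Indic digits 1, 2, 3.
label_en = "\u062C\u06F1\u06F2\u06F3"
# Second label of the example: BEH + Arabic-Indic digits 1, 2, 3.
label_an = "\u0628\u0661\u0662\u0663"
# A hypothetical label that mixes the two sets: Arabic-Indic 4 and
# Extended Arabic-Indic 5.
label_mixed = "\u062C\u0664\u06F5"

for label in (label_en, label_an, label_mixed):
    print(sorted(digit_classes(label)), mixes_an_en(label))
```

Each label of the example uses a single digit set, so neither is rejected
by the mixing rule, even though the two labels sit side by side in the
same name.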

I think this rule not only fails to solve the problem but also brings
more confusion to the users. The registries still need some rules on
top of the protocol.

Different registries have different requirements depending on the 
variety of languages they wish to support. Some may require mixed AN, EN 
labels. Let us just design safe guidelines to serve the communities.

Best Regards
Alireza