Comments on IDNA Bidi

Wed Jan 16 08:22:59 CET 2008

Michel Suignard wrote:
>> From: Harald Alvestrand [mailto:harald at alvestrand.no]
>> Sent: Tuesday, January 15, 2008 12:06 AM
>>
>> The investigations I've done so far tell me that the IDNA2003 rules
>> are far too weak, especially when it comes to the handling of AN
>> (Arabic Number) - there are MANY strings that people will want to have
>> valid in some context that will break in "interesting" ways when the
>> Unicode BIDI algorithm is applied to a paragraph containing such a
>> string used in the ways we usually use a domain name. So in addition to
>> allowing trailing NSM, we need to tighten up the rules in other ways.
>>
>> I'm not going to recommend any specific ruleset until I have some
>> confidence that this ruleset actually specifies a set of strings that
>> is both "safe" (passes some reasonable set of tests) and useful.
>>     
>
> Harald,
> My take on this is that if something breaks in 'interesting' ways, we can't allow it in a domain name. This is why we now have the encapsulation requirement with RCat, to avoid interaction between labels across weak separators and the not-so-funny side effects with the AN characters. So in the balance between useful and safe, we always have to pick the safe path.
>   
Thanks for that advice. Good to see that some people prefer the safe path.

Would you support banning leading English numbers from ASCII domain
names? They break badly when put next to an RTL label.
> I am not sure I fully understand why you think that the current rules are far too weak. As often, they are a compromise, and they were crafted and reviewed by bidi experts such as Matti Allouche, Jonathan Rosenne, Martin Duerst and others. They are also used in the IRI RFC, which has even more detailed considerations. It is interesting to note that an update to IDNA will probably require a similar one to IRI, because the problems are very similar between a domain name and a resource identifier (a mix of strong and weak bidi types). It is true that the scenario involving trailing combining marks (NSM) was missed, probably because there were no Yiddish or Dhivehi speakers in the reviewer list at that time.
>   
Your trust in the quality of the 2003 review is touching.
> In fact I think that as of now the bidi rules are almost 'too' tight and limit usage, which is unfortunate but necessary for security reasons. So I am not sure how we could make them even tighter. But without concrete cases I am probably speculating.
>
> This is why I am in favor of minor but essential changes to the rules that mostly preserve what we have now. It was my belief that the current rules had been reasonably validated, with few exceptions such as the case of trailing combining marks.
>   
As far as I can tell, nobody tested the interaction between AN (Arabic
Number) and domain name delimiters. They break badly - and in labels
without any AL characters around, the rules in IDNA2003 simply don't
come into play.

Neither do I believe that anyone seriously looked at the interactions
with ON before suggesting that "middle dot" be allowed in domain names,
or even ES (European Separator), of which the hyphen-minus (-) is the
prime example of a character used in domain names.

It took me, a complete neophyte to the art of BIDI programming,
approximately 4 days of work to come up with code to generate what I
believe are the problematic cases. If those who have worked with BIDI
for years had reviewed the 2003 specification with an eye to finding the
corner cases that might cause problems, we might not have the current
problem.

Some concrete examples:

A label consisting of "L ET" breaks apart in an RTL context when
embedded as L CS L ET CS.
A label consisting of "EN AN" breaks apart in an LTR context when
embedded as CS EN AN CS R.
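
For readers who want to reproduce these two patterns, here is a minimal
Python sketch that turns a sequence of bidi class names into a concrete
test string. It is not the generator mentioned above. The SAMPLE table,
the helper names, and the choice of representative characters are all
assumptions made for illustration. It only builds the strings and checks
their classes with the standard unicodedata module; deciding whether a
label "breaks apart" still requires running the result through a real
Unicode bidi implementation or a renderer.

```python
import unicodedata

# One representative character per bidi class (illustrative choices,
# not taken from the original message).
SAMPLE = {
    "L": "a",        # Latin letter
    "R": "\u05d0",   # Hebrew letter alef
    "AL": "\u0627",  # Arabic letter alef
    "EN": "1",       # ASCII digit
    "AN": "\u0661",  # Arabic-Indic digit one
    "ET": "$",       # dollar sign
    "ES": "-",       # hyphen-minus
    "CS": ".",       # full stop, the usual label delimiter
}


def build(pattern):
    """Turn a space-separated list of bidi classes into a test string."""
    return "".join(SAMPLE[name] for name in pattern.split())


def classes_of(text):
    """Report the bidi class of each character, space-separated."""
    return " ".join(unicodedata.bidirectional(ch) for ch in text)


# The two embeddings quoted above.
EXAMPLES = {
    "label 'L ET', RTL context": "L CS L ET CS",
    "label 'EN AN', LTR context": "CS EN AN CS R",
}

if __name__ == "__main__":
    for name, pattern in EXAMPLES.items():
        text = build(pattern)
        # Sanity check: the sample characters really have these classes.
        assert classes_of(text) == pattern
        print(name, [f"U+{ord(ch):04X}" for ch in text])
```

The resulting strings can then be placed in a paragraph with the stated
base direction to see how the label is displayed.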

I found eleven examples of 2-letter strings that would be permitted
under the rules you proposed, but break apart when displayed either in
an LTR context, an RTL context, or both. For 3-letter strings, I have
125 examples.

                   Harald