Follow-up to Monday's discussion of digits

Sun Nov 23 16:36:49 CET 2008

Ken,

Could you point out where the "slightly different *directional* 
behavior, with distinct bidi properties" is present? I'm reading TR9 but 
I don't yet see where this matters. Examples of the differences would be 
nice.

Eric

Kenneth Whistler wrote:
>> I agree with this, and think that the restriction _could_ be done only
>> by registries.
>>     
>
> And that is what I think should be done.
>
>   
>> That said, the example case may be unusual enough that
>> it is worth pushing into the protocol.  
>>
>> I'll probably get the terminology wrong in what follows, but my
>> current understanding is that the various ranges of digits always
>> contain the complete set of digits,
>>     
>
> Yes.
>
>   
>> even if the digits are really
>> shared.
>>     
>
> No. There are no such instances of digit "sharing" between
> runs of digits in the standard.
>
>   
>> In other words I think that the extended and non-extended
>> Arabic-Indic ranges in some sense contain three characters that would
>> have been the same code point had they not been digits.
>>     
>
> I don't think that is a correct interpretation.
>
> Arabic-Indic digits (U+0660..U+0669, in use for Arabic
> proper and most of the Arabic-script world outside of
> Central and South Asia) and the Eastern Arabic-Indic
> digits (U+06F0..U+06F9, in use for Perso-Arabic, essentially
> in Iran, Pakistan, and Afghanistan), are each complete
> sets of digits.
>
> The two sets have slightly different *directional* behavior,
> with distinct bidi properties. Arabic-Indic digits are bc=AN,
> while Eastern Arabic-Indic digits are bc=EN. That distinction,
> more than anything, was what required encoding two distinct
> sets. It also happens that the *glyphs* for several of the
> numbers (4 through 7, primarily) are often distinct
> for Arabic and for Perso-Arabic.
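
[Inline note: the two property values quoted above can be checked
against whatever Unicode data a given Python build ships with. This is
only a minimal sketch; the helper name bidi_classes is made up for
illustration and comes from neither TR9 nor the IDNA drafts.]

```python
import unicodedata

# The two digit runs named in the paragraph above.
ARABIC_INDIC = [chr(cp) for cp in range(0x0660, 0x066A)]          # U+0660..U+0669
EASTERN_ARABIC_INDIC = [chr(cp) for cp in range(0x06F0, 0x06FA)]  # U+06F0..U+06F9


def bidi_classes(chars):
    """Return the set of Bidi_Class values found among the given characters."""
    return {unicodedata.bidirectional(c) for c in chars}


print(bidi_classes(ARABIC_INDIC))          # {'AN'}
print(bidi_classes(EASTERN_ARABIC_INDIC))  # {'EN'}
```

Both runs are complete sets of ten decimal digits; only the bidi class
differs between them.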
>
>   
>>  It's a good
>> thing that the code points for digits are always in a contiguous
>> range, but it has created this unusual case that happens to be bad for
>> domain name label use.  Is that correct?
>>     
>
> Only partly so. You could switch over to the scripts of India
> and note that the digit zero for each of the scripts is
> essentially the "same" little circle, and claim that that
> fact could lead to combinatorial explosion if you started
> mixing those across scripts in domain names as well.
>
> Mixing scripts for digits in strings is just bad in principle.
> The only thing that is really different about the Arabic
> digits case is that both the Arabic-Indic digits and
> the Eastern Arabic-Indic digits are technically
> script=Arabic in the Unicode Standard, so you don't rule
> out mixing them in a label simply with a script 
> property test.
>
> But unless I'm missing something, we've given up trying
> to mandate the principle of not mixing scripts in a label
> in the protocol itself, assuming that that will be
> handled by registry policy. And I see the Arabic digits
> as just another example of "bad things can happen if
> you let people mix things that could be confused".
>
> The natural thing one would expect would be for registries
> for Iran, Pakistan, and Afghanistan to only allow
> U+06F0..U+06F9 digits in Arabic labels, and other
> registries outside of those countries to only allow
> U+0660..U+0669 digits in Arabic labels.
>
> On the other hand, if the group consensus is that the
> protocol *must* do something, the two sets of digits
> are essentially a *bidi* issue, and the cleanest way
> to account for a prohibition on their cooccurrence in
> a label would be to leverage the already existing
> requirement that labels pass the string constraints
> of the bidi document, and simply add a clause in there
> which would prohibit the cooccurrence of Arabic digits
> with different bidi class in the same label, instead
> of inventing new, complicated context rules and
> special case status for a particular list of characters
> elsewhere in the documents.
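
[Inline note: a clause of the kind suggested above could be tested
roughly as follows. This is a sketch under assumptions, not text from
the bidi document: the function name mixes_arabic_digit_classes is
hypothetical, and the check looks only at the two digit ranges
discussed in this thread.]

```python
import unicodedata


def _is_arabic_script_digit(ch: str) -> bool:
    """True for U+0660..U+0669 and U+06F0..U+06F9 (the two runs discussed here)."""
    cp = ord(ch)
    return 0x0660 <= cp <= 0x0669 or 0x06F0 <= cp <= 0x06F9


def mixes_arabic_digit_classes(label: str) -> bool:
    """Hypothetical check: does the label contain Arabic-script digits
    of more than one bidi class (AN and EN together)?"""
    classes = {
        unicodedata.bidirectional(ch)
        for ch in label
        if _is_arabic_script_digit(ch)
    }
    return len(classes) > 1


# One digit from each run in the same label: would be rejected.
print(mixes_arabic_digit_classes("\u0661\u06F2"))  # True
# Digits from a single run only: would pass this particular test.
print(mixes_arabic_digit_classes("\u0661\u0662"))  # False
```

Keying the test on bidi class rather than on an enumerated list of
code points is the point of the suggestion: no per-character context
rule is needed.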
>
> --Ken
>
>   
>>  Because if I understand this
>> correctly, it's sufficiently unlike other cases that treating it
>> specially in the protocol might be the right trade-off.  This "strange
>> case", after all, is part of what was the motivation behind having
>> context rules in the first place, no?
>>     
>
> _______________________________________________
> Idna-update mailing list
> Idna-update at alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/idna-update