Follow-up to Monday's discussion of digits

Thu Nov 20 00:45:19 CET 2008

> I agree with this, and think that the restriction _could_ be done only
> by registries.

And that is what I think should be done.

> That said, the example case may be unusual enough that
> it is worth pushing into the protocol.  
> 
> I'll probably get the terminology wrong in what follows, but my
> current understanding is that the various ranges of digits always
> contain the complete set of digits,

Yes.

> even if the digits are really
> shared.

No. There are no such instances of digit "sharing" between
runs of digits in the standard.

> In other words I think that the extended and non-extended
> Arabic-Indic ranges in some sense contain three characters that would
> have been the same code point had they not been digits.

I don't think that is a correct interpretation.

Arabic-Indic digits (U+0660..U+0669, in use for Arabic
proper and most of the Arabic-script world outside of
Central and South Asia) and the Eastern Arabic-Indic
digits (U+06F0..U+06F9, in use for Perso-Arabic, essentially
in Iran, Pakistan, and Afghanistan), are each complete
sets of digits.

The two sets have slightly different *directional* behavior,
with distinct bidi properties. Arabic-Indic digits are bc=AN,
while Eastern Arabic-Indic digits are bc=EN. That distinction,
more than anything, was what required encoding two distinct
sets. It also happens that the *glyphs* for several of the
digits (4 through 7, primarily) are often distinct
for Arabic and for Perso-Arabic.
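
For illustration only, both properties (each range being a complete
set of digits, and the two ranges having different bidi classes) can
be inspected with Python's standard unicodedata module. This is a
minimal sketch: the helper name describe() is made up for the example,
and the values reported come from whatever Unicode version the local
Python ships with.

```python
import unicodedata

# The two digit ranges discussed above.
ARABIC_INDIC = [chr(cp) for cp in range(0x0660, 0x066A)]
EASTERN_ARABIC_INDIC = [chr(cp) for cp in range(0x06F0, 0x06FA)]


def describe(digits):
    """Return (numeric value, bidi class) for each character.

    Hypothetical helper, named for this example only.
    """
    return [(unicodedata.digit(ch), unicodedata.bidirectional(ch))
            for ch in digits]


if __name__ == "__main__":
    for name, digits in (("Arabic-Indic", ARABIC_INDIC),
                         ("Eastern Arabic-Indic", EASTERN_ARABIC_INDIC)):
        print(name)
        for ch, (value, bidi_class) in zip(digits, describe(digits)):
            print(f"  U+{ord(ch):04X}  value={value}  bc={bidi_class}")
```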

>  It's a good
> thing that the code points for digits are always in a contiguous
> range, but it has created this unusual case that happens to be bad for
> domain name label use.  Is that correct?

Only partly so. You could switch over to the scripts of India
and note that the digit zero for each of the scripts is
essentially the "same" little circle, and claim that that
fact could lead to combinatorial explosion if you started
mixing those across scripts in domain names as well.

Mixing scripts for digits in strings is just bad in principle.
The only thing that is really different about the Arabic
digits case is that both the Arabic-Indic digits and
the Eastern Arabic-Indic digits are technically
script=Arabic in the Unicode Standard, so you don't rule
out mixing them in a label simply with a script 
property test.

But unless I'm missing something, we've given up trying
to mandate the no-mixing-of-scripts-in-a-label principle
in the protocol itself, assuming that that will be
handled by registry policy. And I see the Arabic digits
as just another example of "bad things can happen if
you let people mix things that could be confused".

The natural thing one would expect would be for registries
for Iran, Pakistan, and Afghanistan to only allow
U+06F0..U+06F9 digits in Arabic labels, and other
registries outside of those countries to only allow
U+0660..U+0669 digits in Arabic labels.
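
A registry-side check along those lines could be sketched as follows.
Everything here is an assumption made for the example: the registry
identifiers ("ir", "pk", "af"), the function names, and the idea that
a registry is keyed by a short string. It is not taken from any
registry's actual policy or from the protocol documents.

```python
# Hypothetical registry-policy sketch; names and keys are illustrative.
ARABIC_INDIC = range(0x0660, 0x066A)          # U+0660..U+0669
EASTERN_ARABIC_INDIC = range(0x06F0, 0x06FA)  # U+06F0..U+06F9

# Assumed identifiers for the registries of Iran, Pakistan, Afghanistan.
EASTERN_REGISTRIES = {"ir", "pk", "af"}


def permitted_digit_range(registry):
    """Return the single Arabic digit range this registry would allow."""
    if registry in EASTERN_REGISTRIES:
        return EASTERN_ARABIC_INDIC
    return ARABIC_INDIC


def digits_permitted(label, registry):
    """True if every Arabic-script digit in label is in the allowed range.

    Characters that are not in either digit range are ignored here;
    a real policy would check them separately.
    """
    allowed = permitted_digit_range(registry)
    for ch in label:
        cp = ord(ch)
        if cp in ARABIC_INDIC or cp in EASTERN_ARABIC_INDIC:
            if cp not in allowed:
                return False
    return True
```

With a policy of this shape, a label can never mix the two sets,
because each registry admits only one of them.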

On the other hand, if the group consensus is that the
protocol *must* do something, the two sets of digits
are essentially a *bidi* issue, and the cleanest way
to account for a prohibition on their cooccurrence in
a label would be to leverage the already existing
requirement that labels pass the string constraints
of the bidi document, and simply add a clause in there
which would prohibit the cooccurrence of Arabic digits
with different bidi class in the same label, instead
of inventing new, complicated context rules and
special case status for a particular list of characters
elsewhere in the documents.
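
As a rough sketch of what such a clause would test (not proposed
wording for the bidi document, and with a function name invented for
the example), the check reduces to collecting the bidi classes of the
Arabic digits in a label and rejecting the label if more than one
class turns up. It relies on Python's standard unicodedata module for
the bidi class lookup.

```python
import unicodedata


def is_arabic_digit(ch):
    """True for U+0660..U+0669 and U+06F0..U+06F9 only."""
    cp = ord(ch)
    return 0x0660 <= cp <= 0x0669 or 0x06F0 <= cp <= 0x06F9


def mixes_arabic_digit_classes(label):
    """True if the label holds Arabic digits of more than one bidi class.

    Hypothetical helper: ASCII digits and all other characters are
    deliberately left out of the test.
    """
    classes = {unicodedata.bidirectional(ch)
               for ch in label if is_arabic_digit(ch)}
    return len(classes) > 1
```

A label failing this test would be rejected together with any other
label that fails the existing bidi constraints, with no separate list
of special-case characters needed.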

--Ken

>  Because if I understand this
> correctly, it's sufficiently unlike other cases that treating it
> specially in the protocol might be the right trade-off.  This "strange
> case", after all, is part of what was the motivation behind having
> context rules in the first place, no?