Arabic digits

John C Klensin klensin at jck.com
Thu Dec 4 07:00:16 CET 2008


Hi.

During the last few days, I've managed to read, three times,
through the entire set of threads on this subject on this list
and a separate thread on the Arabic script list. 

These notes are written strictly as an individual participant in
the WG who is anxious to get this issue resolved correctly and
quickly, not as a document editor or in any other WG role.  They
attempt to summarize where we stand and what the various
arguments are, rather than to recommend a specific position or
outcome.

I am more confused than ever, and perhaps more by the discussion
than by the possible outcomes.

(1) No one involved with the Arabic script has suggested a
prohibition on mixing the Arabic-Indic digits and the Extended
Arabic-Indic digits alone.  The debate about whether those can
reasonably be separated by means of a registry restriction is
irrelevant to the original proposal, which was that, if any
character from either set of Arabic-Indic digits appeared, no
character from _either_ the other set or the European digit set
could appear in the label.   The goal was less to keep the two
types of Arabic-Indic digits separated from each other than to
keep either of the two from mixing with European digits.  
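In code, the original proposal's restriction might be sketched as follows (my own illustration, not WG text; the ranges are the Unicode Arabic-Indic digits U+0660..U+0669, the Extended Arabic-Indic digits U+06F0..U+06F9, and the European/ASCII digits U+0030..U+0039):

```python
# Sketch of the proposed rule: a label may draw its digits from at most
# one of the three sets.  Illustrative only.
ARABIC_INDIC = {chr(c) for c in range(0x0660, 0x066A)}      # U+0660..U+0669
EXT_ARABIC_INDIC = {chr(c) for c in range(0x06F0, 0x06FA)}  # U+06F0..U+06F9
EUROPEAN = set("0123456789")                                # U+0030..U+0039

def digit_sets_ok(label: str) -> bool:
    """True if the label does not mix digits from different sets."""
    used = [s for s in (ARABIC_INDIC, EXT_ARABIC_INDIC, EUROPEAN)
            if any(ch in s for ch in label)]
    return len(used) <= 1
```

So a label containing only Arabic-Indic digits passes, while one mixing an Arabic-Indic digit with a European digit fails.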

(2) To the extent to which a set of digits is associated with a
script, a registry-level restriction on mixing scripts (whether
strictly using a Unicode definition of script boundaries or
allowing some flexibility based on popular usage) is almost
certainly sufficient to deal with any digit issues.  Without
judging (in this paragraph) where any issues should be resolved,
digits are special only if they are either not associated with a
script or if a given script is associated with more than one set
of digits.   We have agreed, at least so far, that any
prohibitions on mixed scripts, as mixed scripts, in labels are a
registry problem and one that, because of many different local
conventions that mix Unicode Scripts in local practice, cannot
be resolved in the Protocol and Tables even if we wanted to do
that.  However, for Arabic script, a "one script" rule is not
sufficient because the Arabic script category (and blocks)
contains two full sets of digits.  It appears to be unique among
scripts in that regard.
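The point is easy to verify with Python's unicodedata module: both full digit sets sit inside the Arabic script blocks, which is why a plain "one script per label" rule does not by itself keep them apart.

```python
# Both digit sets are Arabic-script decimal digits with the same values.
import unicodedata

print(unicodedata.name('\u0660'))  # ARABIC-INDIC DIGIT ZERO
print(unicodedata.name('\u06F0'))  # EXTENDED ARABIC-INDIC DIGIT ZERO

assert unicodedata.decimal('\u0660') == 0
assert unicodedata.decimal('\u06F0') == 0
```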

(3) The compelling issue that may justify some protocol-level
treatment here is summarized in one sentence of one of Abdulaziz
Al-Zoman's notes (although others have said much the same
thing): "Users type digits without knowing the internal coding
used".   Assume that Microsoft made a decision to encode all
digits as European ones, regardless of how they are typed, and
did so for the reasons he cites in his note.  Assume also that
there are some Unix-derived systems that examined the tradeoffs
and made the other decision, i.e., to encode the characters as
typed, rather than converting to a common form.  At least for
Arabic script and the three sets of digits (Arabic-Indic,
Extended Arabic-Indic, and European), those assumptions are
almost certainly true.  Then we have a problem, not because of
something done on one system or the other, but because what they
have done is not the same.  The tradeoff in digit handling if
one wants numerically-oriented applications to work
independently of how the digits are written is actually fairly
clear:

	(i) Code all numerals the same way regardless of how
	they are entered, treating the display of those numerals
	as a localization issue.
	
	(ii) Code numerals as written, then resolve the
	differences in coding when the character-coded numerals
	are converted to digital-numeric form and, as
	appropriate, back.

	(iii) Abandon that goal on the theory that there is
	really no more reason to have an Arabic spreadsheet
	application interoperate with an English (or Chinese)
	spreadsheet application without translation than there
	is to have text files in the three languages
	automatically translated to each other on viewing.  A
	different way to state the third model is that one would
	need to translate an Arabic spreadsheet, characters and
	all, to use it in an English environment, just as one
would have to translate several paragraphs of Arabic text.
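Option (i) above can be sketched in a few lines (my own illustration, assuming the folding happens at input time; display localization is left to the rendering layer). Because it keys off the Unicode decimal-digit property, it covers any script's decimal digits, not just the Arabic sets:

```python
# Sketch of option (i): fold every Unicode decimal digit to its
# European (ASCII) counterpart, leaving all other characters alone.
import unicodedata

def fold_digits(text: str) -> str:
    out = []
    for ch in text:
        d = unicodedata.decimal(ch, None)   # None if not a decimal digit
        out.append(str(d) if d is not None else ch)
    return "".join(out)

print(fold_digits("\u0661\u0662\u0663"))  # Arabic-Indic 123 -> "123"
print(fold_digits("\u06F4\u06F5"))        # Extended Arabic-Indic 45 -> "45"
```

Option (ii) would instead apply such a conversion only at the boundary where character-coded numerals become numeric values, and the inverse on the way back out.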

My experience with, and intuitions about, the design of
operating systems is that the first option will inevitably lead
to bad results unless the operating system designers control all
other applications to be run on the system and maybe even if
they have that level of control.  I can, however, make arguments
for any of the three cases and I'd be very surprised if
Microsoft or, for that matter, the U**x system implementers,
cared about my opinions on this subject.

If Microsoft and the Unix systems are doing this across several
generations of their operating systems, including the
contemporary ones, and have no intention of making changes, I
think we are pragmatically obligated to deal with that
particular reality, just as we are obligated to deal with
contemporary operating environments for which Unicode is not the
primary input-output CCS.  How we feel about them aesthetically
or morally is really not an interesting issue.

I would appreciate it if someone within Microsoft who is
following this list could confirm the coding decision (I've
confirmed the other behavior for a couple of Unix-derivatives so
far, but can't guarantee that some are not different).   If I
correctly read the material Abdulaziz quoted, if I enter
Bengali, Devanagari, Thai or other digits through an appropriate
input method editor, I also get European digits in the data
file.   If that is not exactly what is happening, a better
explanation from a Microsoft perspective would be extremely
helpful, especially since my impression is that, at least
pre-Vista, decisions made for some localized versions of Windows
did not necessarily have exact parallels for other localized
versions. 

(4)  Let's assume that (3) is correct and that Microsoft (and
maybe others) have chosen (3)(i) and expect to stick with it
(such decisions are _very_ hard to change once made).   Let's
also assume that the same decision was not made by all other
operating systems, with others preferring that applications be
responsible for translation (or transcoding) of numerals using
the approaches of Case 3(ii) or 3(iii) above.    Let us then
consider two users, W and U, in some Arabic-speaking country,
who use Windows and an appropriate Unix-flavor respectively.
For W, any digits he types end up coded as European ones.  For
U, she can type either Arabic-Indic or European digits and have
them retained in files and URIs (she can also type any other
form of digit, with a greater or lesser degree of difficulty ...
and the European digits might possibly be hard to type).  A
domain name created by U and containing Arabic-Indic characters
in coded form cannot be typed by W because, if W enters
Arabic-Indic digits, they will be mapped to European ones.
Worse, given the worldwide tendency toward majority Windows
client systems and majority Unix (or Unix-derived) Web and email
servers, creation of strings by U that W cannot access is
actually likely to be a common occurrence.  
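The W/U gap is not something Unicode normalization can paper over: the two labels differ at the code point level, and the Arabic-Indic digits have no compatibility mapping to the European ones, so even NFKC leaves them distinct. A small demonstration (the label here is a hypothetical example name):

```python
# U's label ends in ARABIC-INDIC DIGIT ONE; W's ends in ASCII "1".
# Even NFKC normalization does not make them equal.
import unicodedata

base = "\u0645\u062B\u0627\u0644"       # Arabic letters (a sample label)
u_label = base + "\u0661"               # U+0661 ARABIC-INDIC DIGIT ONE
w_label = base + "1"                    # U+0031 DIGIT ONE

assert u_label != w_label
assert (unicodedata.normalize("NFKC", u_label)
        != unicodedata.normalize("NFKC", w_label))
```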

In retrospect, I assume that the above is the reason for the
initial recommendation from the Arabic Language WG that all
digits be mapped into European ones by the protocol (following
the Microsoft practice).   Of course, to be really practical, if
Microsoft actually codes _all_ digits to European ones, that
convention would need to be applied to any digit in any script.
That is clearly implausible at this stage, even if, in
retrospect, it turned out to be the right thing to do.

Viewed this way, the issue is not about what one vendor has or
has not done, but about different vendors examining the
tradeoffs between what we have called "digits are digits" and
exact representation of characters without information loss and
reaching different conclusions.  Those tradeoffs include the
various possible decisions about what code points one actually
gets when typing specific characters on a given keyboard setup.
To the extent to which that is the case, it makes sense for us
to do whatever helps maximize registry choice about what should
match and what should not.

Note that this issue is only very indirectly related to visual
confusability, phishing, etc.  If it were only about those
issues, or about so-called homographs, it would not be worth
having except possibly as recommendations to registries.  It is
a fundamental issue about the integrity of identifiers, the
ability to access domain names from within a single script and
language, and so on.

However, a prohibition on digit-type-mixing doesn't help us very
much in the example above:  If U codes digits strictly in
Arabic-Indic form and W codes them strictly in European form,
then they aren't going to be able to communicate regardless of
any prohibitions on mixing (regardless of where those
prohibitions occur and are enforced).  While that is not an
argument against a prohibition -- the inability to solve all
possible problems should not prevent us from solving those that
we can -- it does imply that we need to be careful about our
expectations.   Communication will be possible only if the
relevant registries enter labels with the same ownership and
destinations and if any testing on equivalence of URIs
recognizes the issues and deals with the two forms as equivalent.
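One way a registry could "deal with the two forms as equivalent" is to derive a digit-folded canonical form for every label and bundle all spellings that share it under one registrant. This is a minimal sketch of that idea (names and data structures are hypothetical, not a real registry interface):

```python
# Hypothetical registry-side variant bundling: labels that fold to the
# same canonical form belong to the same registrant.
import unicodedata

def canonical(label: str) -> str:
    """Fold every decimal digit to its European equivalent."""
    out = []
    for ch in label:
        d = unicodedata.decimal(ch, None)
        out.append(str(d) if d is not None else ch)
    return "".join(out)

registry = {}  # canonical form -> registrant

def register(label: str, owner: str) -> bool:
    """True if the label (or its variant) is available to this owner."""
    holder = registry.setdefault(canonical(label), owner)
    return holder == owner
```

Under this scheme, once U registers the Arabic-Indic spelling, the European-digit spelling resolves to the same holder rather than being a separate, possibly conflicting, name.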

(5)   Independent of the internal coding issues described above,
we have cases in both the Western Arabic Script world (e.g.,
Arabic Language) and the Eastern one (e.g., Pakistan and, I
gather from Sarmad's comments, several languages) in which
European digits are used interchangeably with local Arabic-Indic
ones (Arabic-Indic digits in the West and Extended Arabic-Indic
ones in the East).  But we have no users of Arabic script who
have described a _requirement_ for mixing the two Arabic-Indic
digit sets with each other, or either of them with European
digits, in the same label (independent of speculation about what
some people might want to do).

(6) Several notes indicated that we had a protocol-level rule
against mixing scripts.   There must be some confusion, because
there is no such rule.  Indeed, it would be nearly impossible to
write in a general way because, for some languages and
registries, script-mixing is fairly common (and not particularly
confusing).  Use of Romaji (Latin characters) embedded in
Japanese strings is one common example.


Is that a reasonable enough summary of the discussions that we
can start to use it as a basis for making some final (insofar as
things are ever final) decisions?

       john


p.s. I've also found some confusion on the list as to whether
the proposal was about "protocol" versus "contextual" rules
versus "registry restrictions".   The first two are the same,
just different ways of expressing how the data are organized.
Registry restrictions are, of course, different.
