Arabic digits

Thu Dec 4 09:55:01 CET 2008

Short summary: Mostly correct. My conclusion (already expressed
in an earlier mail today) is that the best way to deal with the
equivalence problem is to do automatic parallel registration,
without affecting the protocol.

At 15:00 08/12/04, John C Klensin wrote:
>Hi.
>
>During the last few days, I've managed to read, three times,
>through the entire set of threads on this subject on this list
>and a separate thread on the Arabic script list. 
>
>These notes are written strictly as an individual participant in
>the WG who is anxious to get this issue resolved correctly and
>quickly, not as a document editor or in any other WG role.  They
>attempt to summarize where we stand and what the various
>arguments are, rather than to recommend a specific position or
>outcome.
>
>I am more confused than ever and perhaps more by the discussion
>than about possible outcomes.

In such a long discussion, there are always some misunderstandings
and typos.

>(1) No one involved with the Arabic script has suggested a
>prohibition on mixing the Arabic-Indic digits and the Extended
>Arabic-Indic digits alone.  The debate about whether those can
>reasonably be separated by means of a registry restriction is
>irrelevant to the original proposal which was that, if any
>character from either set of Arabic-Indic digits appeared, no
>character from _either_ the other set or the European digit set
>could appear in the label.   The goal was less to keep the two
>types of Arabic-Indic digits separated from each other than to
>keep either of the two from mixing with European digits.  

This is probably true. But apart from the fact that we are then
no longer talking so much about visual confusability, the
basic nature of the problem (looked at in "mathematical" terms)
is the same.
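
To make the "mathematical" view concrete, here is a minimal sketch of
a per-label check for digit series mixing. The names (DIGIT_ZERO,
digit_series_in, mixes_digit_series) are made up for illustration and
come from no specification; only the three digit series discussed in
this thread are considered. The check is the same whether it is run
by a registry or written into a protocol rule.

```python
# Code point of the digit zero in each series (illustrative selection).
DIGIT_ZERO = {
    "european": 0x0030,               # 0-9
    "arabic-indic": 0x0660,           # U+0660..U+0669
    "extended-arabic-indic": 0x06F0,  # U+06F0..U+06F9
}


def digit_series_in(label):
    """Return the names of the digit series that occur in label."""
    found = set()
    for ch in label:
        for name, zero in DIGIT_ZERO.items():
            if zero <= ord(ch) <= zero + 9:
                found.add(name)
    return found


def mixes_digit_series(label):
    """True if the label contains digits from more than one series."""
    return len(digit_series_in(label)) > 1
```

Whether only Arabic-Indic/Extended Arabic-Indic mixing is checked, or
mixing with European digits too, is just a matter of which series are
put in the table.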

>(2) To the extent to which a set of digits are associated with a
>script, a registry-level restriction on mixing scripts (whether
>strictly using a Unicode definition of script boundaries or
>allowing some flexibility based on popular usage) is almost
>certainly sufficient to deal with any digit issues.  Without
>judging (in this paragraph) where any issues should be resolved,
>digits are special only if they are either not associated with a
>script or if a given script is associated with more than one set
>of digits.   We have agreed, at least so far, that any
>prohibitions on mixed scripts, as mixed scripts, in labels are a
>registry problem and one that, because of many different local
>conventions that mix Unicode Scripts in local practice, cannot
>be resolved in the Protocol and Tables even if we wanted to do
>that.  However, for Arabic script, a "one script" rule is not
>sufficient because the Arabic script category (and blocks)
>contains two full sets of digits.  It appears to be unique among
>scripts in that regard.

The Han block also contains more than one set of digits
(there are simple Han digit characters and then there are
those used on checks and so on to avoid forging amounts
by just adding another stroke, ...). But these, in terms
of properties, aren't even labeled as digits, and don't
cause any input or confusability problems that I could
imagine.
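
As an illustration of the property difference, the following sketch
uses Python's standard unicodedata module (my choice for the example,
not something from this discussion) to compare an Arabic-Indic digit
with a simple Han numeral. The helper name describe is made up.

```python
import unicodedata


def describe(ch):
    """Return (general category, decimal digit value, numeric value).

    Values that the character does not have are reported as None.
    """
    return (
        unicodedata.category(ch),
        unicodedata.decimal(ch, None),
        unicodedata.numeric(ch, None),
    )


# ARABIC-INDIC DIGIT THREE: a decimal digit (category Nd).
arabic_three = describe("\u0663")

# CJK ideograph for "three": an ordinary letter (category Lo) that
# carries a numeric value but no decimal digit property.
han_three = describe("\u4e09")
```

Any rule keyed on the "decimal digit" property therefore would not
touch the Han numerals at all.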

I think that what we learned from this discussion is that
digit series cannot simply be treated as part of a script,
but have to be selected and treated as separate sets,
combined in ways useful to the relevant communities, by
registries.

>(3) The compelling issue that may justify some protocol-level
>treatment here is summarized in one sentence of one of Abdulaziz
>Al-Zoman's notes (although others have said much the same
>thing): "Users type digits without knowing the internal coding
>used".

Yes. Most of your mail is an intelligent guess about the
history and reasons for this fact [elided], but I think we can easily
accept it as a fact.

>In retrospect, I assume that the above is the reason for the
>initial recommendation from the Arabic Language WG that all
>digits be mapped into European ones by the protocol (following
>the Microsoft practice).

Very much so. What seems to have happened is that somebody
told some people in that language community that mapping isn't
really an option. The discussion then drifted towards other
kinds of protocol restrictions that would be easier to make
in the IDNA2008 framework, and that way it got quite muddy.

>   Of course, to be really practical, if
>Microsoft actually codes _all_ digits to European ones, that
>convention would need to be applied to any digit in any script.
>That is clearly implausible at this stage, even if, in
>retrospect, it turned out to be the right thing to do.

Yes. In addition to the issue that the IDNA2008 framework
tries to get rid of mapping, there is a backwards-compatibility
issue (even for Arabic digits of either kind).

[As an aside, I think the issue is less hot in other regions
because these mostly use European numerals for calculations
or use several series side-by-side, which implies that the
system cannot easily internally re-map them. That would certainly
be the case for Japan, where except for spreadsheets, which
handle the differences as a presentation issue, the difference
is reflected in the encoding.]

>Viewed this way, the issue is not about what one vendor has or
>has not done, but about different vendors examining the
>tradeoffs between what we have called "digits are digits" and
>exact representation of characters without information loss and
>reaching different conclusions.  Those tradeoffs include the
>various possible decisions about what code points one actually
>gets when typing specific characters on a given keyboard setup.
>To the extent to which that is the case, it makes sense for us
>to do whatever helps maximize registry choice about what should
>match and what should not.

The best way to maximize registry choice is to document the
issues. I do not think that e.g. a protocol-level prohibition
of per-label digit series mixing would help any registry to
realize that they have to do parallel registration.

>Note that this issue is only very indirectly related to visual
>confusability, phishing, etc.  If it were only about those
>issues, or about so-called homographs, it would not be worth
>having except possibly as recommendations to registries.  It is
>a fundamental issue about the integrity of identifiers, the
>ability to access domain names from within a single script and
>language, and so on.
>
>However, a prohibition on digit-type-mixing doesn't help us very
>much in the example above:  If U codes digits strictly in
>Arabic-Indic form and W codes them strictly in European form,
>then they aren't going to be able to communicate regardless of
>any prohibitions on mixing (regardless of where those
>prohibitions occur and are enforced).  While that is not an
>argument against a prohibition -- the inability to solve all
>possible problems should not prevent us from solving those that
>we can--

The above is also not an argument for prohibition. There
are still no problems that can be solved by a protocol-level
prohibition that cannot be solved by a registry-level
prohibition.

>it does imply that we need to be careful about our
>expectations.   Communication will be possible only if the
>relevant registries enter labels with the same ownership and
>destinations

Yes, parallel registration.
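
A minimal sketch of what I mean by automatic parallel registration,
under the assumption that a registry treats the three digit series
from this thread as equivalent: for a requested label, compute the
same label in every series and register all of them to the same
owner and destination. The names (DIGIT_BASES, to_digit_series,
digit_variants) are hypothetical; an actual registry policy may well
choose a different set of series.

```python
# Code point of the digit zero in each series (assumed selection).
DIGIT_BASES = {
    "european": 0x0030,
    "arabic-indic": 0x0660,
    "extended-arabic-indic": 0x06F0,
}


def to_digit_series(label, target):
    """Rewrite every digit of a known series into the target series."""
    base = DIGIT_BASES[target]
    table = {}
    for zero in DIGIT_BASES.values():
        for value in range(10):
            table[zero + value] = chr(base + value)
    return label.translate(table)


def digit_variants(label):
    """All labels a registry would register in parallel for label."""
    return {to_digit_series(label, name) for name in DIGIT_BASES}
```

A label without digits yields just itself, so nothing changes for
such labels. All of this happens at the registry; the protocol is
not affected.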

>and if any testing on equivalence of URIs
>recognizes the issues and deals with the two forms as equivalent.

Testing of equivalences of URIs is not a black-and-white business!
I suggest people go and read Section 6 of RFC 3986, and
the IRI-specific parts of Section 5 of RFC 3987.

>(5)   Independent of the internal coding issues described above,
>we have cases in both the Western Arabic Script world (e.g.,
>Arabic Language) and the Eastern one (e.g., Pakistan and, I
>gather from Sarmad's comments, several languages) in which
>European digits are used interchangeably with local Arabic-Indic
>ones (Arabic-Indic digits in the West and Extended Arabic-Indic
>ones in the East).  But we have no users of Arabic script who
>have described a _requirement_ case for mixing either
>Arabic-Indic digits or either one with European digits in the
>same label (independent of speculation about what some people
>might want to do).

Agreed.

>(6) Several notes indicated that we had a protocol-level rule
>against mixing scripts.   There must be some confusion, because
>there is no such rule.  Indeed, it would be nearly impossible to
>write in a general way because, for some languages and
>registries, script-mixing is fairly common (and not particularly
>confusing).  Use of Romanji (Latin Characters) embedded in
>Japanese strings is one common example.

True.

>Is that a reasonable enough summary of the discussions that we
>can start to use it as a basis for making some final (insofar as
>things are ever final) decisions?

Definitely a step forward. I think at least on the main IDNA
list, a lot of the confusion was due to the fact that the
problem(s) were not described very well initially.

I have also given my recommendations for decisions inline,
as well as in an earlier mail today.

Regards,    Martin.

>       john
>
>
>p.s. I've also found some confusion on the list as to whether
>the proposal was about "protocol" versus "contextual" rules
>versus "registry restrictions".   The first two are the same,
>just different ways of expressing how the data are organized.
>Registry restrictions are, of course, different.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst at it.aoyama.ac.jp