Follow-up from Tuesday's discussion of digits in the Latin and Arabic Script blocks

Tue Dec 2 21:02:34 CET 2008

All,

For those not present during the 2nd session at Minneapolis, a proposal 
was offered concerning digits within labels. I'm not advocating for it. 
Warning, long.

As background, and nomenclature:

There are the digits 0..9 encoded in the ASCII Coded Character Set in
the range 0x30..0x39 (and in other CCSs, e.g., EBCDIC).
There are the digits 0..9 encoded in the iso8859-6 ("Arabic") Coded
Character Set, also in the range 0x30..0x39 (and in other CCSs, e.g.,
various IBM), and
There are the digits 0..9 encoded in other ("Farsi") Coded Character
Sets, some in a single range, some in dual ranges (again, see IBM for
examples).

These sets of digits are incorporated into the Unicode coded character
set as U+0030..U+0039, which are denoted "latin digits", U+0660..U+0669,
denoted as "arabic-indic digits", and U+06F0..U+06F9, denoted as "eastern
arabic-indic digits". Additionally, the bidirectional character types
associated with these three ranges are "EN", "AN", and "EN", respectively.
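
The bidirectional types can be checked against the Unicode Character
Database. What follows is a minimal sketch, assuming Python's standard
unicodedata module and the three ranges U+0030..U+0039, U+0660..U+0669,
and U+06F0..U+06F9. The names DIGIT_RANGES and bidi_types are mine, for
illustration only.

```python
import unicodedata

# Illustrative names; the three ranges are those discussed in this note.
DIGIT_RANGES = {
    "latin": range(0x0030, 0x003A),
    "arabic-indic": range(0x0660, 0x066A),
    "eastern arabic-indic": range(0x06F0, 0x06FA),
}

def bidi_types(name):
    """Return the set of bidirectional types found in one digit range."""
    return {unicodedata.bidirectional(chr(cp)) for cp in DIGIT_RANGES[name]}

for name in DIGIT_RANGES:
    print(name, sorted(bidi_types(name)))
```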

The glyph pairs at {U+0664 and U+06F4}, {U+0665 and U+06F5}, and
{U+0666 and U+06F6} are visually dissimilar, and endearingly, the glyph
at U+06F5 looks like a "heart", a point I will come back to. The
remaining seven glyph pairs are visually similar, a point which
motivates the proposal at hand.

The proposal was to restrict to only one of these three ranges the
digits which could be in a label. Three adjacent labels forming part of
a FQDN could, under this proposed restriction, each have digits from
a different one of these three ranges, so long as no label contained
digits from more than one range. The text of the proposal follows:

Begin Quoted Material {

*Numerals*

The ASIWG agreed that in the case of the three sets of Arabic Numerals, 
the goal is to ensure uniqueness and homogeneity of use of numerals.  It 
is also agreed that we cannot ban any of the three sets of numerals at 
the protocol level, but that it is advisable to restrict the 3 sets of 
numerals from mixing with each other at the protocol level.

_Suggested Protocol rule (to be submitted to IETF upon agreement in 
ASIWG)_:

Three sets of numerals are used by the Arabic script community: Arabic 
Indic (U+0660..U+0669), Eastern Arabic-Indic (U+06F0..U+06F9), European 
(U+0030..U+0039).

The mixing of these numerals with each other results in security 
issues.  The Arabic script using language communities represented in 
ASIWG do not know of any context where multiple sets of numerals are 
concurrently used in domain name labels.

Therefore, ASIWG requests the creation of a rule at the IDNA protocol 
level (potentially implemented in BIDI): If a numeral from the Arabic 
Indic or Eastern Arabic-Indic sets appears in a label, numeral 
homogeneity is required.

Assuming implementation of the numeral homogeneity rule at the IDNA 
protocol level, ASIWG has recommended to registries implementing Arabic 
script to bind labels containing numerals to the other sets of numerals, 
thereby limiting the total number of labels to only three per active 
label containing a numeral.

_Registry Recommendation_: For whichever active label that contains a 
numeral, the registry must register the other two sets and bind them to 
go to the same destination as the active registration.

} End Quoted Material
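
To make the quoted rule and registry recommendation concrete, here is a
rough sketch in Python. It is my reading of the text, not anything the
ASIWG supplied. I treat "numeral homogeneity" as "digits from at most
one of the three sets per label", and the "binding" as rewriting a label
into each set. The names DIGIT_SETS, digit_sets_in, is_homogeneous, and
bundle are hypothetical.

```python
# Illustrative names; the ranges are the three given in the quoted proposal.
DIGIT_SETS = {
    "european": range(0x0030, 0x003A),
    "arabic-indic": range(0x0660, 0x066A),
    "eastern arabic-indic": range(0x06F0, 0x06FA),
}

def digit_sets_in(label):
    """Return the names of the digit sets that occur in a single label."""
    return {name for ch in label
            for name, cps in DIGIT_SETS.items() if ord(ch) in cps}

def is_homogeneous(label):
    """True when the label draws its digits from at most one set."""
    return len(digit_sets_in(label)) <= 1

def bundle(label):
    """Return the label rewritten into each of the three digit sets."""
    variants = {}
    for name, target in DIGIT_SETS.items():
        chars = []
        for ch in label:
            source = next((r for r in DIGIT_SETS.values() if ord(ch) in r), None)
            # Non-digits pass through; digits keep their value in the target set.
            chars.append(ch if source is None
                         else chr(target.start + ord(ch) - source.start))
        variants[name] = "".join(chars)
    return variants

# Three adjacent labels, each using a different set, all pass per label.
fqdn = "a1.b\u0662.c\u06f3"
print([is_homogeneous(label) for label in fqdn.split(".")])
```

The check is applied per label, so an FQDN may still carry all three
sets across its labels, as described before the quoted material.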

The above proposal would affect the use of digits from the Arabic
Script, and would affect users of languages which use this script: the
twenty-two members of the League of Arab States, for which Arabic is an
official language; users in many Western and Central Asian states, for
which Farsi, Dari, and Urdu are official languages and Arabic is also
used; areas of East Asia and Sub-Saharan Africa, where several official
languages are written using Arabic Script and Arabic is also used; and
users in Europe and the Americas. It is a wide-ranging proposal, and
within the affected user communities there are pre-existing conventions
with distinct scopes. "Eastern arabic-indic digits" are used in Farsi,
Dari, Urdu, ... "Latin digits" are used throughout the League of Arab
States, except in those states influenced by the pan-Arab movement led
by Nasser in the 1960s, that is, Egypt and its neighbors.

My concerns with this proposal are several, and I omit those having to 
do with the process by which the proposal itself came into existence.

My initial concern was that "latin" and "arabic-indic" digits co-exist
in contexts similar to labels. The identifiers for automobiles (license
plates) are "dual-texts". However, during the course of corresponding
with several persons [1] two more fundamental issues emerged, which
should not be overlooked before attempting to evaluate Tuesday's
proposal.

First, the community which has been working on Arabic (and only Arabic)
domain names has assumed that their work product is more restrictive
than a "means to produce and use stable and unambiguous [...]
identifiers" (from the IDNAbis charter). Their goal, paraphrased, is
that "only what is necessary shall be allowed", and "domain names must
be meaningful words". This is a restriction that is outside of the scope
of the IDNAbis WG as chartered. When we are offered as a rationale "...
mixed digits are unnecessary ...", we are being asked to restrict the
IDNAbis WG charter for one script, but not for all scripts. This
community assumes that their rule set(s) will be made in the protocol,
with the scope of effect on users I mention several paras above.

Second, the more recent community which has been working on Arabic
Script, which was proposed at the New Delhi ICANN meeting, first met at
the Dubai regional ICANN meeting, and met subsequently after the Cairo
ICANN meeting, and which contains Arabic, Farsi, and Urdu literate
contributors, the immediate authors of Tuesday's proposal, also assumes
that their rule set(s) will be made in the protocol, with the scope I
mentioned three paras above.

Both sets of assumptions may prove to be correct. The IDNAbis WG may
reach consensus to place rules in the protocol that restrict to some
definition of "necessary, but no more" Arabic language convention, or
that restrict visually indistinguishable Arabic Script characters, but
I assume that this will not be the case, and for the moment, this is
not the case.

With those pre-conditions addressed, and the location of any "rule" left
undefined, whether "in the protocol" or "in the registry policy", there
is the presenting issue: whether "latin" and "arabic-indic" digits (or
"latin" and "eastern arabic-indic" digits) can co-exist in a single
label, and for completeness, whether "arabic-indic" digits and "eastern
arabic-indic" digits can co-exist in a single label.

For the 3rd of these three cases, the issue of visual similarity between 
"arabic-indic" digits and "eastern arabic-indic" digits exists only for 
the digits "0", "1", "2", "3", "7", "8", and "9", and any rule intended 
to prevent visual similarity would be restricted to those 7 values, not 
all 10, unless it were shown to be not technically feasible to properly 
scope the restriction. I assume that it is technically feasible to 
properly scope such a restriction.
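
As one possible reading of such a scoped restriction, here is a small
Python sketch. It flags a label only when it mixes "arabic-indic" and
"eastern arabic-indic" digits drawn from the seven look-alike values,
and leaves "4", "5", and "6" alone. The function names and the exact
form of the rule are my assumptions, not part of any proposal.

```python
# Digit values whose arabic-indic and eastern arabic-indic glyphs look
# alike, as listed in this note; 4, 5 and 6 are distinct and left out.
SIMILAR_VALUES = {0, 1, 2, 3, 7, 8, 9}
ARABIC_INDIC_ZERO = 0x0660
EASTERN_ZERO = 0x06F0

def similar_digits(label, zero):
    """Return the look-alike digit values the label uses from one block."""
    values = {ord(ch) - zero for ch in label}
    return values & SIMILAR_VALUES

def violates_scoped_rule(label):
    """True if look-alike digits from both blocks occur in one label."""
    return bool(similar_digits(label, ARABIC_INDIC_ZERO)
                and similar_digits(label, EASTERN_ZERO))

print(violates_scoped_rule("\u0661\u06f2"))  # look-alikes from both blocks
print(violates_scoped_rule("\u0664\u06f5"))  # only the distinct 4 and 5
```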

The 1st and 2nd of these three cases are equivalent, so I will only
address the first one. After significant correspondence, the rationale
offered for banning the mixing of "latin" and "arabic-indic" digits in
a label is that such strings are "culturally and linguistically
awkward", a property shared by many strings composed from the LDH set
of characters.

This means that what we are offered in Tuesday's proposal as "a bug",
one which necessitates some MUST NOT or MUST language somewhere, whether
in the policy-neutral protocol or in policy-specific registry policy, is
in fact two "bugs": the first, visual similarity; the second, ugliness.

In my concluding note last week on the subject with my correspondents I 
wrote:

Begin Quoted Material {

... what we do [...] will have to work for a very large number of people 
over a very long time, many of whom live in latin predominant areas. 
What is inaesthetic now (and for what it's worth, I find 2nd generation
Arabic typography very inaesthetic) may not be in the future, and 
banning it is simply an error in contemporary typography, an avoidable 
protocol error, and an inappropriate use of IETF time.

Arbitrary rules are possible, and proper only, at the registry level.

} End Quoted Material

Which brings me to true love, or the similarity of the "eastern 
arabic-indic" character for the digit "5" and the apparently human glyph 
for "heart", and also "emoji" enjoyed by CJK script users, and 
conceivably by Cree Syllabics and other script users. I suppose an 
emoticon is appropriate here, so ";-)".

There is, in the .ir namespace, a label which contains 
"from-my-heart-to-your-heart", with each "eastern arabic-indic" digit 
"5" rendered (correctly) as a heart.

If the digit "1", in "latin", or "arabic-indic", or "eastern 
arabic-indic" is added to this existing label, to form 
"from-my-1-heart-to-your-1-heart", or 
"from-my-4-chambered-heart-to-your-three-chambered-heart" (a 
mammal-or-aves to reptile-or-amphibian relationship, two chambers if 
fish are romantically involved), no visual confusion results, except for 
the seven glyph pairs at {U+0660 and U+06F0}..{U+0663 and U+06F3} and
{U+0667 and U+06F7}..{U+0669 and U+06F9}, corresponding to the digits
"0", "1", "2", "3", "7", "8", and "9", which is entirely tractable
using the well-known "bundle" mechanism at the registry level.
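
A bundle of this kind might be sketched as follows. I assume, purely
for illustration, that a registry folds the seven look-alike "eastern
arabic-indic" digits onto their "arabic-indic" counterparts and treats
labels with the same folded key as one bundle. The names FOLD and
bundle_key are hypothetical and do not describe any registry's actual
practice.

```python
# Assumption for illustration: only the seven look-alike values are folded;
# U+06F4..U+06F6 stay as they are because their glyphs are distinct.
SIMILAR_VALUES = (0, 1, 2, 3, 7, 8, 9)
FOLD = {0x06F0 + v: 0x0660 + v for v in SIMILAR_VALUES}

def bundle_key(label):
    """Fold look-alike eastern arabic-indic digits to arabic-indic ones."""
    return label.translate(FOLD)

# Labels that differ only in a look-alike digit share a key.
print(bundle_key("my-\u06f1-heart") == bundle_key("my-\u0661-heart"))
# Labels that differ in the distinct digit "5" do not.
print(bundle_key("\u06f5") == bundle_key("\u0665"))
```

Labels sharing a key would then be registered or blocked together,
which keeps the rule scoped to the seven values and no more.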

Again, the question is whether it is technically infeasible to create 
rules which achieve a desired outcome and no more.

My conclusion is that Tuesday's proposal for a bidi rule banning mixed
digits is not justified, and that weaker form(s) addressing the two
distinct rationales, similarity and aesthetics, may be justified in
registry policy. This is not simply because the IDNAbis WG is presently
in consensus that similarity (aka "phishing"), or more generally,
policy, does not belong in the policy-neutral protocol, but additionally
because these rationales are culturally specific and not appropriate to
be universally scoped. Further, the minimally required rules are
technically feasible, independent of mechanism (protocol or registry
policy).

I've intentionally not discussed an edge condition that is, in my
opinion, the rationale for another overly general rule, which is the
interaction of the design failure of associating directionality of any
type to 0x2e (dot) in every context, in particular as a separator of
labels comprising domain names.

I hope everything I've written is in part familiar to everyone, and that 
at least one person, other than myself, learned one thing not previously 
known, in reading this note. I did while researching it. I thank 
everyone who's written me off-list. As always, I could be completely 
mistaken on every point of fact, and I attempt to convince no one of 
anything, only to state what I understand, and as usual, in a needlessly 
complex and dull manner.

Eric

[1] Ram Mohan, Manal Ismail, Sarmad Hussain, Ibaa Oueichek, Abdulaziz
Al-Zoman, Raed Al-Fayez, Alieza Saleh, Siavash Shahshahani, and Ali
Bouallou all provided responses.