Follow-up from Tuesday's discussion of digits in the Latin and Arabic Script blocks
Eric Brunner-Williams
ebw at abenaki.wabanaki.net
Tue Dec 2 21:02:34 CET 2008
All,
For those not present during the 2nd session at Minneapolis, a proposal
was offered concerning digits within labels. I'm not advocating for it.
Warning, long.
As background, and nomenclature:
There are the digits 0..9 encoded in the ASCII Coded Character Set in
the range 0x30..0x39, (and in other CCSs, e.g., EBCDIC)
There are the digits 0..9 encoded in iso8859-6 ("Arabic") Coded
Character Set, also in the range 0x30..0x39, (and in other CCS, e.g.,
various IBM) and
There are the digits 0..9 encoded in other ("Farsi") Coded Character
Sets, some in single range, some in dual ranges (again, see IBM for
examples).
These sets of digits are incorporated into the Unicode coded character
set as U+0030..U+0039, which are denoted "latin digits", U+0660..U+0669,
denoted as "arabic-indic digits", and U+06F0..U+06F9, denoted as "eastern
arabic-indic digits". Additionally, the bidirectional character types
associated with these three ranges are "EN", "AN", and "EN", respectively.
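As a minimal sketch, the bidirectional types of the three ranges can be read from the Unicode Character Database with Python's standard `unicodedata` module. The dictionary keys follow the nomenclature of this note; `DIGIT_RANGES` and `bidi_types` are names introduced here for illustration only.

```python
import unicodedata

# The three digit ranges discussed in this note, keyed by the note's own labels.
DIGIT_RANGES = {
    "latin": range(0x0030, 0x003A),
    "arabic-indic": range(0x0660, 0x066A),
    "eastern arabic-indic": range(0x06F0, 0x06FA),
}

def bidi_types(name):
    """Return the set of bidirectional types found across one digit range."""
    return {unicodedata.bidirectional(chr(cp)) for cp in DIGIT_RANGES[name]}

for name in DIGIT_RANGES:
    # Print each range's label with the bidi types of its ten code points.
    print(name, sorted(bidi_types(name)))
```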
The glyph pairs at {U+0664 and U+06F4}, {U+0665 and U+06F5}, and
{U+0666 and U+06F6} are visually dissimilar, and endearingly, the glyph
at U+06F5 looks like a "heart", a point I will come back to. The
remaining glyph pairs are not visually dissimilar, a point which
motivates the proposal at hand.
The proposal was to restrict to only one of these three ranges the
digits which could be in a label. Three adjacent labels forming part of
a FQDN could, under this proposed restriction, each have digits from
each of these three ranges, so long as no label contained digits from
more than one range. The text of the proposal follows:
Begin Quoted Material {
*Numerals*
The ASIWG agreed that in the case of the three sets of Arabic Numerals,
the goal is to ensure uniqueness and homogeneity of use of numerals. It
is also agreed that we cannot ban any of the three sets of numerals at
the protocol level, but that it is advisable to restrict the 3 sets of
numerals from mixing with each other at the protocol level.
_Suggested Protocol rule (to be submitted to IETF upon agreement in
ASIWG)_:
Three sets of numerals are used by the Arabic script community: Arabic
Indic (U+0660..U+0669), Eastern Arabic-Indic (U+06F0..U+06F9), European
(U+0030..U+0039).
The mixing of these numerals with each other results in security
issues. The Arabic script using language communities represented in
ASIWG do not know of any context where multiple sets of numerals are
concurrently used in domain name labels.
Therefore, ASIWG requests the creation of a rule at the IDNA protocol
level (potentially implemented in BIDI): If a numeral from the Arabic
Indic or Eastern Arabic-Indic sets appears in a label, numeral
homogeneity is required.
Assuming implementation of the numeral homogeneity rule at the IDNA
protocol level, ASIWG has recommended to registries implementing Arabic
script to bind labels containing numerals to the other sets of numerals,
thereby limiting the total number of labels to only three per active
label containing a numeral.
_Registry Recommendation_: For whichever active label that contains a
numeral, the registry must register the other two sets and bind them to
go to the same destination as the active registration.
} End Quoted Material
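To make the quoted rule and registry recommendation concrete, here is a rough sketch of what they might look like in code. It is not taken from the proposal: the function names are hypothetical, and it assumes "numeral homogeneity" means that all digits in a label come from at most one of the three sets, checked label by label.

```python
# Code point ranges as given in the quoted proposal; all names are illustrative.
NUMERAL_SETS = {
    "european": range(0x0030, 0x003A),
    "arabic-indic": range(0x0660, 0x066A),
    "eastern arabic-indic": range(0x06F0, 0x06FA),
}

def numeral_sets_in(label):
    """Return the names of the numeral sets whose digits occur in the label."""
    found = set()
    for ch in label:
        for name, cps in NUMERAL_SETS.items():
            if ord(ch) in cps:
                found.add(name)
    return found

def is_homogeneous(label):
    """True if the label's digits all come from at most one numeral set."""
    return len(numeral_sets_in(label)) <= 1

def fqdn_is_homogeneous(fqdn):
    """Apply the check per label, so different labels may use different sets."""
    return all(is_homogeneous(label) for label in fqdn.split("."))

def numeral_variants(label):
    """Rewrite every digit of the label into each of the three sets."""
    variants = {}
    for target, target_cps in NUMERAL_SETS.items():
        table = {}
        for cps in NUMERAL_SETS.values():
            # Map the digit at each position to the same position in the target set.
            for src, dst in zip(cps, target_cps):
                table[src] = dst
        variants[target] = label.translate(table)
    return variants
```

Under this reading, a registry following the recommendation would register one form returned by `numeral_variants` as the active label and bind the other two to the same destination.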
The above proposal would affect the use of digits from the Arabic
Script, and would affect users of languages which use this script, the
twenty two members of the League of Arab States, for which Arabic is an
official language, users in many Western and Central Asian states, for
which Farsi, Dari, Urdu are official languages and Arabic is also used,
and areas of East Asia and Sub-Saharan Africa, where several official
languages are written using Arabic Script, and Arabic is also used, and
in Europe and the Americas. It is a wide-ranging proposal, and within
the affected user communities there are pre-existing conventions with
distinct scopes. "Eastern arabic-indic digits" are used in Farsi, Dari,
Urdu, ... "Latin digits" are used throughout the League of Arab States,
except those influenced by the pan-arab movement led by Nasser in the
60's, Egypt, and its neighbors.
My concerns with this proposal are several, and I omit those having to
do with the process by which the proposal itself came into existence.
My initial concern was that "latin" and "arabic-indic" digits co-exist
in contexts similar to labels. The identifiers for automobiles (license
plates) are "dual-texts"; however, during the course of corresponding with
several persons [1] two more fundamental issues emerged, which should
not be overlooked before attempting to evaluate Tuesday's proposal.
First, the community which has been working on Arabic (and only Arabic)
domain names has assumed that their work product is more restrictive
than a "means to produce and use stable and unambiguous [...]
identifiers" (from the IDNAbis charter). Their goal, paraphrased, is
that "only what is necessary shall be allowed", and "domain names must
be meaningful words". This is a restriction that is outside of the scope
of the IDNAbis WG as chartered. When we are offered as a rationale "...
mixed digits are unnecessary ...", we are being asked to restrict the
IDNAbis WG charter, for a script, but not all scripts. This community
assumes that their rule set(s) will be made in the protocol, with the
scope of effect on users I mention several paras above.
Second, the more recent community which has been working on Arabic
Script, proposed at the New Delhi ICANN meeting and first meeting at the
Dubai regional ICANN meeting and subsequently after the Cairo ICANN
meeting, containing Arabic and Farsi and Urdu literate contributors, the
immediate authors of Tuesday's proposal, also assumes that their rule
set(s) will be made in the protocol, with the scope I mentioned three
paras above.
Both sets of assumptions may prove to be correct: the IDNAbis WG may
reach consensus to place rules that restrict to some definition of
"necessary, but no more" Arabic language convention in the protocol, or
to restrict visually indistinguishable Arabic Script characters in the
protocol, but I assume that this will not be the case, and for the
moment, this is not the case.
With those pre-conditions addressed, and the location of any "rule" left
undefined, whether "in the protocol" or "in the registry policy", there
is the presenting issue: whether "latin" and "arabic-indic" digits (or
"latin" and "eastern arabic-indic" digits) can co-exist in a single
label, and for completeness, whether "arabic-indic" digits and "eastern
arabic-indic" digits can co-exist in a single label.
For the 3rd of these three cases, the issue of visual similarity between
"arabic-indic" digits and "eastern arabic-indic" digits exists only for
the digits "0", "1", "2", "3", "7", "8", and "9", and any rule intended
to prevent visual similarity would be restricted to those 7 values, not
all 10, unless it were shown to be not technically feasible to properly
scope the restriction. I assume that it is technically feasible to
properly scope such a restriction.
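One possible reading of such a scoped restriction, offered only as a sketch: reject a label when it mixes the two Arabic-script digit sets and at least one of those digits has one of the seven look-alike values. The function name and the exact condition are assumptions made for illustration, not text from any proposal.

```python
import unicodedata

ARABIC_INDIC = range(0x0660, 0x066A)
EASTERN_ARABIC_INDIC = range(0x06F0, 0x06FA)
# Digit values this note identifies as visually similar across the two sets.
SIMILAR_VALUES = {0, 1, 2, 3, 7, 8, 9}

def violates_scoped_rule(label):
    """True if the label mixes both sets and involves a look-alike digit."""
    ai = [ch for ch in label if ord(ch) in ARABIC_INDIC]
    eai = [ch for ch in label if ord(ch) in EASTERN_ARABIC_INDIC]
    if not ai or not eai:
        # Digits from at most one of the two sets: no mixing to restrict.
        return False
    return any(unicodedata.digit(ch) in SIMILAR_VALUES for ch in ai + eai)
```

With this condition, a label mixing U+0664 with U+06F5 passes, because both digits are visually distinct from their counterparts in the other set, while a label mixing U+0661 with U+06F5 does not.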
The 1st and 2nd of these three cases are equivalent, so I will only
address the first one. After significant correspondence, the rationale
offered for banning "latin" and "arabic-indic" digits in a label is that
such strings are "culturally and linguistically awkward", a property
shared by many strings composed from the LDH set of characters.
This means that what we are offered in Tuesday's proposal, "a bug" which
necessitates some MUST NOT or MUST language somewhere, the
policy-neutral protocol or policy-specific registry policy, is two
"bugs", the first, visual similarity, the second, ugliness.
In my concluding note last week on the subject with my correspondents I
wrote:
Begin Quoted Material {
... what we do [...] will have to work for a very large number of people
over a very long time, many of whom live in latin predominant areas.
What is inaesthetic now (and for what it's worth, I find 2nd generation
Arabic typography very inaesthetic) may not be in the future, and
banning it is simply an error in contemporary typography, an avoidable
protocol error, and an inappropriate use of IETF time.
Arbitrary rules are possible, and proper only, at the registry level.
} End Quoted Material
Which brings me to true love, or the similarity of the "eastern
arabic-indic" character for the digit "5" and the apparently human glyph
for "heart", and also "emoji" enjoyed by CJK script users, and
conceivably by Cree Syllabics and other script users. I suppose an
emoticon is appropriate here, so ";-)".
There is, in the .ir namespace, a label which contains
"from-my-heart-to-your-heart", with each "eastern arabic-indic" digit
"5" rendered (correctly) as a heart.
If the digit "1", in "latin", or "arabic-indic", or "eastern
arabic-indic" is added to this existing label, to form
"from-my-1-heart-to-your-1-heart", or
"from-my-4-chambered-heart-to-your-three-chambered-heart" (a
mammal-or-aves to reptile-or-amphibian relationship, two chambers if
fish are romantically involved), no visual confusion results, except for
the seven glyph pairs at {U+0660 and U+06F0}..{U+0663 and U+06F3} and
{U+0667 and U+06F7}..{U+0669 and U+06F9}, corresponding to the digits
"0", "1", "2", "3", and "7", "8", and "9", which is entirely tractable
using the well-known "bundle" mechanism at the registry level.
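A registry-level bundle of this kind might be sketched as follows, assuming a bundle is keyed by folding each of the seven look-alike "eastern arabic-indic" digits onto its "arabic-indic" counterpart and leaving the digits 4, 5 and 6 alone. The names `FOLD`, `bundle_key` and `bundle` are hypothetical, and real registries may define bundles differently.

```python
# Digit values whose glyphs look alike in the two Arabic-script sets.
SIMILAR_VALUES = (0, 1, 2, 3, 7, 8, 9)

# Fold each look-alike eastern arabic-indic digit onto its arabic-indic twin.
FOLD = {0x06F0 + v: 0x0660 + v for v in SIMILAR_VALUES}

def bundle_key(label):
    """Return a key that is equal for labels differing only in look-alike digits."""
    return label.translate(FOLD)

def bundle(labels):
    """Group labels by bundle key; each group would share one registrant."""
    groups = {}
    for label in labels:
        groups.setdefault(bundle_key(label), []).append(label)
    return groups
```

Labels that differ only in the visually distinct digits 4, 5 and 6 get different keys, so they stay in separate bundles.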
Again, the question is whether it is technically infeasible to create
rules which achieve a desired outcome and no more.
My conclusion is that Tuesday's proposal for a bidi rule banning mixed
digits is not justified, and that weaker form(s) addressing the two,
distinct rationales, similarity and aesthetics, may be justified in
registry policy, not simply because the IDNAbis WG is presently in
consensus that similarity (aka "phishing"), or more generally, policy,
does not belong in the policy neutral protocol, but additionally because
they are culturally-specific and not appropriate to be universally
scoped. Further, the minimally required rules are technically feasible,
independent of mechanism (protocol or registry policy).
I've intentionally not discussed an edge condition that is, in my
opinion, the rationale for another overly general rule, which is the
interaction of the design failure of associating directionality of any
type to 0x2e (dot) in every context, in particular as a separator of
labels comprising domain names.
I hope everything I've written is in part familiar to everyone, and that
at least one person, other than myself, learned one thing not previously
known, in reading this note. I did while researching it. I thank
everyone who's written me off-list. As always, I could be completely
mistaken on every point of fact, and I attempt to convince no one of
anything, only to state what I understand, and as usual, in a needlessly
complex and dull manner.
Eric
[1] Ram Mohan, Manal Ismail, Sarmad Hussain, Ibaa Oueichek, Abdulaziz
Al-Zoman, Raed Al-Fayez, Alieza Saleh, Siavash Shahshahani, and Ali
Bouallou all provided responses.