Tamil Numerals in IDNA - Re: WG Last Call for Four Primary IDNABIS I-Ds

Sat Aug 22 05:40:13 CEST 2009

Hi.

Thanks for the clearly-written note.  I think several of your
conclusions about this WG's work are incorrect.  That is my
opinion.  Others may disagree.  See details below.

--On Saturday, August 22, 2009 07:29 +0530 Gihan Dias
<gihan at cse.mrt.ac.lk> wrote:

> Thanks to everyone who responded to our request.
> 
> Let me summarise the current position from our point of view.
> 
> 1. One of the principal objectives of IDNA is to avoid
> registration of  labels which may cause problems, including
> being visually similar to  other labels.

Actually not.  This has often been an objective in ICANN and
other discussions, but this working group has never had that
aspiration.  The problem is that variations in font designs and
variability in human perception make it basically hopeless,
especially when scripts are unfamiliar to the would-be reader or
there is very little context (as in a domain name consisting of
a small number of characters that may not constitute a word in
any language).

> 2. There are characters in many scripts which are visually
> similar to a  character in another script. IDNA2008 does not
> handle such cases.  Registries which allow mixed-script labels
> should take appropriate steps  to avoid such confusion.

That is certainly correct.  Many of us also recommend that
registries should, if possible, avoid mixed-script labels
entirely.  But, as you understand, those recommendations are not
requirements of IDNA2008.

> 3. The WG has identified that visually similar characters in
> the *same*  script should generally be avoided. In most cases,
> this is achieved by  applying Unicode properties.

Actually, the Unicode properties are not of much help in this
area.  If, for example, the two visually similar characters are
letters in the same script, the Unicode properties will
generally be the same.  Most of the code points that are
eliminated by the use of Unicode properties are symbols,
punctuation, and so-called compatibility characters.  The latter
may not be visually similar at all.

> Where this
> is not possible, as with  ARABIC-INDIC DIGITS and EXTENDED
> ARABIC-INDIC DIGITS, special rules (A.8  and A.9) have been
> introduced in idnabis-tables.txt .

While those special rules may also help with visual confusion,
they arose from considerations of different behaviors in the
"bidi" procedure for handling strings containing right-to-left
characters and, to some degree, possible localization behavior
in some operating systems.  The primary motivation for the WG
was not, at least in my opinion, visual similarity.
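
As a rough illustration of the "bidi" difference, the sketch below
uses Python's standard unicodedata module to compare the
bidirectional class of the two digit sets, and adds a simple check
for labels that mix them.  The function names are hypothetical, and
"do not mix the two sets in one label" is a paraphrase of the intent
of A.8 and A.9, not a quotation from idnabis-tables.

```python
import unicodedata

ARABIC_INDIC = range(0x0660, 0x066A)           # U+0660..U+0669
EXTENDED_ARABIC_INDIC = range(0x06F0, 0x06FA)  # U+06F0..U+06F9


def bidi_classes(code_points):
    """Return the set of Unicode bidirectional classes found in a range."""
    return {unicodedata.bidirectional(chr(cp)) for cp in code_points}


def mixes_digit_sets(label):
    """Hypothetical check: True if a label uses digits from both sets."""
    has_arabic_indic = any(ord(ch) in ARABIC_INDIC for ch in label)
    has_extended = any(ord(ch) in EXTENDED_ARABIC_INDIC for ch in label)
    return has_arabic_indic and has_extended


if __name__ == "__main__":
    # Both sets share general category Nd but differ in bidi class.
    print(sorted(bidi_classes(ARABIC_INDIC)))
    print(sorted(bidi_classes(EXTENDED_ARABIC_INDIC)))
    print(mixes_digit_sets("\u0661\u06f2"))
```

The two sets report different bidirectional classes even though both
are decimal digits, so general category alone cannot tell them apart.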

> 4. Registries - especially gTLDs - cannot be expected to be
> experts on  each script they register, but expect the RFCs to
> provide guidance on  this matter. It is not reasonable to
> expect registries to get together  outside the IETF process
> and form standard sets of rules for each  script. I believe
> that this should be done by IETF.

This is a subject on which reasonable people can disagree, but,
from my perspective, part of the problem is that the IETF
doesn't have the expertise either and does not have organized
mechanisms for reaching out to language communities.  The
Rationale document does recommend that registries, even gTLD
ones, not get involved with scripts that they do not understand;
at some level, that is exactly the expertise that you say above
registries cannot be expected to have.

What some language communities have done is to develop
recommendations about the use of the relevant scripts in IDNs
and request publication of those recommendations as
informational RFCs.   RFC 4713 is an example of this.  A similar
document for the use of the Arabic language is in the RFC
Editor's publication queue.  If you feel that a similar document
would be useful for Tamil, contact me off-list and I'll try to
help with the process.

> 5. Some Tamil digits are very similar to Tamil letters or
> syllables (see  image in my 1st message).

I think we understand and appreciate that.  As you may be aware,
there are very long-standing examples in basic Latin-script
characters that are similar in many fonts, for example the
similarity between the digit "1" and the small letter "l".  You
have the advantage that, because the digits are not in use, it
should be fairly straightforward to construct registry
restrictions against using them.  Registries primarily concerned
with registration of traditional labels do not have that
advantage.
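
A registry restriction of that kind could be as small as the sketch
below.  It assumes the range U+0BE6..U+0BEF given in point 8 of the
quoted message; the function name and the policy itself are
hypothetical and are not taken from any registry's actual rules.

```python
# Tamil digits, using the range quoted in point 8 of the message.
TAMIL_DIGITS = range(0x0BE6, 0x0BF0)  # U+0BE6..U+0BEF


def registry_accepts(label):
    """Hypothetical registry policy: refuse labels with a Tamil digit."""
    return not any(ord(ch) in TAMIL_DIGITS for ch in label)


if __name__ == "__main__":
    # A label of Tamil letters only, then one containing TAMIL DIGIT ONE.
    print(registry_accepts("\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"))
    print(registry_accepts("\u0ba4\u0be7"))
```

Because the check lives in registry policy, it leaves the protocol
tables untouched and can be adopted by one registry at a time.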

> 6. Tamil digits are not in contemporary use in India, Sri
> Lanka or  elsewhere (I think other WG members can verify this).

OK.

> 7. If Tamil digits were a specific Unicode block, or had an
> identifiable  Unicode property,  the WG would be inclined to
> accede to our request.  However, the WG is disinclined to
> disallow characters on a case-by-case  character analysis
> [this is my reading of the comments received].

I don't know whether the WG would accede to the request even if
the decision could be made by property or block.  The debates
about some of the special cases that have been dealt with in the
past have been quite extended and heated.  Perhaps the best example, which
was easily handled on an algorithmic basis once the WG decided
to do it, is the restriction on Korean Jamo.  Certainly the WG
is less inclined to disallow characters on an exception list
basis than to do so on an algorithmic (property combination or
block) one.  But I think most of us consider disallowing any
letter or digit (or group of them) to be a fairly major
decision, so any such proposal would be likely to meet
considerable resistance.

> 8. Tamil digits have Unicode property "Nd" (decimal numeral)
> and are in  the "Tamil" block, and thus cannot be easily
> differentiated by a rule.  The only way to treat 0BE6..0BEF as
> DISALLOWED is to add them to the  exceptions table one by one.
> I.e. add them to section "2.6. Exceptions  (F)" with the
> explicit value DISALLOWED.

That is correct.
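
The point can be checked with Python's standard unicodedata module.
The sketch below is a heavily simplified, hypothetical model of a
derivation step: it consults an exceptions table first and otherwise
looks only at general category "Nd".  The names derive and EXCEPTIONS
are illustrative, and the real procedure in idnabis-tables has many
more rules than are modelled here.

```python
import unicodedata

TAMIL_DIGITS = range(0x0BE6, 0x0BF0)  # U+0BE6..U+0BEF

# Hypothetical exceptions table listing each Tamil digit explicitly.
EXCEPTIONS = {cp: "DISALLOWED" for cp in TAMIL_DIGITS}


def derive(cp, exceptions):
    """Simplified sketch: exceptions win, then decimal digits are valid."""
    if cp in exceptions:
        return exceptions[cp]
    if unicodedata.category(chr(cp)) == "Nd":
        return "PVALID"
    return "NOT-MODELLED"  # every other rule is outside this sketch


# Every Tamil digit has the same general category as other decimal digits.
categories = {unicodedata.category(chr(cp)) for cp in TAMIL_DIGITS}

if __name__ == "__main__":
    print(categories)
    print(derive(0x0BE7, {}))          # no exceptions: treated like any digit
    print(derive(0x0BE7, EXCEPTIONS))  # listed one by one: disallowed
    print(derive(0x0967, EXCEPTIONS))  # DEVANAGARI DIGIT ONE is unaffected
```

With an empty exceptions table the Tamil digits fall out of the
category test exactly as Devanagari or ASCII digits do, which is why
only per-code-point entries can single them out.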

> I believe that this case is similar to the ARABIC-INDIC
> DIGITS, and  should be treated similarly. However, in this
> case, the solution is  simpler, as the characters need only be
> DISALLOWED and no contextual  rule is needed.

As discussed above, the reason for giving special attention to
the ARABIC-INDIC DIGITS was less visual similarity than some
interactions with the "bidi" rules and the fact that they
represent the only case I'm aware of in which there are two
distinct sets of digits within the same script.

I also note that you wrote...

> P.S. I have not addressed other Southern Indic digits.

While that is reasonable from your standpoint, another issue
with the possibility of disallowing Tamil digits is that, for
consistency and predictability, I think we really would need to
examine all of those scripts, and perhaps other scripts from
around the world where traditional digits have been replaced by
European ones, to see if similar restrictions were appropriate.
The other thing that makes the [western] ARABIC-INDIC DIGITS
interesting in this regard is that we did consider a proposal to
disallow those digits in favor of allowing European digits only.
After extended discussion and case analysis, the WG decided that
proposal was not plausible given the range of uses of the script
and the traditional digits (at least that is how I remember the
discussion).

Personally, I still haven't formed an opinion on this although
it is probably clear how I'm leaning.  From my point of view at
the moment, disallowing these digits, especially at this late
stage in the WG's work, is problematic enough that we should do
it only if at least one, and probably both, of two conditions
are met:

	(1) It is demonstrated that adequate protection cannot
	be obtained with registry restrictions, even after a
	reasonable effort has been made to inform registries
	that are not deeply familiar with Tamil.  I believe that
	reasonable effort should include registration of
	Language-Script tables with ICANN as Cary Karp suggested
	and might include publication of an Informational RFC
	that explains the issues.

	(2) It is demonstrated that the potential for harm is
	far greater than can be accounted for by visual
	similarity alone.

Again, this is just my personal opinion -- I'm not in any way
speaking for the working group.

sincerely,
  john