Tamil Numerals in IDNA - Re: WG Last Call for Four Primary IDNABIS I-Ds

Fri Aug 21 17:55:43 CEST 2009

Elisabeth,

It appears that you are viewing all requests or suggestions from
language communities as equivalent to each other and, more
important, as a situation in which "...different French or Tamil
registries may adopt different mapping practices...".  In case
there is still confusion, let me stress two things:

(1) The specifications _forbid_ a registry doing _any_ mapping
(converting one character or code sequence into another) other
than

	(i) testing for and, if necessary, applying Unicode
	Normalization Form C (NFC)

	(ii) Converting one Unicode encoding form (e.g., UTF-8)
	into another (e.g., UTF-32) for consistency with
	internal character storage.

Registrations occur in terms of final characters only.  Of
course, the IETF, as a voluntary standards body, cannot enforce
that rule.  But it cannot enforce any other rule either.  Those
who do not follow the standards may encounter interoperability
problems and may be subject to other authorities.
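
As a minimal sketch of the two permitted operations in (1), assuming
a Python environment: the function names below are hypothetical and
are not taken from the specifications. Note that nothing else about
the label is changed; in particular, no case folding is applied.

```python
import unicodedata


def prepare_for_registration(raw_utf8: bytes) -> str:
    """Decode a UTF-8 label and put it in NFC; no other mapping is done."""
    label = raw_utf8.decode("utf-8")
    # (i) Test for NFC and apply it only if the label is not already in NFC.
    normalized = unicodedata.normalize("NFC", label)
    if normalized != label:
        label = normalized
    return label


def to_internal_storage(label: str) -> bytes:
    """(ii) Convert to another encoding form (here, big-endian UTF-32)."""
    return label.encode("utf-32-be")
```

For instance, "e" followed by U+0301 (COMBINING ACUTE ACCENT) comes
back as the single code point U+00E9, while "Abc" comes back
unchanged because mapping upper case to lower case is not one of the
permitted operations.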

(2) The request from the Sri Lankan Tamil community asks for
exclusion of a range of characters, i.e., classifying them as
DISALLOWED.  If they are DISALLOWED, then no registry is
permitted to register strings that contain them and no lookup
application should look up any string that contains them either.
That case produces absolutely predictable behavior.  It also
does not violate any fundamental rules of Unicode or of the
IDNA2008 protocol design -- an exception would just be made to
treat some collection of characters as DISALLOWED that would
otherwise be PVALID.
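
The predictability of the DISALLOWED case can be sketched as follows.
This is an illustration under stated assumptions, not the actual
exception list: the code point range used (the Tamil digits, U+0BE6
through U+0BEF) and all names are hypothetical, and a real
implementation would apply the other IDNA2008 checks as well.

```python
# ASSUMPTION: illustrative range only (Tamil digits U+0BE6..U+0BEF).
# The actual characters in the request are not listed in this message.
ASSUMED_DISALLOWED = frozenset(range(0x0BE6, 0x0BF0))


def contains_disallowed(label: str) -> bool:
    """True if any code point of the label is in the assumed DISALLOWED set."""
    return any(ord(ch) in ASSUMED_DISALLOWED for ch in label)


def registry_may_register(label: str) -> bool:
    """A registry must refuse any label containing a DISALLOWED code point."""
    return not contains_disallowed(label)


def application_may_look_up(label: str) -> bool:
    """A lookup application applies the same test before querying the DNS."""
    return not contains_disallowed(label)
```

Both sides apply the same test to the same string, so the outcome
does not depend on the registry, the language, or the top-level
domain involved.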

What you are asking for where French is involved is a very
different situation.  First, you want certain characters treated
differently for some languages that use a given script than for
others that use the same script.  That is nearly impossible to
think about, just because there is, in general, no way to know
what language a particular label is supposed to be associated
with, nor is there a way to know what top-level domain has the
label in one of its subtrees (even if one could reliably
associate top-level domains with languages).  Second, if I
understand your latest note correctly, you would like to have
those characters treated via some contextual rule ("CONTEXTO").
But the contextual rules yield either "valid" or "invalid" based
on adjacent or nearby characters -- they do not provide
different mappings, nor different rules for different languages
(the latter at least partially for the reasons above).  And,
finally, your suggestion requires treating capital letters (or
at least some capital letters) as distinct from their lower-case
forms, which would create massive inconsistencies with IDNA2003
(not just the two characters of inconsistency with which we have
had such extensive debates) as well as inconsistencies with
DNS and host table practices that go back to the 1970s.  No
matter how strong your justification, and even if it were not
also tied to differential treatment for a particular language, I
cannot imagine the WG (or the IETF more broadly) agreeing to
that change.
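
To illustrate what a contextual rule can and cannot do, here is a
minimal sketch modeled on the MIDDLE DOT (U+00B7) rule as I
understand it from the IDNA2008 tables document: the character is
acceptable only between two "l" characters. The function names are
hypothetical. The point is that the rule returns only valid or
invalid; it takes no language parameter and it never rewrites the
label.

```python
MIDDLE_DOT = "\u00b7"


def middle_dot_in_context(label: str, index: int) -> bool:
    """True only if the character at index sits between two 'l' characters."""
    if index == 0 or index == len(label) - 1:
        return False  # no character before it, or none after it
    return label[index - 1] == "l" and label[index + 1] == "l"


def passes_contextual_check(label: str) -> bool:
    """Check every MIDDLE DOT in the label; the label itself is not altered."""
    return all(
        middle_dot_in_context(label, i)
        for i, ch in enumerate(label)
        if ch == MIDDLE_DOT
    )
```

A rule of this shape has nowhere to express "treat this capital
letter differently when the label is French"; it sees only the
neighboring code points.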

Another part of the difference is that the Tamil script is used
to write only one language or, depending on how one counts, a
small collection of very closely related languages.   That makes
thinking about an exception request much easier than it is with
the Latin script, which is used to write a very large number of
languages, some of them with no recent (e.g., conservatively in
the last 3000 years or so) linguistic relationship to each other
and that use the script in different ways.  That is a
long-standing historical problem; there is nothing that the WG
can do about it today other than to recognize it and move on.

So I don't see the analogy you are drawing as being at all
accurate, nor does it appear to me to be helpful to dealing with
either the perceived Tamil numeral problem or the French
majuscule one.

I believe that your inferences about single "IDN tables", etc.,
are 

regards,
   john

p.s. My impression is that the WG is not going to accept this
change unless it can be demonstrated that these characters pose
serious risks (beyond confusability, even within the script) and
that the problems cannot be addressed by registry restrictions.
That would be consistent with other decisions in the past.
However, that is a personal impression as of right now.  It is
not a statement about the WG opinion, nor is it a prediction.
And it is certainly not a statement about my personal
preferences (in part because I don't have one yet).  Things
could change as discussion continues.

--On Friday, August 21, 2009 13:38 +0200 Elisabeth Blanconil
<eblanconil at gmail.com> wrote:

> Dear colleagues,
> 
> This is a collective mail which discusses a strategic
> non-disclosed IUCG proposition in the multilinguistics area.
> It has been reviewed by the IUCG Chair and will be published
> on the IUCG site.
> 
> This is a new case of orthotypography based difficulty. The WG
> is in theory only interested by characters, translated in
> unicode points. However, in many cases natural languages are
> interfering or quoted in its Rationale. The WG must officially
> say how this is to be consistently addressed.
> 
> For example, in the WG definitely not considered French case,
> upper-cases that support majuscules are DISALLOWED. This is
> absurd. They should be CONTEXTO. This seems to be equivalent
> here. Otherwise, different French or Tamil registries may
> adopt different mapping practices, confuse users, and
> introduce semantic addressing related problems. This means
>...