FW: Your statement on Identifiers and Unicode 7.0.0

Thu Feb 5 19:45:45 CET 2015

--On Thursday, February 05, 2015 06:48 +0000 Raed Al-Fayez
<rfayez at citc.gov.sa> wrote:

> Dear John,
> 
> Sorry! ... I can see that you might be irritated by the
> concept of excluding the "non-combining characters" when you
> linked it to Latin script. BTW, the context of our request was
> for the Arabic Script. We tend not to generalize our finding
> to other scripts.

Concerned, not irritated.  I think "not generalizing" is
entirely appropriate; I just wish that you (and many others)
would be specific when you are doing that, if only to avoid your
statements being taken, by others, as supporting scripts and
languages for which you would not claim to be speaking and, even
more important in this context, as support for ideas that I'm
quite certain neither of us understand.

Whether "no combining characters" is an appropriate rule or not
for languages written in the Arabic script that have a very
different phonetic structure from the Arabic language -- whether
the Arabic script is in active contemporary use for those
languages or not -- is far beyond my competence to judge
(although not beyond my competence to be concerned that
important cases may exist).  Fortunately, at least for those
languages that are primarily or exclusively written in the
Arabic script today, users and experts are available to speak
up.  I can only hope that they do so.

> Again, our view is that, the current IAB statement recommended
> to exclude many code points from the Arabic script; at least
> three of those code points are essential code points for a
> number of widely used languages in the Arabic script ( Arabic,
> Farsi, Urdu, Jawi, Pashto ..etc). Almost all Arabic script
> IDN-TLDs are using them in domain name registrations! So,
> IAB's recommendations affects so many users and domains
> without any logic; since some of the excluded code points
> already have suitable normalization rules in place! 

As I have said before, I think the IAB statement has to be
considered in its entirety, as a warning about a problem that
was, when it was composed, only partially understood (I think it
is _still_ only partially understood, but that is another
matter).  While its apparent focus on Arabic and Arabic examples
is unfortunate, the IAB (and the IETF in its work on identifiers
for use in protocols more generally) are as much victims of the
circumstances that identified the situation as the Arabic script
is: 

	* While IDNA2008 was still being designed, the WG was
	told that the problem of identical-appearing characters
	within a given script that would not be treated as equal
	after normalization did not exist; 

	* the problem was detected because Unicode 7.0.0 added a
	new code point that did not decompose into the existing
	combining sequences despite the WG (and the broader
	community) being told that no more such characters would
	be added; and 

	* the first reactions from the Unicode technical
	community were not "you were misled and character
	assignments like this exist for code points scattered
	all over Unicode, including ones assigned before Unicode
	5.0, ones assigned more recently, and we expect to
	assign more in the future".  Instead, we were given
	pointers to two separate subsections of the Arabic
	discussion in the Unicode Standard (Section 9.2 in
	Unicode 7.0).  Those subsections could reasonably be
	interpreted by a person with little or no familiarity
	with Arabic (a group that certainly includes most of the
	IAB) as "at least as treated in Unicode, Hamza is a
	problem and turns anything it touches or appears with
	into a bigger problem".  
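
The normalization gap described in the first two points can be
sketched in a few lines.  This is only an illustration under stated
assumptions: it takes the Unicode 7.0.0 addition to be U+08A1
(ARABIC LETTER BEH WITH HAMZA ABOVE), which the text above does not
name, and it uses Python's standard unicodedata module.  The helper
name is hypothetical.

```python
import unicodedata

# Assumption: U+08A1 is the code point added in Unicode 7.0.0 that the
# message refers to; the message itself does not name it.
BEH_WITH_HAMZA_ABOVE = "\u08A1"
BEH_PLUS_COMBINING_HAMZA = "\u0628\u0654"   # BEH + combining HAMZA ABOVE

# An older precomposed letter, for comparison.
ALEF_WITH_HAMZA_ABOVE = "\u0623"
ALEF_PLUS_COMBINING_HAMZA = "\u0627\u0654"  # ALEF + combining HAMZA ABOVE


def equal_after_nfc(a: str, b: str) -> bool:
    """Hypothetical helper: are two strings identical after NFC?"""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    # The older letter has a canonical decomposition, so the precomposed
    # form and the combining sequence normalize to the same string.
    print(equal_after_nfc(ALEF_WITH_HAMZA_ABOVE, ALEF_PLUS_COMBINING_HAMZA))

    # U+08A1 has no decomposition, so two strings that may look the same
    # to a reader stay distinct after normalization.
    print(equal_after_nfc(BEH_WITH_HAMZA_ABOVE, BEH_PLUS_COMBINING_HAMZA))
```

The comparison only shows that normalization does not unify the two
forms; it says nothing about which form, if either, ought to be
permitted in identifiers.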

From that point of view --and even from the perspective of
whether combining characters for Arabic should have been
assigned code points at all-- your fundamental problems are with
Unicode, how it handles Arabic, and perhaps the advice the
Standard gives about Arabic in various versions of the Standard.
Blaming those issues on the IAB, or on this statement, isn't
going to help very much because, even if the IAB updates and
clarifies the statement to your satisfaction, it won't change
the underlying problems.

Personally and in retrospect, I wish that the IAB statement had
been a little more careful to explain that the list of
characters it included was not all one group, posing equal
danger.  But I'm not sure it would have made any difference
given the number of people who have responded to it by
complaining about particular aspects of it without showing
understanding of the rest of the statement.   That was, if I
recall, one of the IAB's concerns -- that the longer and more
precise they made the statement, the fewer people would actually
read and try to understand all of it rather than just finding
something to which to react or reacting on the basis of comments
made by others.  In retrospect, I don't think they got the
balance right.  But I'm not sure that "right" was actually
possible.

> Personally, I am against the concept of excluding any code
> points without a full study for the root cause of the problem
> and after consulting experts from that Scripts. 

Again, speaking more broadly than Arabic, a strategy that does
not exclude any code point without full study and deep
understanding takes us into a realm in which we would have to
regularly invalidate existing code points.  No one I've talked
with thinks that is a good idea; some believe it should be made
impossible no matter how severe the possible risks are
understood to be once the characters are more fully understood.
The latter would, in practice, never allow a character or code
point to be excluded.

That is one of the reasons IDNA2008 went to an inclusion-based
model rather than the exclusion-based one that dominated the
thinking about IDNA2003.  Those of us who were involved in the
earliest design stages that led to IDNA2008 even proposed that
the current "PVALID" category consist of two parts: characters
the community was confident, after appropriate study, were ok
and characters that were probably ok but that had not yet been
studied sufficiently.   The latter were recommended for use only
within communities that were sure they understood them and for
registrants who were willing to assume the risk of their labels
being invalidated if problems were discovered.  In retrospect,
it was probably a good idea.  Certainly combining characters
with the Arabic script would have ended up on the second list
initially and might have then been more easily moved to
"disallowed"  if study efforts indicated they were not needed
for any language written in Arabic script (or if the Unicode
Consortium made it clear that they were willing to add more
precomposed character code points as needed).
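
As a rough sketch of the difference between the two models, and of the
two-part split that was proposed, consider the following.  Everything
in it is hypothetical: the tables, the status names, and the placement
of individual code points are invented for illustration and are not
taken from IDNA2003, IDNA2008, or any registry policy.

```python
from enum import Enum


class Status(Enum):
    STUDIED = "studied and believed ok"
    PROVISIONAL = "probably ok, not yet studied sufficiently"
    DISALLOWED = "disallowed"


# Hypothetical exclusion list: anything not named here is allowed.
EXCLUDED = {0x0640}  # ARABIC TATWEEL, chosen only as an example

# Hypothetical inclusion table: anything not named here is disallowed.
REVIEWED = {
    0x0628: Status.STUDIED,      # ARABIC LETTER BEH
    0x0654: Status.PROVISIONAL,  # ARABIC HAMZA ABOVE (combining)
}


def allowed_by_exclusion(cp: int) -> bool:
    """Exclusion-based: a code point nobody has looked at is allowed."""
    return cp not in EXCLUDED


def classify_by_inclusion(cp: int) -> Status:
    """Inclusion-based: a code point nobody has looked at is disallowed."""
    return REVIEWED.get(cp, Status.DISALLOWED)


def demote(cp: int) -> None:
    """Move a provisional code point to disallowed after further study."""
    if REVIEWED.get(cp) is Status.PROVISIONAL:
        REVIEWED[cp] = Status.DISALLOWED


if __name__ == "__main__":
    new_cp = 0x08A1  # a newly assigned code point that is in neither table
    print(allowed_by_exclusion(new_cp))   # allowed by default
    print(classify_by_inclusion(new_cp))  # disallowed by default
```

The point of the provisional tier is that demote() affects only labels
whose registrants had already accepted that risk, whereas removing a
code point from a single "ok" list invalidates labels without warning.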

> However,  if
> the IAB intention was just to raise a warning to the
> community, then I believe that they should exclude the
> "non-combining characters" in the Arabic script (BTW, no
> Arabic IDN TLD registry use non-combining characters till now)
> rather than excluding some essential letters (that does not
> have any problems and is used by many Arabic IDN TLD
> registries).

See above.

best regards,
    john