Unicode 7.0.0, (combining) Hamza Above, and normalization for comparison

John C Klensin klensin at jck.com
Thu Aug 7 18:09:40 CEST 2014

--On Wednesday, August 06, 2014 14:01 +0200 JFC Morfin
<jefsey at jefsey.com> wrote:

> At 07:03 06/08/2014, Patrik Fältström wrote:
>> To be honest, I do not think it matters where it is discussed.
> I suggest we keep it discussed here. The reason why is the
> ICANN response to the plaintiffs in the .ir, etc. case. "the
> DNS provides a human interface to the internet protocol
> addressing system". This seems to be a good definition to
> commonly sustain as it is technically true, easy to
> understand, and makes a clear distinction between the human
> and the non-human issues.


I am not sure I understand what you are talking about but, if I
do, it is about an almost completely different topic.  

The actual issue here is extremely narrow and not associated
with subjective (or computable via a distance function) visual
confusability at all.  The issue also applies to a very small
(compared to the total Unicode repertoire) number of characters.
It also involves the relationship between characters/code
points within a single script, while most conversations about
confusability have related to characters (more specifically,
code point sequences) in different scripts.

In the hope that we can at least all be talking about the same
thing, let me try to summarize the issue with the hope that Mark
and I can at least agree about the summary.  At the risk of
being harsh, while I think more informed discussion of the
issues would be helpful, getting to an informed discussion
requires some serious effort to understand the Unicode Standard,
its construction, and its specific treatment of Arabic.  The
explanation below may be helpful to those who have at least a
large fraction of that understanding; those who don't have that
much would be, IMO, well-advised to do some serious reading
before trying to participate.

In some situations, the Hamza combining character is used with a
base character as a pronunciation indicator.  I'm told that, for
the "BEH" base character and a few others, the most common such
use is when Arabic (language) words are written in a
Perso-Arabic context or similar writing system environments.
Hamza is less often used this way for writing contemporary
Arabic language in an Arabic language context, but that usage
has changed over time.   That usage as a pronunciation indicator
has been supported in Unicode for years and years by a combining
sequence using the base character and Hamza Above (a combining
character).
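To make that coded form concrete, here is a minimal sketch using
only Python's standard-library unicodedata module; these two
characters have been in Unicode for many versions, so any Python 3
build should show the same thing:

```python
import unicodedata

# Traditional coding: the BEH base character followed by the
# combining HAMZA ABOVE -- a two-code-point combining sequence.
seq = "\u0628\u0654"

for ch in seq:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# U+0628  ARABIC LETTER BEH
# U+0654  ARABIC HAMZA ABOVE

# U+0654 is a nonspacing combining mark (general category Mn), so
# when rendered it attaches to the preceding base character.
print(unicodedata.category("\u0654"))  # Mn
```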

In the lead-up to Unicode 7.0.0, the Unicode Consortium
apparently got a request to include some characters that are
needed for a North African language that is sometimes written in
Arabic script.  While it looks just like the BEH WITH HAMZA
ABOVE combining sequence (and Unicode even decided to give it
that name), it is really a conceptually separate character.   
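A quick check against the character database (a sketch; Unicode
7.0 data requires roughly Python 3.5 or later for the name lookup
to succeed) shows that the new assignment is a single code point,
not a sequence:

```python
import unicodedata

new_char = "\u08A1"  # added in Unicode 7.0.0
print(f"U+{ord(new_char):04X}  {unicodedata.name(new_char)}")
# U+08A1  ARABIC LETTER BEH WITH HAMZA ABOVE

# A single code point, in contrast to the two-code-point
# combining sequence U+0628 U+0654 that it looks "just like".
print(len(new_char))        # 1
print(len("\u0628\u0654"))  # 2
```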

I think there is no disagreement up to that point, including
about the abstract form of the combining sequence looking "just
like" the newly-assigned character.   Again, this is within the
same script and there are no issues about what is confusingly
similar and what is not.

There are several (I think equivalent) versions of where the
disagreement sets in, but let me try what seems today to be the
most clear.

Section 2.2 of The Unicode Standard seems to be quite clear that
coding in the standard is independent of language and similar
considerations (see, in particular, the subsection titled
"Unification").   Some of us who read that believe that a new
code point for this letter should not be assigned at all and
that, if it is, it should be subject to other rules that
decompose it back into the combining sequence.  However, section
2.2 makes clear that there are multiple considerations or
"design principles" (it lists ten of them), that it may not be
possible to apply them all and get a consistent result in a
particular case, and that it is necessary to strike a balance.
Presumably on the basis of that balance, and with the precedent
of at least three other characters in the Arabic script that are
also distinct characters for some languages but combining
sequences (if used at all) for others, a new code point, U+08A1,
was assigned without a decomposition back to the existing
combining sequence.
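The absence of a decomposition is directly visible in the
character database.  The sketch below contrasts U+08A1 with
U+0623 (ARABIC LETTER ALEF WITH HAMZA ABOVE), an older character
that does canonically decompose to its base-plus-Hamza sequence
(U+0623 is offered here only as a contrasting case, not as one of
the three precedents mentioned above):

```python
import unicodedata

# U+08A1 was assigned with no canonical decomposition, so nothing
# in normalization maps it back to the sequence U+0628 U+0654.
print(repr(unicodedata.decomposition("\u08A1")))  # ''

# Contrast U+0623 ARABIC LETTER ALEF WITH HAMZA ABOVE, which does
# canonically decompose to ALEF (U+0627) + HAMZA ABOVE (U+0654).
print(repr(unicodedata.decomposition("\u0623")))  # '0627 0654'
```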

Mark (and others associated with the decision) cite the language
issues, the precedents in the other Arabic characters, and so
on.   Those of us who aren't enthused about the new character at
all, but whose real concern is two code point sequences that,
once language identification or considerations are removed,
yield identical characters, are troubled that normalization
doesn't create an equality relationship between them.
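That missing equality relationship can be demonstrated directly:
under every normalization form, the combining sequence and the
new precomposed character remain distinct strings (a sketch;
Unicode 7.0 data means roughly Python 3.5 or later):

```python
import unicodedata

seq = "\u0628\u0654"    # BEH + combining HAMZA ABOVE
precomposed = "\u08A1"  # ARABIC LETTER BEH WITH HAMZA ABOVE (7.0)

# Because U+08A1 has no canonical decomposition, no normalization
# form maps one spelling onto the other.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    same = (unicodedata.normalize(form, seq)
            == unicodedata.normalize(form, precomposed))
    print(form, same)  # every form prints False
```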

IMO, that can't be "resolved" or consensus reached because the
criteria for making the decision are different and lead to
different results.  There are still parts of the criteria that
Unicode is applying that confuse me (with frequent examples
about use of Latin script among European languages and even some
Latin characters being used to denote rather different phonemes
in Hanyu Pinyin than they denote in Western Europe as examples
of the confusion).  I'd like to be wrong.   But, if I'm not,
then we have a mechanism in IDNA for dealing with newly-added
Unicode code points that are problematic for IDNA.   It seems to
me that there is little question that this new character (and at
least its three predecessors) are problematic for IDNA (Mark may
disagree, but I don't think he has said that yet).  The question
then becomes whether the damage that would be done by just
accepting the Unicode decision and allowing U+08A1 to be PVALID
would be greater or less than the potential damage from
excluding it or writing special rules, rules that might, in some
ways, parallel the discussion of Hamza in the "Arabic" section
of The Unicode Standard.

Again, more general discussions about confusable characters,
especially between scripts, are relevant, but not to this thread
and discussion and maybe not in/to this WG or the IETF.  As an
aside, if you dig back through the literature in optical (or
equivalent) character recognition in the pre-Kurzweil period,
you will find quite a bit about abstract character properties
and distance functions that might explain why the latter never
worked very well even with a character repertoire limited to a
single language.  Some things have clearly changed in more than
a half-century, but a lot has not.


More information about the Idna-update mailing list