Unicode 7.0.0, (combining) Hamza Above, and normalization

Fri Aug 8 01:23:20 CEST 2014

--On Thursday, August 07, 2014 21:47 +0000 "Whistler, Ken"
<ken.whistler at sap.com> wrote:

> Paul Hoffmann asserted:
> 
>> Right. To me, the current processing under NFC is the wrong
>> result. Andrew was a bit polite at the end of his message,
>> but it sounds to me that he thinks the NFC processing for the
>> new character leads to the wrong result when compared to
>> earlier NFC processing.
> 
> The issue for the table update comes down to that.

Indeed.

> I think it is quite clear, however, that it is not the case
> that "the current processing under NFC is the wrong result".

I would not have phrased it the way Paul did, partially because
I believe it is perfectly possible to believe (despite my
concerns about the statements, especially about languages and
unification within a script, in Section 2.2 of The Unicode
Standard) that exactly the right decisions have been made for
Unicode and about the language(s) involved while also believing
that the consequences of those decisions are wrong for IDNA
which has absolutely no language context or idea of what that
might mean.

To state that differently and more strongly, it is entirely
consistent with the IDNA design for someone to say "that is
irrelevant" every time someone says something about the
needs, implications, or character usage of a particular
language, at least within the context of a given script that
language shares with other languages.  I think almost everyone
so far has been trying to avoid getting into a position that
extreme (which I, at least, appreciate), but a number of
statements about languages and how the characters are used lead
me to believe that we (that is two communities, not just two
individuals) are just not communicating with each other.

> The premises of this argument all come down to implicit (or
> occasionally explicit) assertions that the beh-with-hamza
> encoded for the Fula implosive b is the *same* character as an
> existing Arabic beh character followed by the combining Hamza
> mark.
> 
> They are *NOT* the same. And *if* they are not the same, all
> the arguments about NFC being wrong, etc., are pointless.

Again, they can be "not the same" for Unicode purposes and "the
same" for IDNA ones.  
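
For anyone who wants to check the Unicode side of that statement, here
is a minimal sketch in Python. It assumes a Python build whose
unicodedata module is at Unicode 7.0.0 or later; the helper name
same_after is made up for this illustration. It only shows that no
normalization form maps the new code point and the combining sequence
to each other; it says nothing about how IDNA should treat them.

```python
import unicodedata

ATOMIC = "\u08A1"          # ARABIC LETTER BEH WITH HAMZA ABOVE (new in 7.0.0)
SEQUENCE = "\u0628\u0654"  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE

def same_after(form, a, b):
    """True if a and b are identical once both are normalized to `form`."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# U+08A1 has no decomposition mapping, so every form keeps the two apart.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, same_after(form, ATOMIC, SEQUENCE))
```

Each line should report False, which is the "not the same for Unicode
purposes" half of the argument.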

> These implicit assertions that the beh-with Hamza and the
> sequence *ARE* the same are as beside the point as heading
> down the road of citing any number of other possible once
> similarities in appearance: for example, claiming that U+063A
> ARABIC LETTER GHAIN is the *SAME* character as U+0639 ARABIC
> LETTER AIN + U+0307 COMBINING DOT ABOVE sequence, because the
> atomic character and that sequence might look the same.

Speaking personally and as the IDNA person who first noticed the
potential issue, I would think your case would be much stronger
if either:

(1) You or your colleagues showed us code tables and/or strings
of text in which BEH WITH HAMZA ABOVE (U+08A1) and BEH with
HAMZA ABOVE (U+0628 U+0654) looked different when displayed in
the same type style/ font/ calligraphy, preferably more
different than the Chinese zi, Japanese ji, and Korean ja in
their normal national representations but noted in Section 2.2
of The Unicode Standard as Unified into U+5B57; more different
from those Arabic and Perso-Arabic characters that are normally
written differently in Arabic as compared to Persian or Urdu but
unified into a single character code in Unicode; and more
different than "ö" in Swedish and "o-umlaut" in German, both of
which were unified into U+00F6 and then allowed to decompose. I
also note that no one (that I know of) has seriously suggested
unifying U+00F6 and U+00F8 ("ø") in spite of the observation
that there is a fairly strong historic Scandinavian linguistic/
writing system argument for doing so as long as one doesn't
confuse that set of languages with Germanic ones that
predominate further south, so either they are too different to
unify despite arguably being less different than some
Japanese-Chinese or Arabic-Persian representations for
characters that were unified or we are past a set of choices
that can really be explained in a consistent way across scripts.
(The "existing standards" criterion, which I normally find very
persuasive, would seem to mostly support "no new precomposed
forms" in this case, but I'm probably missing something.)  I am
_not_ suggesting that any of those decisions were incorrect,
only that, given the IDNA emphasis on character appearance, it
is difficult for us to accept "not the same" _in that IDNA
context_ if the visual differences are not greater than
characters from different languages (and often uses) that were
unified.

(2) I have no idea what U+08A1 is called in Fula.  Some meager
attempts at research have not turned up anything.  If, and with
no disrespect intended, native speakers of that language called
this character something that would translate into English as
"Capped Zorgle" or that would transliterate into Latin
characters as "Zinglefrob", and you folks had named it in 7.0.0
as ARABIC LETTER CAPPED ZORGLE or ARABIC LETTER ZINGLEFROB (the
name is now, I presume, as immutable as the NFC mapping), we
probably would not be having this discussion -- not because the
discussion wouldn't be as valid (or not) with a different name,
but because it is likely that none of us would have spotted the
(real or imaginary) issue.   Conversely, if a native speaker of
Fula calls this character something that translates into English
(possibly via intermediate translation to Arabic) BEH WITH HAMZA
ABOVE and your choice of names to put into the standard was just
a faithful translation of that, then it seems to me that your
"different character" argument is vastly weakened because it
would indicate that, the sort of subtle linguistic arguments
that typical native speakers of a language rarely understand
(and that are irrelevant to them) notwithstanding, _they_ think
it is "the same character".
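
The code-table facts behind points (1) and (2) can be inspected with a
short Python sketch. It assumes a Python build whose unicodedata
module is at Unicode 7.0.0 or later (older builds will not know a
name for U+08A1); the helper name describe is hypothetical. It prints
the standard's name and canonical decomposition for the code points
mentioned above and makes no claim about appearance in any font.

```python
import unicodedata

def describe(ch):
    """Return (code point, character name, canonical decomposition or '')."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch, "<unnamed>"),
        unicodedata.decomposition(ch),
    )

# U+08A1: the new letter; U+00F6: decomposes; U+00F8: does not;
# U+5B57: the unified zi/ji/ja ideograph.
for ch in ("\u08A1", "\u00F6", "\u00F8", "\u5B57"):
    print(*describe(ch), sep=" | ")
```

Only U+00F6 should show a decomposition (006F 0308); the other three
come back with an empty mapping.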

I know that making analogies across scripts is rife with
potential problems, but, in the hope of understanding the
reasoning about what rates "separate character with no
decomposition" treatment, I note that the Yiddish character
known in that language as Komets-Alef (or Aleph, etc.) is an
independent character very different from the Hebrew use of Alef
(Aleph) with Qamats, has a different name used by native
speakers of Yiddish to identify it, but is coded in Unicode only
as U+FB2F which is treated as a presentation form that does
decompose to U+05D0 U+05B8.   If Yiddish acquired an appropriate
army and navy (apologies to those who don't know that joke) and
UTC were asked to assign a code point to Komets-Alef as a
separate character (which it is, for Yiddish) rather than as a
presentation form (which it is, for Hebrew), would you be
receptive to making that assignment and not having a
decomposition for the new code point?  If not, what is the
difference from this BEH with HAMZA ABOVE situation?   (In the
interest of clarity, if you added such a code point, I think we
would immediately move to DISALLOW it even though doing so would
be a disservice to the Yiddish-speaking (and writing) community.)
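
The current behavior of U+FB2F can be confirmed with a minimal Python
sketch (standard unicodedata module only; the constant names are
invented for this illustration). It shows the existing presentation
form being replaced by the two-character sequence under NFC, which is
the treatment the hypothetical new Komets-Alef code point would have
to avoid.

```python
import unicodedata

KOMETS_ALEF = "\uFB2F"        # HEBREW LETTER ALEF WITH QAMATS
ALEF_QAMATS = "\u05D0\u05B8"  # HEBREW LETTER ALEF + HEBREW POINT QAMATS

# The canonical decomposition mapping recorded for the presentation form.
print(unicodedata.decomposition(KOMETS_ALEF))

# U+FB2F is a composition exclusion, so NFC leaves the decomposed pair
# in place rather than recomposing it.
nfc = unicodedata.normalize("NFC", KOMETS_ALEF)
print([f"U+{ord(c):04X}" for c in nfc])
print(nfc == ALEF_QAMATS)
```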

Especially given the similar Hamza-related cases that slipped by
us a half-dozen years ago, I'm personally not sure that, on
balance, exceptionally DISALLOWing U+08A1 is the right thing to
do, at least without taking other actions we probably agree are
problematic at least in principle.  But I'm pretty sure that
assertions that this is a different character despite the same
name as the combining sequence and, as far as we can tell, an
identical appearance, do not help move us forward.

    john