Combinations with HIGH HAMZA (U+0674) (was: Re: Re: ASIWG feedback on IDNA200X for Arabic Script)

Fri Jul 18 02:22:13 CEST 2008

> --On Saturday, 12 July, 2008 13:30 -0400 Eric Brunner-Williams
> <ebw at abenaki.wabanaki.net> wrote:
> 
> > Agree. If HIGH HAMZA (0674) is combining in Kazakh ("forms
> > digraphs", a textual claim, author unknown)

Author: Dr. Joseph Becker. This textual claim dates from
Unicode 1.0 (1991).

Keep in mind that as of 1991, Arabic character additions
beyond the core Arabic characters of ISO 8859-6 (ultimately
deriving from ASMO 449) were an innovation in character encoding,
up to then seen only in XCCS and in much more limited sets
from IBM for a few major languages. The annotations added
at the time for Unicode 1.0 were in part a shorthand set
of justifications for additions of such characters, pointing
out languages which used them in their orthographies.

If you want more details, the digraphs in question are the
vowel digraphs used when Kazakh is written in the Arabic
script. These correspond to the common Turkic vowels
written in the Latin script as ä, ö, ï, ü, but which don't
have natural, obvious representations in the Arabic script.
For exhibits, see:

http://www.omniglot.com/writing/kazakh.htm

That page shows the four digraphs in question written with
the ordinary Arabic hamza (U+0621), instead of a high hamza
letter, but I presume this is a typographical alternation,
and that Dr. Becker had printed examples showing the high
hamza used in such digraphs.

See also:

http://en.wikipedia.org/wiki/Kazakh_alphabet

for a cross-comparison (not completely correct, I think)
between the Cyrillic, Latin, and Arabic orthographies.
In that comparison you do see the high hamza in use.

> > and is a character
> > in Jawi (informed user, national standard, etc), that is, is
> > both A and not A, which is the controlling property?

Jawi uses ordinary hamza, U+0621, along with some ordinary
Arabic letters whose decompositions use the combining Arabic
hamza above, U+0654.

This is not a matter of A and not A, by the way.

The General_Category property for U+0674 is gc=Lo, as for
all other Arabic letters. It is technically not a "combining
mark" as defined by the Unicode Standard. The status of
a letter as participating in a "combination" identified
as a digraph does not in this case make U+0674 a
combining mark, any more than the participation of the
letter "h" in the English digraphs "th" and "ch" would
make that ASCII character a combining mark.

On the other hand, the combining Arabic hamza, U+0654,
*is* a combining mark that stands above the skeleton
of the basic Arabic letter forms and which is decomposed
as a separate mark in some representations.
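
For anyone who wants to check these properties directly, here is a
minimal sketch using Python's standard unicodedata module. The helper
name describe() and the choice of U+0623 (alef with hamza above) as
the sample decomposable letter are mine, purely for illustration, and
the values reported depend on the Unicode version bundled with your
Python.

```python
import unicodedata

# The three hamza code points discussed in this thread.
HAMZAS = [0x0621, 0x0654, 0x0674]


def describe(cp):
    """Return (code point, name, General_Category, combining class)."""
    ch = chr(cp)
    return (
        f"U+{cp:04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )


for cp in HAMZAS:
    print(describe(cp))

# Illustrative sample (my choice): a letter whose canonical
# decomposition uses the combining hamza U+0654.
print("U+0623 decomposes to:", unicodedata.decomposition("\u0623"))
```

U+0621 and U+0674 should both come back as gc=Lo with combining
class 0, while U+0654 should come back as gc=Mn with a non-zero
combining class. Only the last one is a combining mark.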

> 
> Just my opinion, but...
> 
> Ultimately, if it is required for some languages, then we will
> almost certainly need to permit it and hope that any issues can
> be dealt with at the registry level for registries (zone
> administrators) whose focus is on scripts that don't need it.

Correct. There is no reason, for the Arabic script, not
to include U+0621, U+0674, *and* U+0654.

Indeed, those are all in the recommended list we just
received from the Arabic Script IDN Working Group, even
after their considerations for removal of various Koranic
annotation marks, etc.

> There is actually a separate issue on which I'm not competent to
> comment, but Mark, Ken, Michael or others probably can and
> should.  I'm going to try to state this exactly, but apologize
> in advance if I get it wrong.   
> 
> Suppose there is a glyph that is identical in appearance (and
> maybe even name) to U+0674, and
> 
> 	--  that glyph is used as a stand-alone character in
> 	some contexts (e.g., in Jawi given the discussions Eric
> 	cites), but
> 	
> 	--  is also used as a combining character (e.g., in
> 	Kazakh, again given the discussions Eric cites)

Actually, as I indicated in the analysis above, it is mostly
the other way round, and the identification of the Jawi
character was incorrect.

> should it be assigned (or should it have been assigned) two
> separate code points, one combining (or digraph forming, which
> is not quite the same thing in what I understand to be the
> Unicode vocabulary)?

There are three code points here: U+0621, U+0654, U+0674.

And note there are ongoing discussions about the need
to encode yet another hamza for the Arabic script, the
so-called "chairless hamza", which has different shaping
behavior than either the basic letter hamza (U+0621) or
the basic diacritic hamza (U+0654).

> Other questions that might complicate this one:
> 
> (1) Are the relevant digraphs for Kazakh (and anywhere else High
> Hamza is digraph-forming) those that follow it in the Unicode
> table, i.e.,
>   U+0675 ARABIC LETTER HIGH HAMZA ALEF
>   U+0676 ARABIC LETTER HIGH HAMZA WAW
>   U+0677 ARABIC LETTER U WITH HAMZA ABOVE
>   U+0678 ARABIC LETTER HIGH HAMZA YEH
> or are there others?

Yes, those are the Kazakh digraphs in question.

> 
> (2) The four characters listed above (U+0675 - U+0678) have
> compatibility decompositions (applied by NFKC) to a base character
> plus U+0674. That implies that, with IDNA2003 and nameprep, they
> cannot appear in U-labels (strings back-translated from
> valid A-labels) -- only the decomposed forms can appear.  Tables
> (both 01 and 02) disallow these characters, effectively
> creating the same situation as occurs in IDNA2003.  I think that
> is what we want, but someone more familiar with Kazakh than I am
> should confirm this.

I think that is correct. You can just spell the Kazakh digraphs
out. Note that this is a bit tricky, because Arabic is a
right-to-left script, so the order in which the characters are
stored is not the order in which you see them laid out.

We'll probably need to bring in an expert on Central Turkic
in Arabic script, but my understanding from the examples would
be that ordinarily when writing Kazakh or similar Turkic
languages in the Arabic script, you wouldn't be explicitly
writing the vowel rounding harmony distinctions (the way
you need to when using the Latin Turkic alphabets). Only
if you *explicitly* want to distinguish in writing the harmonic
pairs would you add the high hamza, forming the digraphs.
In Unicode this would be done by encoding U+0674 *after* the
main letter. Thus:

U+0675 (ä) --> U+0627 (alef = "a") + U+0674 (high hamza)

But when you *render* that, the glyph for the hamza appears
at the upper right corner of the alef (not above it, and
not to its left, i.e. "after" it in rendering order).
This is most easily handled by mapping the sequence
<0627, 0674> to a preformed digraph glyph in your Arabic
font, rather than trying to swap glyphs on the fly.
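
The spelled-out forms of all four digraph characters can be checked
mechanically. What follows is only a sketch using Python's standard
unicodedata module. The helper name spelled_out() is hypothetical,
and the result depends on the Unicode data shipped with your Python.
It applies NFKC to each of U+0675..U+0678 and lists the resulting
code points in logical (stored) order.

```python
import unicodedata

# The four precomposed Kazakh digraph characters, U+0675..U+0678.
DIGRAPHS = ["\u0675", "\u0676", "\u0677", "\u0678"]


def spelled_out(ch):
    """Return the NFKC form of ch as a list of 'U+XXXX' strings."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFKC", ch)]


for ch in DIGRAPHS:
    parts = " + ".join(spelled_out(ch))
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {parts}")
```

In each case the base letter should come first and U+0674 second in
the stored sequence, regardless of where a font places the hamza
glyph when the text is displayed.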

> But, if the digraphs are not permitted in precomposed form, then
> I wonder if the difference in handling between Jawi and Kazakh
> is really problematic or, put differently, more problematic than
> the issues associated with rendering identical code point
> sequences correctly

The presupposition is incorrect. We aren't talking about
the same code point sequences for Jawi and for Kazakh.

> when the conventions for doing so are
> different for different languages.   The answer to that question
> depends on whether, e.g., HIGH HAMZA ALEF (the digraph form) and
> ALEF + HIGH HAMZA (the two-character form) can both occur in
> written or printed Kazakh and whether, if they do, they have
> different meanings/ interpretations (e.g., would a word
> containing the first be considered the same as a word containing
> the two-character sequence)?

My interpretation would be no. This is simply a typographical
issue for the font, regarding where exactly the hamza is
rendered. Hamza rendering, by the way, is tricky for all
languages using the Arabic script -- hence the discussions
about the need for a chairless hamza. But that would have
no bearing on the set of characters needed for IDN for
Kazakh, IMO.

Note, too, that for Kazakhstan itself, really only the Cyrillic
and Latin alphabets for Kazakh are in play, so these
issues regarding hamza are unlikely to concern the registry
for the kz ccTLD. The Arabic script orthography for Kazakh
is basically only official for Kazakhs in the PRC.

> 
> Note that this question is not addressed at all in the document
> "List of All Arabic Script Combining Marks", presumably because
> HIGH HAMZA is not, in what I understand to be the Unicode sense,
> a combining mark (it spaces but [sometimes?] forms digraphs,
> rather than being non-spacing).

Correct.

--Ken

> 
> Or have I misunderstood something here?
> 
>     john