Combinations with HIGH HAMZA (U+0674) (was: Re: Re: ASIWG feedback on IDNA200X for Arabic Script)

Thu Jul 17 22:50:34 CEST 2008

--On Saturday, 12 July, 2008 13:30 -0400 Eric Brunner-Williams
<ebw at abenaki.wabanaki.net> wrote:

> Agree. If HIGH HAMZA (0674) is combining in Kazakh ("forms
> digraphs", a textual claim, author unknown) and is a character
> in Jawi (informed user, national standard, etc), that is, is
> both A and not A, which is the controlling property?

Just my opinion, but...

Ultimately, if it is required for some languages, then we will
almost certainly need to permit it and hope that any issues can
be dealt with at the registry level for registries (zone
administrators) whose focus is on scripts that don't need it.
Of course, if the relevant folks considering IDN use in the
languages that require these characters decide they don't need
them in domain names, the problem goes away.

There is actually a separate issue on which I'm not competent to
comment, but Mark, Ken, Michael or others probably can and
should.  I'm going to try to state this exactly, but apologize
in advance if I get it wrong.   

Suppose there is a glyph that is identical in appearance (and
maybe even name) to U+0674, and

	--  that glyph is used as a stand-alone character in
	some contexts (e.g., in Jawi given the discussions Eric
	cites), but

	--  is also used as a combining character (e.g., in
	Kazakh, again given the discussions Eric cites)

should it be assigned (or should it have been assigned) two
separate code points, one combining (or digraph-forming, which
is not quite the same thing in what I understand to be the
Unicode vocabulary) and one not?

Other questions that might complicate this one:

(1) Are the relevant digraphs for Kazakh (and anywhere else High
Hamza is digraph-forming) those that follow it in the Unicode
table, i.e.,
  U+0675 ARABIC LETTER HIGH HAMZA ALEF
  U+0676 ARABIC LETTER HIGH HAMZA WAW
  U+0677 ARABIC LETTER U WITH HAMZA ABOVE
  U+0678 ARABIC LETTER HIGH HAMZA YEH
or are there others?

(2) The four characters listed above (U+0675 - U+0678) have
compatibility decompositions (applied by NFKC) to a base
character plus U+0674. That implies that, with IDNA2003 and
nameprep, they cannot appear in U-labels (strings
back-translated from valid A-labels) -- only the decomposed
forms can appear.  Tables (both 01 and 02) disallow these
characters, effectively creating the same situation as occurs in
IDNA2003.  I think that is what we want, but someone more
familiar with Kazakh than I am should confirm this.
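
For anyone who wants to check the decomposition claim in (2), the
following is a rough sketch using Python's standard unicodedata
module.  It assumes the Unicode data bundled with the local Python
build matches the published tables for these code points; the
helper names (compat_decomposition, nfkc_changes) are made up for
this illustration and are not part of any IDNA specification.

```python
import unicodedata

# The four precomposed "high hamza" letters listed in question (1).
DIGRAPHS = ["\u0675", "\u0676", "\u0677", "\u0678"]
HIGH_HAMZA = "\u0674"


def compat_decomposition(ch):
    """Split the decomposition field into (tag, [characters]).

    The tag is "<compat>" for a compatibility decomposition and ""
    when the mapping is canonical or absent.
    """
    fields = unicodedata.decomposition(ch).split()
    tag = fields[0] if fields and fields[0].startswith("<") else ""
    chars = [chr(int(f, 16)) for f in fields if not f.startswith("<")]
    return tag, chars


def nfkc_changes(ch):
    """True if NFKC maps the character to something else."""
    return unicodedata.normalize("NFKC", ch) != ch


if __name__ == "__main__":
    for ch in DIGRAPHS:
        tag, chars = compat_decomposition(ch)
        parts = " ".join("U+%04X" % ord(c) for c in chars)
        print("U+%04X %s %s -> %s (changed by NFKC: %s)" % (
            ord(ch), unicodedata.name(ch), tag, parts, nfkc_changes(ch)))
```

If the data is as described, each line of output should show a
two-character compatibility mapping ending in U+0674, which is
why a nameprep-style NFKC step never leaves the precomposed form
in place.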

But, if the digraphs are not permitted in precomposed form, then
I wonder if the difference in handling between Jawi and Kazakh
is really problematic or, put differently, more problematic than
the issues associated with rendering identical code point
sequences correctly when the conventions for doing so are
different for different languages.   The answer to that question
depends on whether, e.g., HIGH HAMZA ALEF (the digraph form) and
ALEF + HIGH HAMZA (the two-character form) can both occur in
written or printed Kazakh and whether, if they do, they have
different meanings/interpretations (e.g., would a word
containing the first be considered the same as a word containing
the two-character sequence)?

Note that this question is not addressed at all in the document
"List of All Arabic Script Combining Marks", presumably because
HIGH HAMZA is not, in what I understand to be the Unicode sense,
a combining mark (it spaces but [sometimes?] forms digraphs,
rather than being non-spacing).
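
The "not a combining mark" reading can be checked against the
character properties themselves.  This is only a sketch, again
assuming Python's bundled unicodedata tables; the comparison with
U+0654 ARABIC HAMZA ABOVE is my own choice of contrast and is not
taken from the document cited above, and describe() is a
hypothetical helper.

```python
import unicodedata


def describe(ch):
    """Collect the properties that decide whether a character combines."""
    return {
        "name": unicodedata.name(ch),
        # General category: "Lo" is a letter, "Mn" a non-spacing mark.
        "category": unicodedata.category(ch),
        # Canonical combining class: 0 means it does not reorder or
        # attach as a combining mark during normalization.
        "combining_class": unicodedata.combining(ch),
    }


high_hamza = describe("\u0674")   # the spacing letter under discussion
hamza_above = describe("\u0654")  # a true non-spacing mark, for contrast

if __name__ == "__main__":
    for info in (high_hamza, hamza_above):
        print(info)
```

A category of Lo with combining class 0 for U+0674 would be
consistent with its absence from a list of combining marks, even
though it forms digraphs in some orthographies.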

Or have I misunderstood something here?

    john