"This case isn't the important one" (was Re: Visually confusable characters (8))

Mon Aug 11 21:28:50 CEST 2014

Andrew stated:

> To state it again, my concern -- my only concern -- is that this
> addition to Unicode appears to be a case where a precomposed
> character, which was previously possible to create (for some value of
> "possible" and "create") with a combining sequence, is added without
> NFC causing the new character and the previous combining sequence to
> match.  That behaviour is surprising to me given what I understood at
> the time we worked on and published IDNA2008. 

O.k., let me take that assertion at face value.

Let's go back to the Unicode 7.0.0 repertoire and actually *examine* it, instead of
freewheeling through random examples picked out of the air.

In addition to the Arabic U+08A1 that seems to have kicked off this
concern and the internet draft under discussion, we also have:

U+0529 CYRILLIC SMALL LETTER EN WITH LEFT HOOK

For Orok.  For some value of "possible" and some value of "create", that
was "the same" as the existing U+043D CYRILLIC SMALL LETTER EN +
U+0321 COMBINING PALATALIZED HOOK BELOW (which takes different
positions, depending on the base letter it attaches to). But for the
reasons Asmus cited regarding the longstanding and obvious Danish ø,
this pattern of introducing a "precomposed" character of this sort,
without an NFC-destabilizing canonical decomposition, is
established practice for these kinds of letters. It happens regularly.
Not often, but regularly. There tend to be handfuls for each major
release.

See also, from Unicode 7.0.0 new additions (for Latin):

U+A794 LATIN SMALL LETTER C WITH PALATAL HOOK
!= U+0063 + U+0321

U+A795 LATIN SMALL LETTER H WITH PALATAL HOOK
!= U+0068 + U+0321

U+A799 LATIN SMALL LETTER F WITH STROKE
!= U+0066 + U+0335 COMBINING SHORT STROKE OVERLAY

I am guessing that you have taken on board the Unicode formal non-equivalence
of these "precomposed" characters that have diacritics attached
or overlaying the base letters, even though *logically* these
diacritic modifications are, at another level, the "same" as the
base character plus the application of the combining diacritic
in question.
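
The formal non-equivalence of these pairs can be checked mechanically. What follows is only a minimal sketch, assuming Python's standard unicodedata module; the helper names (PAIRS, canonically_equivalent, report) are made up for illustration and are not part of any standard API. It compares each atomic letter with its lookalike combining sequence under NFD and prints the letter's decomposition mapping, which is empty for all three.

```python
import unicodedata

# (atomically encoded letter, visually similar combining sequence)
PAIRS = [
    ("\u0529", "\u043D\u0321"),  # Cyrillic en with left hook vs. en + U+0321
    ("\uA794", "\u0063\u0321"),  # c with palatal hook vs. c + U+0321
    ("\uA795", "\u0068\u0321"),  # h with palatal hook vs. h + U+0321
]


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent iff their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def report(pairs):
    """Return (code point, decomposition mapping, equivalent?) for each pair."""
    rows = []
    for atomic, sequence in pairs:
        rows.append((
            "U+%04X" % ord(atomic),
            unicodedata.decomposition(atomic),  # "" means no decomposition mapping
            canonically_equivalent(atomic, sequence),
        ))
    return rows


for row in report(PAIRS):
    print(row)
```

For contrast, canonically_equivalent("\u00E9", "e\u0301") returns True, because U+00E9 does carry a canonical decomposition.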

If so, then the argument devolves to nothing more than an assertion
that you cannot or will not accept that the situation for U+08A1
is similar. And for that case, it seems to boil down to the view
that the Hamza, because it floats over the letter and is not attached,
must somehow *be* equivalent because it looks the same.
But this follows mostly from the way Arabic is written: cursively,
with the ijam dotted in afterwards, a little like the way people write
Latin cursively, and then go back and add in dots on i's and strokes
on t's.

Would it have helped if the Fula had somehow created their letter
by ligating the Hamza onto the beh a bit, the way Cyrillic and Latin
alphabets tend to form their palatalized or retroflex consonant
innovations?

Would it help to note that the combining Hamza for *Arabic*,
representing a glottal stop over consonants read as long vowels
might end up being *rendered* somewhat differently in some
contexts than a fixed ijam over a beh required in *Fula* for
distinguishing two different consonants in the orthography?
To my mind, those concerns ought to lead to the recognition
that for U+08A1 the case for *non*-equivalence to any preexisting
sequence is even stronger than it is for letters like
the c-with-palatal-hook noted above, which simply follow
the cookbook recipe that Unicode atomically encodes base
letters with diacritics when the diacritics are attached or struck
through, even if there is no semantic distinction intended.
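
To make the Arabic case concrete, here is a small sketch, again assuming Python's unicodedata module. It also assumes that the "preexisting sequence" at issue is U+0628 ARABIC LETTER BEH followed by U+0654 ARABIC HAMZA ABOVE, which the text above does not spell out. For contrast it includes alef with hamza above, one of the older letters that does have a canonical decomposition. The constant names and the nfc helper are illustrative only.

```python
import unicodedata

ALEF_HAMZA = "\u0623"      # ARABIC LETTER ALEF WITH HAMZA ABOVE
ALEF_SEQ = "\u0627\u0654"  # ALEF + ARABIC HAMZA ABOVE
BEH_HAMZA = "\u08A1"       # ARABIC LETTER BEH WITH HAMZA ABOVE (new in Unicode 7.0)
BEH_SEQ = "\u0628\u0654"   # BEH + ARABIC HAMZA ABOVE (assumed sequence)


def nfc(s: str) -> str:
    """Normalize a string to NFC."""
    return unicodedata.normalize("NFC", s)


# The alef sequence composes under NFC, because U+0623 has a canonical decomposition.
print(nfc(ALEF_SEQ) == ALEF_HAMZA)  # True

# The beh sequence does not compose to U+08A1; the two stay distinct under NFC.
print(nfc(BEH_SEQ) == BEH_HAMZA)    # False
```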

In any case, at this point I find it surprising that anyone who had
been paying close attention to Unicode for the last 15 years or so
would find the regular (not common, not rare) introduction of these
kinds of letters into the standard to actually be surprising. Yes, the
particular case of the combining Hamza in Arabic is an interesting
and difficult edge case -- which is why it is documented at length
and very explicitly. But this is *not* something new to the standard,
nor is it contrary in any way to the established expectations
about how Unicode normalization should function.

> What is important at least for me now is to understand the extent to
> which this sort of thing happens, what our expectation ought to be in
> the future about its recurrence, and what implications that has for
> how we build network protocols atop Unicode.

It will recur. These kinds of situations are built into the structure
of a number of scripts -- most notably Latin, Cyrillic, and Arabic.
They should not be surprises.
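
For anyone who wants a rough feel for the extent of this in a given range, the following is a heuristic sketch only, assuming Python's unicodedata module. The function name atomic_with_letters is hypothetical, and the test it applies (a letter whose character name contains " WITH " but which has no decomposition mapping) is my own approximation, not a property defined by the standard. It will miss relevant letters whose names are formed differently, and its results depend on the Unicode version bundled with the Python in use.

```python
import unicodedata


def atomic_with_letters(start: int, end: int):
    """Yield (code point, name) for letters in [start, end] whose name contains
    ' WITH ' but which have no decomposition mapping at all."""
    for cp in range(start, end + 1):
        ch = chr(cp)
        name = unicodedata.name(ch, "")  # "" for unassigned or unnamed code points
        if " WITH " not in name:
            continue
        if not unicodedata.category(ch).startswith("L"):
            continue  # keep letters only
        if unicodedata.decomposition(ch) == "":
            yield cp, name


# Example: scan the accented letters of the Latin-1 Supplement block.
latin1 = dict(atomic_with_letters(0x00C0, 0x00FF))
for cp, name in sorted(latin1.items()):
    print("U+%04X %s" % (cp, name))
```

In that range the scan picks up the Danish ø (U+00F8) mentioned above, while a letter such as U+00E9, which does decompose, is skipped.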

--Ken

> 
> Best regards,
> 
> A
> 
> --
> Andrew Sullivan
> ajs at anvilwalrusden.com