Unicode & IETF

Whistler, Ken ken.whistler at sap.com
Tue Aug 12 00:48:12 CEST 2014


> The other irony about this is that, if you want consistent and
> easily predictable behavior, you should be asking for exactly
> what we thought we were promised -- no new precomposed
> characters under normal circumstances unless there were
> characters missing from the combining sequences that could
> otherwise be used to form them and, if any exception was made to
> add one anyway, it should decompose back to the relevant
> combining sequence.  

And the *other* other irony about this is that that is exactly
what you have gotten!

The relevant data file to watch for this is:


which carries all the information about the "funny cases" for
Unicode normalization.
The last time an exception of the type you are talking about
was made -- a new precomposed character being added, for which a
canonical combining decomposition had to be added simultaneously
to "undo" the encoding as a single code point, so that the NFC
form was decomposed, rather than composed -- was for
U+2ADC FORKING. That went into Unicode 3.2, in March *2002*,
12+ years ago.
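You can watch that exception at work with Python's standard `unicodedata` module (a minimal sketch; any Python 3 whose Unicode database is 3.2 or later will do):

```python
import unicodedata

# U+2ADC FORKING canonically decomposes to U+2ADD FORK + U+0338
# COMBINING LONG SOLIDUS OVERLAY, and it is composition-excluded,
# so even NFC leaves it in the decomposed form.
assert unicodedata.normalize('NFD', '\u2ADC') == '\u2ADD\u0338'
assert unicodedata.normalize('NFC', '\u2ADC') == '\u2ADD\u0338'
```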

The claim in the Unicode Standard for cases like U+08A1
beh-with-hamza and U+A794 c-with-palatal-hook is that these
are *ATOMIC* characters, and not actually precomposed characters.
Therefore they do not fall afoul of the stability issue for the
formal definition of Unicode normalization. They are encoded
because they are *not* equivalent to what might naturally be
taken as a pre-existing combining sequence.
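The non-equivalence is directly observable, again with `unicodedata` (a sketch that assumes a Python whose Unicode database is 7.0 or later, where U+08A1 is assigned):

```python
import unicodedata

# U+08A1 ARABIC LETTER BEH WITH HAMZA ABOVE is atomic: it has no
# canonical decomposition, so it is NOT equivalent to the sequence
# U+0628 BEH + U+0654 HAMZA ABOVE.
assert unicodedata.decomposition('\u08A1') == ''
seq = '\u0628\u0654'
assert unicodedata.normalize('NFC', seq) == seq   # the sequence stays a sequence
assert unicodedata.normalize('NFC', seq) != '\u08A1'
```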

This is an old, old, old, old issue for the Unicode Standard. The
line has to be drawn somewhere, and it is not at all self-evident,
contrary to what some people seem to think.

Cases like adding acute accents to letters clearly fall on one
side of the line.

Cases like making new letters by erasing parts of existing letters
or turning pieces around clearly fall on the other side of the
line. (Meditate, perhaps, on U+1D09 LATIN SMALL LETTER TURNED I.)

Cases which make new letters by adding ostensibly systematic
diacritics which then fuse unpredictably into the forms of the
letters are the middle ground, where the UTC had to place
a stake in the ground and decide whether to treat them all
as atomic units, despite their historic derivative complexity,
or to try to deal with them as ligated display units based on
encoded combining sequences.

And just to make things more complicated, the encoding for
Arabic was/is systematically distinct from that for Latin.
For Latin, the *default* is that if the diacritic is visually
separated from the base letter, then a precomposition is
presumed. However, for Arabic, because of the early history
of how Arabic was encoded on computers, the line is drawn differently:
any skeleton + ijam (diacritic) combination is encoded atomically,
and is not treated as a sequence. The known exceptions to
that principle are well documented and historically motivated.
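The contrast between the two defaults is visible in the character properties themselves (a sketch; U+00F6 and U+062E are just representative picks for each script):

```python
import unicodedata

# Latin default: o-with-diaeresis is precomposed and canonically
# decomposes to o + U+0308 COMBINING DIAERESIS.
assert unicodedata.decomposition('\u00F6') == '006F 0308'

# Arabic default: U+062E KHAH (a dotted skeleton) is atomic --
# no decomposition into skeleton + ijam at all.
assert unicodedata.decomposition('\u062E') == ''
```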

This is noticeably messy, of course, both because writing systems
are messy, and because the history of computer character
encoding itself is messy and full of hacks. However, two
points should stand out for the takeaway here:

1. It is not possible to have a completely internally consistent diacritic
encoding story for Latin (and scripts like it). Quit trying to
have that pony!

2. It is not possible to have a completely consistent model
for how diacritics are handled between Latin and Arabic.
Quit trying to have that pony, too!

What we have in Unicode instead is a perfectly serviceable donkey
that can move your cart to market.


More information about the Idna-update mailing list