Unicode 7.0.0, (combining) Hamza Above, and normalization

Whistler, Ken ken.whistler at sap.com
Fri Aug 8 00:56:51 CEST 2014


With respect, let me stop you right there.

My argument mentions linguistic grounds, but the linguistic grounds were
originally relevant to the decision process in the UTC a couple years ago
regarding the encoding of U+08A1 beh-with-hamza as a separate character.
The linguistic grounds are now basically irrelevant to the *current*
discussion. My assertion is that U+08A1 beh-with-hamza is *NOT*
the same as the sequence beh + combining Hamza. And that assertion
can be derived from the decisions and the data published by the UTC
about the encoding. I don’t need to know that U+08A1 is used for Fula
or what sound is involved to be absolutely certain about the identity
issue here.

The same applies to the ghain versus ain + combining dot sequence
I cited. I don’t have to know anything about Arabic to be quite confident
in that claim about *encoding* identity or non-identity, regardless of
whether I “see the same thing” when looking at a printed or screen
rendering of them.
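
The identity claim can be checked mechanically against the published
normalization data. What follows is only a minimal sketch, assuming Python's
standard unicodedata module (whose data tracks whatever Unicode version the
interpreter ships with). The helper name canonically_equivalent and the
constant names are made up for illustration, and the code point chosen for
"combining dot" (U+0307) and the alef pair used as a contrast case are
assumptions of the sketch, not something stated in this message.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Hypothetical names for the sequences under discussion.
BEH_WITH_HAMZA = "\u08A1"        # ARABIC LETTER BEH WITH HAMZA ABOVE
BEH_PLUS_HAMZA = "\u0628\u0654"  # BEH + ARABIC HAMZA ABOVE (combining)

GHAIN = "\u063A"                 # ARABIC LETTER GHAIN
AIN_PLUS_DOT = "\u0639\u0307"    # AIN + COMBINING DOT ABOVE (assumed mark)

# Contrast case: a letter that does carry a canonical decomposition.
ALEF_WITH_HAMZA = "\u0623"       # ARABIC LETTER ALEF WITH HAMZA ABOVE
ALEF_PLUS_HAMZA = "\u0627\u0654" # ALEF + ARABIC HAMZA ABOVE (combining)

if __name__ == "__main__":
    pairs = [
        ("beh-with-hamza vs beh + hamza", BEH_WITH_HAMZA, BEH_PLUS_HAMZA),
        ("ghain vs ain + dot", GHAIN, AIN_PLUS_DOT),
        ("alef-with-hamza vs alef + hamza", ALEF_WITH_HAMZA, ALEF_PLUS_HAMZA),
    ]
    for label, single, sequence in pairs:
        print(label, canonically_equivalent(single, sequence))
```

Only the alef pair should compare as equivalent, because only U+0623 has a
canonical decomposition in the character data. The other two pairs stay
distinct under every normalization form, however similar they look when
rendered.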

All of this discussion seems to be boiling down to IETF second-guessing
of Unicode character encoding decisions and complaints about Unicode
normalization not satisfying expectations based on rather simplistic
notions of which things that look the same should *be* the same.

In this case, even if there were any marginal improvement to IDNA
that would result from disallowing U+08A1 (which I do not stipulate,
by the way), it is clear that making exceptions in the table derivation
for IDNA because of a one-off quibble about encoding decisions
made by the UTC and normalization just *increases* the overall
complexity and level of confusion about application of the protocol.

Not good.


your argument seems to be based on linguistic grounds and this I can accept but …

