IAB Statement on Identifiers and Unicode 7.0.0

Wed Jan 28 03:29:36 CET 2015

On Tue, Jan 27, 2015 at 09:22:31PM -0500, Andrew Sullivan wrote:
> rules about what things you can use by category.  There's no way to
> tell whether you have a two-codepoint composition that renders BEH
> with a HAMZA ABOVE or whether you have a single codepoint BEH WITH
> HAMZA ABOVE (cf. the other _hamza_ cases), _and_ there's no way in
> principle to write any software that could possibly detect that you
> might have an issue here, at least without carrying around a big
> exception table.

Apologies for self-responding, but to be clear: my point is that the
linguistic basis for the abstract character (which TUS necessarily
implies is different between these two cases) cannot be derived from
the way the characters appear in a snippet of coded character set
data.  Obviously, you can tell whether you have two code
points or one.  What you can't tell is whether you _should_ have two
code points or one.  Compare that with (say) cross-script homoglyphs
even where the raster image is always exactly the same: you still have
the script property.
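
To make the contrast concrete, here is a rough sketch in Python using
only the standard unicodedata module.  It assumes a Python build whose
Unicode database is at least 7.0.0 (so that U+08A1 is known).  The
helper names same_after_nfc and script_guess are mine, purely for
illustration, and script_guess is only a crude stand-in for the real
Script property, which unicodedata does not expose: it just takes the
first word of the character name.

```python
import unicodedata

# The case under discussion: one code point versus base letter plus combining mark.
BEH_HAMZA_SEQ = "\u0628\u0654"   # ARABIC LETTER BEH + ARABIC HAMZA ABOVE
BEH_HAMZA_ONE = "\u08A1"         # ARABIC LETTER BEH WITH HAMZA ABOVE (new in 7.0.0)

# One of the "other hamza cases", where a canonical decomposition exists.
ALEF_HAMZA_SEQ = "\u0627\u0654"  # ARABIC LETTER ALEF + ARABIC HAMZA ABOVE
ALEF_HAMZA_ONE = "\u0623"        # ARABIC LETTER ALEF WITH HAMZA ABOVE


def same_after_nfc(a: str, b: str) -> bool:
    """Hypothetical helper: do two strings match once both are in NFC?"""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def script_guess(ch: str) -> str:
    """Crude proxy for the Script property (an assumption, not the real
    property): the first word of the character name."""
    return unicodedata.name(ch).split()[0]


if __name__ == "__main__":
    # Counting code points is trivial; that was never the problem.
    print(len(BEH_HAMZA_SEQ), len(BEH_HAMZA_ONE))

    # Normalization unifies the ALEF pair but leaves the BEH pair distinct.
    print(same_after_nfc(ALEF_HAMZA_SEQ, ALEF_HAMZA_ONE))
    print(same_after_nfc(BEH_HAMZA_SEQ, BEH_HAMZA_ONE))

    # Cross-script homoglyphs still differ in script ...
    print(script_guess("a"), script_guess("\u0430"))
    # ... whereas both BEH forms sit in the same script.
    print(script_guess("\u0628"), script_guess(BEH_HAMZA_ONE))
```

Nothing available to the code, whether code point count, normalization
form, or script, says which of the two BEH spellings a given string
_should_ have used.  That leaves only a hand-maintained exception
table.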

I'm sorry to go on about this, but I'm trying to be quite clear about
what is different in this case.

Best regards,

A

-- 
Andrew Sullivan
ajs at anvilwalrusden.com