"This case isn't the important one" (was Re: Visually confusable characters (8))

Mon Aug 11 21:28:20 CEST 2014

On Mon, Aug 11, 2014 at 11:36:26AM -0700, Asmus Freytag wrote:
> I am responding to Vint's message, because, for some reason, I do
> not receive Andrew's messages via the list.

I'm copying you so that you see this one.

> First, the very same case has been in place for ø in Danish (and Norwegian)
> which will look like the sequence o + combining /, but is not deemed
> identical to it.

Just so I understand the example, are you claiming that people using
Unicode ever used such a combining sequence, or are you merely
observing that you can use arbitrary combinations to gin up
similar-appearing things?  I don't think the latter observation is
relevant here.
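
For concreteness, here is a minimal sketch of what "not deemed
identical" means for the ø example. It assumes Python's standard
unicodedata module, and it assumes U+0338 COMBINING LONG SOLIDUS
OVERLAY is the "combining /" meant above; the helper name is mine,
for illustration only.

```python
import unicodedata

precomposed = "\u00f8"   # LATIN SMALL LETTER O WITH STROKE
sequence = "o\u0338"     # o + COMBINING LONG SOLIDUS OVERLAY (assumed to be the "combining /")

def canonically_equivalent(a, b):
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# The precomposed letter has no canonical decomposition, so normalization
# never maps one spelling onto the other.
print(unicodedata.decomposition(precomposed) == "")
print(canonically_equivalent(precomposed, sequence))
```

The two strings may render alike, but no normalization form treats
them as the same string.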

> As a result, Unicode has the principle of encoding all overlays
> as precomposed forms 

Yes, and nobody is objecting to that. 

> The case under consideration is rather similar. The combining
> hamza exists for a particular use case (Koran), but is otherwise
> not part of the orthography. 

This is a little like saying that in English, æ exists for a
particular use case (Latin-derived words and, in the US, a certain
foppishness about spelling) but is otherwise not part of the
orthography.  I don't see how the "otherwise" helps you.

> As I understand, the use of the
> combined form for a non-Arabic language is unrelated to
> applying  a "hamza" even though it uses the same squiggle.

Sure, I can buy that.

> What this has to do with two letters (whether 'a' and 'a' or
> ö and ö) being used in two different languages is a bit unclear
> to me, so I don't understand Andrew's question.

In the case of ö and ö, it's a perfectly parallel argument, AFAICT.
If you want to argue that U+08A1 is a completely different letter in
an orthography to U+0628 with the decoration U+0654 on it, I'm
prepared to concede that.  But then I want to know why the Swedish use
of ö, which is a letter in the Swedish alphabet, is not
differently-encoded to o-with-an-umlaut in German (which is not a
different letter, but a letter with an accent on it).  If the Swedish
ö and the German ö were encoded differently, then we would not have
the problem that Patrik's last name can get misrendered by a
German-centric but naïve transliterator as "Faeltstroem".  (We'd have
a different problem, of course.)  
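
A rough sketch of the kind of transliterator I mean, assuming
Python's standard unicodedata module. The mapping table and function
name are hypothetical, made up for illustration; the point is only
that the Swedish and German ö are one and the same code point, so the
code cannot tell them apart.

```python
import unicodedata

# Hypothetical German-centric mapping, for illustration only.
GERMAN_STYLE = {"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue", "\u00df": "ss"}

def naive_transliterate(text):
    """Apply German-style ASCII replacements with no notion of language."""
    # NFC first, so o + U+0308 and precomposed U+00F6 are treated alike.
    text = unicodedata.normalize("NFC", text)
    return "".join(GERMAN_STYLE.get(ch, ch) for ch in text)

swedish_name = "F\u00e4ltstr\u00f6m"
print(naive_transliterate(swedish_name))
```

Nothing in the encoded string says whether ö is a separate letter or
an accented o, so the German convention gets applied regardless.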

This different encoding notion seems to be entailed by the argument
for not normalizing U+08A1 to U+0628 plus U+0654.  What am I missing?
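
To make the asymmetry concrete, here is a small sketch comparing the
two cases under all four normalization forms. It assumes Python's
standard unicodedata module backed by Unicode 7.0 or later (the
version that added U+08A1); the helper name is mine.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def equal_under_all_forms(a, b):
    """Hypothetical helper: True if a and b normalize identically in every form."""
    return all(unicodedata.normalize(f, a) == unicodedata.normalize(f, b)
               for f in FORMS)

def equal_under_any_form(a, b):
    """Hypothetical helper: True if a and b normalize identically in some form."""
    return any(unicodedata.normalize(f, a) == unicodedata.normalize(f, b)
               for f in FORMS)

o_precomposed = "\u00f6"      # LATIN SMALL LETTER O WITH DIAERESIS
o_sequence = "o\u0308"        # o + COMBINING DIAERESIS
beh_hamza = "\u08a1"          # ARABIC LETTER BEH WITH HAMZA ABOVE
beh_sequence = "\u0628\u0654" # BEH + ARABIC HAMZA ABOVE

print(equal_under_all_forms(o_precomposed, o_sequence))
print(equal_under_any_form(beh_hamza, beh_sequence))
```

The ö pair collapses to one string under every form; the U+08A1 pair
does not collapse under any of them.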

> For obvious reasons, this "thing" tends to happen for minority
> languages, not to say "obscure" ones, if only for the simple
> reason that the common, well-known, and prominent ones
> are all known and accounted for

Well, yes, but this "thing" also means that the minority or obscure
cases have existing work-arounds.  The older cases don't have a
problem, because practice there doesn't change depending on which
version of the Unicode library you happen to have.  So to say that
these are rare cases in fact makes the argument that there is an
issue for protocols _stronger_, not weaker.

Best regards,

A

-- 
Andrew Sullivan
ajs at anvilwalrusden.com