"This case isn't the important one" (was Re: Visually confusable characters (8))

Whistler, Ken ken.whistler at sap.com
Mon Aug 11 21:44:34 CEST 2014


Andrew,

> In the case of ö and ö, it's a perfectly parallel argument, AFAICT.
> If you want to argue that U+08A1 is a completely different letter in
> an orthography to U+0628 with the decoration U+0654 on it, I'm
> prepared to concede that.  But then I want to know why the Swedish use
> of ö, which is a letter in the Swedish alphabet, is not
> differently-encoded to o-with-an-umlaut in German (which is not a
> different letter, but a letter with an accent on it).  If the Swedish
> ö and the German ö were encoded differently, then we would not have
> the problem that Patrik's last name can get misrendered by a
> German-centric but naïve transliterator as "Faeltstroem".  (We'd have
> a different problem, of course.)
> 
> This different encoding notion seems to be entailed by the argument
> for not normalizing U+08A1 to U+0628 plus U+0654.  What am I missing?

What you are missing is that Swedish versus German use of a letter
is 30 years of water under the bridge. It was decided for *encoding*
characters by, among other things, ISO 8859-1, which encoded *one*
character for both. Unicode *inherited* that decision in 1989.
The Unicode Standard has to live with the inherited encoding legacy,
and there is no point in re-litigating old decisions like that 25 years
after the fact.
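
As an illustration of that inherited decision, here is a minimal Python
sketch using only the standard unicodedata module. The variable names are
illustrative, not from any specification. It shows that the single
ISO 8859-1 byte 0xF6 maps to one code point, U+00F6, whatever language the
text is in, and that this code point is canonically equivalent to
o followed by a combining diaeresis.

```python
import unicodedata

# Byte 0xF6 in ISO 8859-1 is the one character shared by Swedish and German.
o_umlaut = b"\xf6".decode("latin-1")
print(f"U+{ord(o_umlaut):04X}", unicodedata.name(o_umlaut))

# NFD splits it into the base letter plus U+0308 COMBINING DIAERESIS.
decomposed = unicodedata.normalize("NFD", o_umlaut)
print([f"U+{ord(c):04X}" for c in decomposed])

# NFC recomposes the sequence into the same single code point.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == o_umlaut)
```

Nothing in the encoded text records whether the writer considered the
character a separate letter or an accented o.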

What you are also missing is that the status of characters as "letters"
in orthographies for one language or another is also beside the point
here.

U+08A1 was *NOT* encoded separately simply because it was a
letter of the Fula orthography.

U+08A1 was encoded separately because the Hamza above the beh
was functioning as an ijam *AND* because the principle had been
established in the Unicode Standard (very explicitly, I might add),
that new Arabic characters consisting of a skeleton plus ijam
are separately encoded. The combining Hamza above mark is
not to be used for these ijam extensions, any more than
U+065C ARABIC VOWEL SIGN DOT BELOW, the combining dot below
used in some voweling systems in Africa, is to be used to create
sequences of a skeleton letter plus an ijam dot below that look
identical to atomically encoded letters such as U+0751 ARABIC LETTER
BEH WITH DOT BELOW AND THREE DOTS ABOVE.
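
The practical consequence can be checked with a short Python sketch. It
assumes a Python build whose unicodedata tables include U+08A1 (Unicode 7.0
or later), and the helper name canonically_equivalent is hypothetical,
introduced here only for illustration. U+08A1 and U+0751 carry no canonical
decomposition, so normalization never maps them to a skeleton-plus-mark
sequence, unlike the ö case.

```python
import unicodedata

BEH_HAMZA_ATOMIC = "\u08A1"          # ARABIC LETTER BEH WITH HAMZA ABOVE
BEH_HAMZA_SEQUENCE = "\u0628\u0654"  # ARABIC LETTER BEH + ARABIC HAMZA ABOVE
BEH_DOT_BELOW_3_ABOVE = "\u0751"     # atomic letter with ijam dots

def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings by their NFD forms."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Neither atomic letter has a decomposition mapping (empty string).
print(repr(unicodedata.decomposition(BEH_HAMZA_ATOMIC)))
print(repr(unicodedata.decomposition(BEH_DOT_BELOW_3_ABOVE)))

# The atomic letter and the look-alike sequence are distinct under normalization.
print(canonically_equivalent(BEH_HAMZA_ATOMIC, BEH_HAMZA_SEQUENCE))

# NFC does not compose the sequence into U+08A1 either.
print(unicodedata.normalize("NFC", BEH_HAMZA_SEQUENCE) == BEH_HAMZA_SEQUENCE)

# Contrast: precomposed ö and o + combining diaeresis are equivalent.
print(canonically_equivalent("\u00F6", "o\u0308"))
```

Any matching of the two Arabic forms therefore has to happen in a layer
above normalization, if it happens at all.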

Is that any clearer?

--Ken

