"This case isn't the important one" (was Re: Visually confusable characters (8))

Kent Karlsson kent.karlsson14 at telia.com
Tue Aug 12 01:35:48 CEST 2014


Not sure I should get into this hot air... But...

Well, concentrating on just "with hamza above", and trivially just
"grepping" UnicodeData.txt for that (and skipping those in the Fxxx
range), we find:

1) those with canonical decomposition
0623;ARABIC LETTER ALEF WITH HAMZA ABOVE;Lo;0;AL;0627 0654;;;;N;ARABIC LETTER HAMZAH ON ALEF;;;;
0624;ARABIC LETTER WAW WITH HAMZA ABOVE;Lo;0;AL;0648 0654;;;;N;ARABIC LETTER HAMZAH ON WAW;;;;
0626;ARABIC LETTER YEH WITH HAMZA ABOVE;Lo;0;AL;064A 0654;;;;N;ARABIC LETTER HAMZAH ON YA;;;;
06C2;ARABIC LETTER HEH GOAL WITH HAMZA ABOVE;Lo;0;AL;06C1 0654;;;;N;ARABIC LETTER HAMZAH ON HA GOAL;;;;
06D3;ARABIC LETTER YEH BARREE WITH HAMZA ABOVE;Lo;0;AL;06D2 0654;;;;N;ARABIC LETTER HAMZAH ON YA BARREE;;;;

2) those with compatibility decomposition
0677;ARABIC LETTER U WITH HAMZA ABOVE;Lo;0;AL;<compat> 06C7 0674;;;;N;ARABIC LETTER HIGH HAMZAH WAW WITH DAMMAH;;;;

3) those without decomposition
0681;ARABIC LETTER HAH WITH HAMZA ABOVE;Lo;0;AL;;;;;N;ARABIC LETTER HAMZAH ON HAA;;;;
076C;ARABIC LETTER REH WITH HAMZA ABOVE;Lo;0;AL;;;;;N;;;;;
08A1;ARABIC LETTER BEH WITH HAMZA ABOVE;Lo;0;AL;;;;;N;;;;;
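
The three-way split above can be reproduced without grepping. The following is only a sketch, assuming Python's standard unicodedata module (whose data tracks whatever Unicode version the interpreter ships with); the helper name classify and the list HAMZA_ABOVE are made up for illustration, and the code points are simply the nine quoted above.

```python
import unicodedata

# The nine code points quoted from UnicodeData.txt above.
HAMZA_ABOVE = [0x0623, 0x0624, 0x0626, 0x06C2, 0x06D3,
               0x0677, 0x0681, 0x076C, 0x08A1]


def classify(cp):
    """Return 'canonical', 'compatibility' or 'none' for a code point.

    Uses the decomposition field as exposed by the standard library:
    an empty string means no decomposition, a leading '<tag>' marks a
    compatibility decomposition, anything else is canonical.
    """
    decomp = unicodedata.decomposition(chr(cp))
    if not decomp:
        return "none"
    if decomp.startswith("<"):
        return "compatibility"
    return "canonical"


if __name__ == "__main__":
    for cp in HAMZA_ABOVE:
        # Fall back to a placeholder if the interpreter's data is too old
        # to know the character's name.
        name = unicodedata.name(chr(cp), "<unnamed in this Unicode version>")
        print(f"{cp:04X} {classify(cp):13} {name}")
```

Note that an interpreter with pre-7.0 Unicode data would report U+08A1 as having no decomposition simply because it does not know the character at all.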

Naïvely, this looks a bit hit-and-miss, and I don't know the reasons
behind it (see Ken's messages for a partial explanation).

However, singling out one of these to "DISALLOW" in IDNA2008 (or is it
IDNA2010) seems to be even more of a miss.

As Roozbeh, Asmus, Mark and Ken W. have pointed out, handling one
(possible) "confusables" case (which is not compatibility equivalent) in
a very different manner from other "confusables" (that are not
compatibility equivalent) seems to be a very bad idea, for various
reasons.

Now, should BEH WITH HAMZA ABOVE have been encoded? Maybe not, but that
is irrelevant now. Should REH WITH HAMZA ABOVE and HAH WITH HAMZA ABOVE
have been given canonical (or compatibility) decompositions (or not been
encoded)? Maybe. Or maybe there are valid reasons for having things as
they are, and it is just that the names are too confusing... I'm not sure
it is worthwhile diving deep into the history of just these few
characters (though Ken is doing that, and is most welcome to) to find out
(unless you are deeply interested in the Arabic script, of course); there
are many other cases of non-equivalent confusables.

And as Ken and Roozbeh pointed out with examples (and Mark without giving
examples in the emails), there are many other cases that are less
"obvious" (from reading the names only) of very similar but not
(compatibility) equivalent letters. Are you planning on DISALLOWing them
too? Big can of worms... Not that the cases should not be dealt with; of
course they should. See http://www.unicode.org/reports/tr39/.
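
To make "not (compatibility) equivalent" concrete, here is a small sketch, again assuming Python's standard unicodedata module; the helper same_after and the constant names are hypothetical. It only checks normalization equivalence and is not an implementation of the confusable detection described in UTS #39.

```python
import unicodedata


def same_after(form, a, b):
    """True if strings a and b are identical after normalizing to `form`."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


ALEF_HAMZA = "\u0623"             # has canonical decomposition 0627 0654
ALEF_PLUS_HAMZA = "\u0627\u0654"
U_HAMZA = "\u0677"                # has <compat> decomposition 06C7 0674
U_PLUS_HIGH_HAMZA = "\u06C7\u0674"
BEH_HAMZA = "\u08A1"              # no decomposition at all
BEH_PLUS_HAMZA = "\u0628\u0654"

if __name__ == "__main__":
    pairs = [
        ("ALEF", ALEF_HAMZA, ALEF_PLUS_HAMZA),
        ("U", U_HAMZA, U_PLUS_HIGH_HAMZA),
        ("BEH", BEH_HAMZA, BEH_PLUS_HAMZA),
    ]
    for label, single, sequence in pairs:
        print(label,
              "NFC-equal:", same_after("NFC", single, sequence),
              "NFKC-equal:", same_after("NFKC", single, sequence))
```

Under these assumptions normalization folds the ALEF pair together (and the U pair under NFKC only), but leaves the BEH pair distinct, which is the situation that any confusables handling has to cover separately.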

----------

On a slightly different point, in the Latin script: Andrew wrote
"in the case of (e.g.) ö (in Swedish) and o-umlaut (in German).
They're clearly different letters linguistically too."

How? They are pronounced the "same" in Swedish and German (except for
differences only a dialect expert/linguist might notice). AFAIK, they
also have the same history: "oe" turning into œ, turning into oͤ (o with e
above), turning into either ö or ø. Maybe you intended to contrast with
French or Dutch, where "two dots above" is used for something else (as a
mark for separate pronunciation as opposed to a diphthong, French using œ
for what is written ö in Swedish and German). Despite THAT difference, I
would still say that the "two" ö (French/Dutch/... vs.
Swedish/German/...) are still the same *character*, just different
orthographic uses. But Swedish and German do collate ö differently...

And at some level, œ, oͤ, ö, ø (and even o with ogonek) are the same
letter (for Danish/Norwegian/Swedish/German/...), even though they do not
look all that much alike.

Nit: "Faeltroem" is a major typo in German as well, even though that
*fallback* seems to be more common in German than in Swedish (where it
was used, huh, back in "pure ASCII" times, or is used when some people
have a keyboard without the "local" letters).

/Kent K

