Mixing scripts (Re: Unicode versions (Re: Criteria for exceptional characters))

Tue Dec 26 02:54:22 CET 2006

At 19:59 06/12/25, Michael Everson wrote:
>At 13:46 +0900 2006-12-25, Martin Duerst wrote:

>>I consider this a bad example. There is quite some chance that people would prefer a mixed list, so that they don't have to look up a name in two places if they don't know if it's written in Cyrillic or Latin.
>
>I don't think you have thought this through. Russian Kurds consider Aa Ee Oo Schwa Qq Ww to be Cyrillic letters, and the behaviour they will get in a monolingual multiscript sort will be that Latin Aa Ee Oo Schwa Qq Ww will sort as Latin, Cyrillic Aa Ee Oo Schwa Qq will sort as Cyrillic, but as there is not a Cyrillic Ww yet all those names will be interfiled with Latin. There is NO chance that people would prefer a mixed list for that letter.

I don't think you understood what I meant. What I meant was the
following: Assuming that in Kurdish orthography, e.g. Latin 'A'
corresponds to Cyrillic 'A', Latin 'R' corresponds to Cyrillic 'P',
Latin 'S' corresponds to Cyrillic 'C', and so on, the mixed sorting
that I'm proposing is to sort all Latin 'A's with all Cyrillic 'A's,
all Latin 'R's with Cyrillic 'P's, all Latin 'S's with Cyrillic 'C's,
and so on. That way, a user can go directly from pronunciation to
an entry in the list without having to look in two places (one in
the Latin part of the list and one in the Cyrillic part of the list).

In terms of the sorting algorithm, the script difference is
relegated to a lower level, similar to, e.g., a case difference
(the standard example for this would be Japanese Katakana and
Hiragana). To what extent this is feasible depends on how clear
and stable the correspondence between the letters or letter
combinations in the two scripts is. I do not know the situation
with respect to Kurdish.
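
As a rough illustration of such a two-level comparison, here is a
minimal Python sketch. The correspondence table is hypothetical and
covers only the three letter pairs mentioned above; it is not actual
Kurdish collation data, and the names LETTERS and sort_key are made
up for the example.

```python
# Hypothetical correspondence table (an assumption, not real Kurdish data).
# Each letter maps to (primary rank shared across scripts, script tag).
LATIN, CYRILLIC = 0, 1
LETTERS = {
    "a": (0, LATIN), "\u0430": (0, CYRILLIC),  # Latin a / Cyrillic а
    "r": (1, LATIN), "\u0440": (1, CYRILLIC),  # Latin r / Cyrillic р
    "s": (2, LATIN), "\u0441": (2, CYRILLIC),  # Latin s / Cyrillic с
}

def sort_key(word):
    """Two-level key: all primary ranks first, then all script tags."""
    pairs = [LETTERS[ch] for ch in word.lower()]
    primary = tuple(rank for rank, _ in pairs)
    secondary = tuple(script for _, script in pairs)
    return (primary, secondary)

# "sar", "ара", "ara", "сар", "ras"
names = ["sar", "\u0430\u0440\u0430", "ara", "\u0441\u0430\u0440", "ras"]
mixed = sorted(names, key=sort_key)
```

With this key, names that differ only in script end up next to each
other, and the script only breaks ties. Putting Latin before Cyrillic
in a tie is an arbitrary choice here.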

>>But more fundamentally, if there are one or two 'foreign script'
>>letters in context (i.e. you don't have any Kurds named "Mr. W"),
>>the actual sorting is still possible: Just check all the Ws, and
>>the characters around them, and change the Ws surrounded by
>>Cyrillic characters to some internally assigned "Cyrillic W" code.
>
>Well, that's a hack, now, isn't it?

Well, yes, you can call it that.
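
For concreteness, the recoding step could be sketched roughly as
follows in Python. The private-use code points U+E000 and U+E001 and
the function names are assumptions made for this example only; an
actual implementation could pick any internal code.

```python
# Hypothetical internal codes for a "Cyrillic W" (private-use area).
CYR_W_UPPER = "\ue000"
CYR_W_LOWER = "\ue001"

def is_cyrillic(ch):
    """True if ch is a single character in the basic Cyrillic block."""
    return len(ch) == 1 and "\u0400" <= ch <= "\u04ff"

def recode_w(text):
    """Replace each W/w that has a Cyrillic neighbour by the internal code."""
    out = []
    for i, ch in enumerate(text):
        if ch in "Ww":
            prev = text[i - 1] if i > 0 else ""
            nxt = text[i + 1] if i + 1 < len(text) else ""
            if is_cyrillic(prev) or is_cyrillic(nxt):
                ch = CYR_W_UPPER if ch == "W" else CYR_W_LOWER
        out.append(ch)
    return "".join(out)
```

A W with no Cyrillic neighbour (the "Mr. W" case) is left as Latin,
which is exactly the limitation mentioned above.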

>I wonder just how many companies are going to want to build that kind of thing into their OS.

I don't think you need a special hack just for this. It can
be handled as part of standard collation configuration, listing
all the Cyrillic + 'W' and 'W' + Cyrillic combinations as
digraphs. At least for 'W' + Cyrillic, that will work easily.
For Cyrillic + 'W', it works with relatively little effort if
'W' is at (or close to) one or the other end of the Cyrillic
alphabet. My guess is that it's very close to the end, but
I might be wrong.
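
To make the digraph idea a bit more concrete, here is a toy sketch in
Python. It assumes (this is only my guess from above) that 'W' sorts
after all Cyrillic letters, uses a made-up three-letter Cyrillic
alphabet, and puts Latin in a separate block before Cyrillic. The
table layout and the function names are invented for the example and
do not correspond to any particular collation library.

```python
CYR = "\u0430\u0431\u0432"            # а, б, в: toy three-letter alphabet
LATIN = "abcdefghijklmnopqrstuvwxyz"

def build_table():
    """Map single letters and W digraphs to (block, rank, sub-rank)."""
    table = {}
    for i, ch in enumerate(LATIN):
        table[ch] = (0, i, 0)              # Latin block, including plain w
    for r, c in enumerate(CYR):
        table[c] = (1, r, 0)               # plain Cyrillic letter
        table[c + "w"] = (1, r, 1)         # letter + w: right after the letter
        table["w" + c] = (1, len(CYR), r)  # w + letter: after the alphabet
    return table

def collation_key(word, table):
    """Tokenize left to right, preferring a two-character digraph."""
    word = word.lower()
    key, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in table:
            key.append(table[pair])
            i += 2
        else:
            key.append(table[word[i]])
            i += 1
    return key

TABLE = build_table()
# "wа", "в", "аw", "ав", "б", "wolf"
words = ["w\u0430", "\u0432", "\u0430w", "\u0430\u0432", "\u0431", "wolf"]
ordered = sorted(words, key=lambda w: collation_key(w, TABLE))
```

Because 'W' is assumed to be last, the letter + 'W' digraph can simply
be slotted in right after the letter itself. If 'W' were somewhere in
the middle of the alphabet, that shortcut would no longer give the
right order.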

Regards,    Martin.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst at it.aoyama.ac.jp