Mixing scripts (Re: Unicode versions (Re: Criteria for exceptional characters))

Michael Everson everson at evertype.com
Sun Dec 24 20:18:35 CET 2006


At 14:06 -0500 2006-12-24, John C Klensin wrote:

>This would seem reasonable, except for the 
>number of times we have been told that block 
>structure, and the ordering of characters within 
>a block, have nothing to do with collation 
>sequences.

No, John, I am not talking about binary sorting. 
I'm talking about one letter that would have to 
be in two places at the same time.

Consider a list of personal names in Kurdish. 
Some in Latin, some in Cyrillic. You want to sort 
them. It's a single language. There is only one W 
available. How will you sort the names beginning 
with W? They will all interfile, Latin and 
Cyrillic, at Latin W.
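
The interfiling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a real collation algorithm: the toy key orders Latin before Cyrillic and then compares case-folded letters, the names and their spellings are hypothetical, and `script_rank` and `toy_key` are made-up helpers. The second list assumes a dedicated CYRILLIC WE; U+051C/U+051D, which were encoded later in Unicode 5.1, are used here to show the contrast.

```python
import unicodedata

def script_rank(ch):
    """Toy script order: Latin first, then Cyrillic, then everything else."""
    name = unicodedata.name(ch, "")
    if name.startswith("LATIN"):
        return 0
    if name.startswith("CYRILLIC"):
        return 1
    return 2

def toy_key(s):
    # Compare letter by letter: script first, then the case-folded letter.
    return [(script_rank(ch), ch.casefold()) for ch in s]

LATIN_NAMES = ["Welat", "Azad", "Xezal"]

# Cyrillic-orthography names; the spellings are illustrative only.
CYR_AZAD = "\u0410\u0437\u0430\u0434"   # Cyrillic letters A, ZE, A, DE
CYR_TAIL = "\u0435\u043b\u0430\u0442"   # Cyrillic letters IE, EL, A, TE

borrowed_w = "W" + CYR_TAIL             # Latin W (U+0057) followed by Cyrillic letters
cyrillic_we = "\u051c" + CYR_TAIL       # CYRILLIC CAPITAL LETTER WE followed by Cyrillic letters

with_borrowed = sorted(LATIN_NAMES + [CYR_AZAD, borrowed_w], key=toy_key)
with_we = sorted(LATIN_NAMES + [CYR_AZAD, cyrillic_we], key=toy_key)

print(with_borrowed)
print(with_we)
```

With the borrowed Latin W, the Cyrillic-orthography name lands between the Latin names at W; with a Cyrillic letter of its own, it stays with the other Cyrillic names.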

The environment here is plain text: file names in 
a directory. No language tagging. No ISO 15924 
script tagging. No fancy XML.

There are scientific linguistic environments 
where mixing of Latin, Greek, and Cyrillic is 
expected, but this is a standard orthography of a 
natural language -- and not just a language that 
might mix letters, but a language with more than 
one official orthography where the lack of 
CYRILLIC WE is an actual problem.

>  > No, never, because of the functional requirements. One could
>  > not expect <o> to sort in three different places in a
>  > multilingual glossary (Russian, English, Greek).
>
>See above about collation.  And note that, even 
>within the fairly basic set of decorated Latin 
>characters, logical sort order is a localization 
>(language at least) issue, not one that Unicode 
>can possibly address properly.

Not so, really. Collation is handled very well, 
and tailoring too. This is rather different, it 
seems to me.

>To the extent to which I understand this, I 
>agree with you.  My only points are (i) that 
>some views of consistency are becoming the 
>victim of this particular set of requirements 
>and (ii) one net effect is to introduce more 
>cross-script confusables.

Well, disadvantaging Kurds who use Cyrillic Aa Ee 
Oo Öö Schwa Qq by denying them Ww doesn't seem 
like the right thing to do, which is why I'm 
proposing to add some characters to the UCS. 
After all, the UCS is for A GREAT MANY MORE 
THINGS than IDN.

>  > I understand that a script-ban will not be deeply embedded.
>
>And this is part of the reason why.

Glad Yule to all.
-- 
Michael Everson * http://www.evertype.com


More information about the Idna-update mailing list