Mixing scripts (Re: Unicode versions (Re: Criteria for exceptional characters))

Sun Dec 24 20:06:13 CET 2006

--On Sunday, 24 December, 2006 18:14 +0000 Michael Everson
<everson at evertype.com> wrote:

> At 12:06 -0500 2006-12-24, John C Klensin wrote:
> 
>>> It is Kurdish, and the two letters are for other functional
>>> reasons being proposed for addition to the standard.
>> 
>> As part of my continued effort to understand which rules apply
>> and when, adding them to the standard, and identifying them as
>> "Cyrillic" would seem to violate the "unify when possible"
>> rule. What am I missing?
> 
> Functional requirements, such as sorting monolingual
> multiscript text.

This would seem reasonable, except for the number of times we
have been told that block structure, and the ordering of
characters within a block, have nothing to do with collation
sequences.  There is one advantage to keeping scripts together
but it relates only to IDNA: punycode takes advantage of
character proximity for compactness.  So, conversely, if the
characters of a label are scattered all over the place, the
punycode representation will tend to get longer than one would
like, putting a shorter-than-needed length restriction on the
name as the user sees it.
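
To make the compactness point concrete, here is a minimal sketch using
the "punycode" codec from the Python standard library.  The two sample
labels are artificial ones chosen for illustration; the scattered one
would not pass any sensible registration policy.  Only the relative
lengths matter.

```python
def punycode_length(label: str) -> int:
    """Octets in the Punycode form of label, not counting the "xn--" prefix."""
    return len(label.encode("punycode"))

# Six letters, all from the Cyrillic block.
clustered = "\u043f\u0440\u0438\u0432\u0435\u0442"

# Six characters from widely separated blocks
# (Greek, Cyrillic, Arabic, Thai, CJK, Hangul). Illustrative only.
scattered = "\u03b1\u043f\u0627\u0e01\u5b57\ud55c"

for name, label in (("clustered", clustered), ("scattered", scattered)):
    print(name, len(label), "characters ->", punycode_length(label), "octets")
```

Both labels are six characters as the user sees them, but the
scattered one needs noticeably more octets, and a DNS label has only
63 octets to spend, "xn--" prefix included.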

>> If Unicode didn't start
>> with ISO 8859-* and a bunch of other locally developed CCSs as
>> input and had it started with a strong and consistent
>> unification rule, we would certainly have looked at that
>> character and said "it doesn't exist independently in 'Latin'
>> or 'Cyrillic' scripts, it is just an adaptation of a Greek
>> character and should be unified with it".
> 
> No, never, because of the functional requirements. One could
> not expect <o> to sort in three different places in a
> multilingual glossary (Russian, English, Greek).

See above about collation.  And note that, even within the
fairly basic set of decorated Latin characters, logical sort
order is a localization (language at least) issue, not one that
Unicode can possibly address properly.
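
A small sketch of the difference between code point order and
language-specific order.  The two key functions are deliberately crude
stand-ins that I made up for illustration; they are not real collation
tailorings, and a real system would use locale data instead.

```python
A_UMLAUT = "\u00e4"  # LATIN SMALL LETTER A WITH DIAERESIS, precomposed

words = ["zebra", A_UMLAUT + "hre", "apfel"]

# Raw code point order: a-umlaut (U+00E4) lands after "z" (U+007A).
by_code_point = sorted(words)

def german_key(word: str) -> str:
    # Simplification: German dictionary order files a-umlaut with "a".
    return word.replace(A_UMLAUT, "a")

# Simplification: the Swedish alphabet puts a-ring, a-umlaut and
# o-umlaut after "z" as separate letters.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"

def swedish_key(word: str) -> list:
    return [SWEDISH_ALPHABET.index(ch) for ch in word]

german_order = sorted(words, key=german_key)
swedish_order = sorted(words, key=swedish_key)

print(by_code_point)
print(german_order)
print(swedish_order)
```

The same three words come out in different orders for the two
languages, and neither order is derived from where the characters sit
in their block.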

> My point is that those are no different for Kurdish, which has
> Latin, Cyrillic, and Arabic orthographies. Kurdish Cyrillic
> uses Aa, Ee, Oo, Öö, and <Schwa><schwa> already which are
> identical between Latin and Cyrillic, plus it uses Qq and Ww.
> Since we have found Cyrillic Q's which have a different
> capital shape than Latin ones do, it's quite possible that
> CYRILLIC LETTER QA will be added, in which case there is but
> one straggler, CYRILLIC LETTER WE, and my argument is that
> there is no advantage to Kurdish in sticking to the
> unification.

To the extent to which I understand this, I agree with you.  My
only points are (i) that some views of consistency are becoming
the victim of this particular set of requirements and (ii) one
net effect is to introduce more cross-script confusables.
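
For readers who want to see what a cross-script confusable looks like
at the code point level, a rough sketch follows.  The Python standard
library does not expose the Unicode Script property, so scripts_in()
guesses the script from the first word of each character name.  That
heuristic is my assumption for this example and is not a substitute
for real script data.

```python
import unicodedata

def scripts_in(label: str) -> set:
    """Crude script guess: the first word of each character's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in label}

# Three letters that usually render alike but are distinct code points.
lookalikes = ["o", "\u043e", "\u03bf"]
for ch in lookalikes:
    print("U+%04X" % ord(ch), unicodedata.name(ch))

# A label that looks all-Latin but has CYRILLIC SMALL LETTER A
# in the second position.
spoof = "p\u0430ypal"
print(sorted(scripts_in(spoof)))
```

Every character unified across scripts is one fewer pair of this kind;
every character disunified is one more.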

That gets me back to the point I tried to make in my first
posting on this particular thread: we need to understand and
accept that there are some complex tradeoffs involved here and
simplified rules, applied consistently, are not going to help us
very much.

> But I guess this is not the venue for this discussion. I
> understand that a script-ban will not be deeply embedded.

And this is part of the reason why.

    john