Mixing scripts (Re: Unicode versions (Re: Criteria for exceptional characters))

Sun Dec 24 18:06:54 CET 2006

--On Sunday, 24 December, 2006 14:00 +0000 Michael Everson
<everson at evertype.com> wrote:

> At 22:00 +0900 2006-12-24, Martin Duerst wrote:
> 
>> Obviously, disallowing the mixing of Latin and Cyrillic in
>> general, at least at this point in time, would punish those
>> languages that use an occasional Q or W or whatever from
>> Latin amidst Cyrillic.
> 
> It is Kurdish, and the two letters are for other functional
> reasons being proposed for addition to the standard.

As part of my continued effort to understand which rules apply
and when: adding them to the standard and identifying them as
"Cyrillic" would seem to violate the "unify when possible" rule.
What am I missing?

> So for
> the sake of argument, assume that this particular reason does
> not apply.

Sure.  But the principle is, IMO, even more difficult.  Let's
take the character "o" as an example.  If Unicode had not
started with ISO 8859-* and a bunch of other locally developed
CCSs as input, and had instead started with a strong and
consistent unification rule, we would certainly have looked at
that character and said "it doesn't exist independently in
'Latin' or 'Cyrillic' scripts, it is just an adaptation of a
Greek character and should be unified with it".  Or one might look
back further and make a case for unification with Phoenician Ain
(U+1090F).  I think that the latter would be a stretch, but the
case could be made. 

If there were a clear boundary among Latin, Greek, and Cyrillic
as there is (we hope) between Latin and CJK ideographs, we could
make strong rules on the basis of script.  But, without such
boundaries, we had best be very careful about the rules we cast
into stone... or try to leave them to local discretion (of
either registries or application designers).

> Why then would mixing Latin and Greek and Cyrillic at (at
> least) the same level not be disallowed in IDNs and IRIs to
> avoid security problems?

I think the error here is in trying to draw a firm line between
"security problems" and "causing more opportunities for user
confusion than necessary".  Those confusion problems that are
more severe than others, or for which the possibilities are more
obvious, can be characterized as security issues, but that is
just some point along a spectrum.  If we were trying to be as
secure as possible, short of just not having a network, we would
prohibit entirely every Greek or Cyrillic character that looks
like a Latin one that has been historically permitted.  That
would not make practical sense, so we don't do it.
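
To make the trade-off concrete, here is a minimal Python sketch
of a blanket "no mixed scripts" check.  It is an illustration
only, not a rule from any IDN specification: the helper names
(script_of, scripts_in, is_mixed) and the sample labels are
hypothetical, and guessing the script from the first word of
the Unicode character name is a rough stand-in for the real
Script property, which Python's standard library does not
expose.

```python
import unicodedata

# The three scripts under discussion (an assumption of this sketch).
SCRIPTS = ("LATIN", "GREEK", "CYRILLIC")

def script_of(ch):
    """Guess a script from the Unicode character name; None if neutral."""
    name = unicodedata.name(ch, "")
    for script in SCRIPTS:
        if name.startswith(script + " "):
            return script
    return None  # digits, hyphen, and everything else count as neutral

def scripts_in(label):
    """Return the set of scripts that appear in a label."""
    return {s for s in map(script_of, label) if s is not None}

def is_mixed(label):
    """True if the label draws on more than one of the three scripts."""
    return len(scripts_in(label)) > 1

# Three visually similar letters "o": Latin, Greek omicron, Cyrillic.
lookalikes = "o\u03bf\u043e"

# A Latin label with one Cyrillic "a" (U+0430) substituted in.
spoofed = "p\u0430ypal"

# Cyrillic letters followed by a Latin "q", standing in for the
# kind of legitimate mixed-script text mentioned earlier.
cyrillic_with_q = "\u043a\u0443q"
```

Both spoofed and cyrillic_with_q are reported as mixed by
is_mixed, which is the difficulty: a rule this simple cannot
tell the confusing case from the legitimate one, and it says
nothing about a label written entirely in lookalikes from a
single script.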

Where we can gain high leverage on a problem by restricting
something that does not appear to have compelling value, we
should make that restriction.    But the first part of that
sentence calls for many judgments about risk and value and we
need, IMO, to understand that we are making those judgments at
every step and, perhaps, that we are not able to be any more
consistent about them than the decisions about script
boundaries, or the nature of writing systems themselves, tend to
be.

     john