Mixing scripts (Re: Unicode versions (Re: Criteria for exceptional characters))

John C Klensin klensin at jck.com
Sun Dec 24 15:07:28 CET 2006



--On Sunday, 24 December, 2006 22:00 +0900 Martin Duerst
<duerst at it.aoyama.ac.jp> wrote:

>...
>> Why would mixing Latin and Greek and Cyrillic at (at least)
>> the same level not be disallowed in IDNs and IRIs to avoid
>> security problems?
> 
> Obviously, disallowing the mixing of Latin and Cyrillic in
> general, at least at this point in time, would punish those
> languages that use an occasional Q or W or whatever from Latin
> amidst Cyrillic.

Let me take this one step further, picking up on several
comments from others.  

As I trust everyone knows, there are controversies in the
anthropological linguistics community about how many writing
systems there are whose origins are completely independent, but
the number is not large -- some scholars would claim as few as
two or three.  We also know that writing systems evolve, adapt,
and absorb characters from geographically or culturally close
other ones.  To use Martin's example, it is not an accident that
the Roman-derived character "W" is called double-u.  It is not a
character that Virgil or Cicero ever saw, even though its usual
glyph is more commonly constructed to resemble a "VV" ligature
than a "UU" one (e.g., characters with curvy parts are rather
hard to chisel into stone).

Given that there are languages that use mostly-Cyrillic writing
systems but that have adapted some extra characters from
Roman-derived or (other) Greek-derived alphabets, the primary
things that cause us to see those extra characters as "imported
Latin" rather than part of the Cyrillic set are, ultimately:

	* Cyrillic is often defined in terms of the writing
	system of Russian and _very_ closely related Slavic
	languages.  When the languages that appropriated these
	extra characters were not used by large populations or
	lacked political or standards power, the characters
	lost out.
	
	* At least some of those losses occurred as a
	consequence of choices made about what got incorporated
	into assorted KOI groupings and GOST standards and hence
	into ISO 8859-5: if a Latin-look-alike character was
	infrequently used in Russian, the disadvantages of
	wasting a code point outweighed the collation sequence
	issues.

What I am suggesting here, and think we should keep in mind, is
that while "script" may be the best tool we have, the boundaries
between scripts as defined in Unicode are somewhat arbitrary in
terms of real-world practices.  I don't see that as a problem,
unless we try to push the script list very far into territory
for which it was not designed or optimized.
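
As a rough illustration only (a sketch, not anything proposed in
this thread): a blanket "no mixed scripts" test cannot tell an
orthography that borrows a Latin letter from a deliberate
look-alike.  The helper names below are hypothetical, and the
script of a character is approximated from the first word of its
Unicode character name, which is an assumption and not the real
Unicode Script property.

```python
import unicodedata

# Assumption: approximate "script" by the leading word of the character name.
# This is NOT the Unicode Script property; it is only good enough for a sketch.
KNOWN_SCRIPTS = ("LATIN", "CYRILLIC", "GREEK")


def rough_script(ch):
    """Return LATIN, CYRILLIC, GREEK, or OTHER for a single character."""
    name = unicodedata.name(ch, "")
    for script in KNOWN_SCRIPTS:
        if name.startswith(script):
            return script
    return "OTHER"


def scripts_in_label(label):
    """Collect the rough scripts of all alphabetic characters in a label."""
    return {rough_script(ch) for ch in label if ch.isalpha()}


def is_mixed(label):
    """True if the label draws on more than one of the known scripts."""
    return len(scripts_in_label(label) - {"OTHER"}) > 1


# Hypothetical labels: Cyrillic text with a borrowed Latin "w", and a
# Latin word with one Cyrillic look-alike letter substituted.
borrowed = "\u0434\u0430w"
lookalike = "p\u0430ypal"
print(is_mixed(borrowed), is_mixed(lookalike))
```

Both hypothetical labels are treated identically by such a test,
which is the difficulty with applying script boundaries
mechanically.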

    john



More information about the Idna-update mailing list