Mixing scripts (Re: Unicode versions (Re: Criteria for exceptional characters))

Kenneth Whistler kenw at sybase.com
Tue Dec 19 23:24:55 CET 2006


Harald,

> Thanks for pointing out the relevant TR for the use of script codes, and 
> the special status of "Common" and "Inherited". The algorithm grows....

Actually "the algorithm" does *not* grow.

According to the direction that Mark and I have been suggesting,
"the algorithm", whether we are talking about StringPrep proper
or IDNA more generally, does not contain any mixed script detection.
You don't need to implement any of the regular expression
matching suggested in UAX #24.

The development of the inclusion list for IDNAbis consists simply
of examining the script property of individual code points and
omitting certain values from the table we build. There is nothing
in there about the script value (or values) associated with
strings. And properly speaking even *that* isn't part of the
algorithm that will eventually be implemented for StringPrep,
but is instead just another (single) rule used in the
construction of the inclusion list. (Although I suppose in
the sense meant by faltstrom-idnabis-table, that is part of
the "algorithm" used to create the table.)
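
To make the per-code-point nature of that rule concrete, here is a
minimal sketch in Python. It is only an illustration under stated
assumptions: the sample data is a short excerpt in the Scripts.txt
line format, the names parse_scripts and build_inclusion_list are
made up for this example, and the set of omitted script values is
purely hypothetical (nothing above says which values would actually
be omitted).

```python
# Abbreviated sample in the Scripts.txt line format: "range ; script value".
SAMPLE_SCRIPTS_TXT = """\
0030..0039    ; Common
0041..005A    ; Latin
0061..007A    ; Latin
0300..036F    ; Inherited
16A0..16EA    ; Runic
"""

# Hypothetical choice of script values to omit; for illustration only.
OMITTED_SCRIPTS = {"Runic"}


def parse_scripts(text):
    """Yield (first, last, script) for each data line of the sample."""
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        span, script = (field.strip() for field in line.split(";"))
        first, _, last = span.partition("..")
        # A single code point has no ".." part, so last falls back to first.
        yield int(first, 16), int(last or first, 16), script


def build_inclusion_list(text, omitted):
    """Return the set of code points whose script value is not omitted."""
    included = set()
    for first, last, script in parse_scripts(text):
        if script not in omitted:
            included.update(range(first, last + 1))
    return included


inclusion = build_inclusion_list(SAMPLE_SCRIPTS_TXT, OMITTED_SCRIPTS)
```

Note that nothing in the sketch ever looks at a string: each code
point is kept or dropped on its own script value, and no mixed
script check is involved.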

And the complexity of the mixed script detection heuristics is
one of the reasons why none of us (except Michael) is suggesting
that it be incorporated as part of the IDNAbis protocol definition
per se.

> 
> --On 19. desember 2006 12:45 -0800 Kenneth Whistler <kenw at sybase.com> wrote:
> 
> >> Is there a list of the Unicode codepoints known to be used in each of
> >> the ISO 15924 script codes?
> >
> > That is an ill-formed question. ...
> 
> I take it this means the answer to my question is "no", since the script 
> names in Scripts.txt and the ISO 15924 codes don't match up.

Actually, I said "ill-formed" advisedly. Answering merely yes or
no to a question that has a mistaken presupposition in it
is a recipe for miscommunication.

So the answer is not merely no, but also that there cannot
be such a list because of the nature of ISO 15924 and
because of the complexity of the notion of "script".

ISO 15924 script codes are intended first to serve as *bibliographic*
script codes. In that function, they can be values in bibliographic
records that specify the script (or scripts) that occur in
a book (or other bibliographic entity), and which thereby help
in cataloging. In that sense, a Script=Inherited value from
Scripts.txt is meaningless. And you can have unusual situations
where a book might, for instance, have facing page translations
using different scripts. In such a case the book itself would
be cataloged with two script codes, but no single page actually
mixes scripts, let alone single lines or words in the text.

On the other hand, ISO 15924 has codes for
script variants that are relevant to typography and to bibliographic
cataloging (such as Fraktur and Gaelic), but which are not
recognized by UAX #24 and Scripts.txt, because they are subtypes of
a particular value, rather than distinct values that help
partition the space of code points.

Scripts.txt, like a number of other Unicode character property
definitions, defines an enumerated property that partitions
*all* Unicode code points, assigned or unassigned. It therefore
is addressing a somewhat different problem than ISO 15924.

UAX #24 and ISO 15924 are, however, maintained in parallel, and
for the *strong* scripts I mentioned before, they attempt to
maintain comparable values and even the same labels, where
possible. This is simply because there is no particularly good
reason to diverge for the identification of obvious script
identities such as Balinese, for example, where the UAX #24
functions (string span matching, regular expression matching,
font selection, etc.) and the ISO 15924 functions (bibliographic
cataloging, and application to language tag definitions)
overlap in identifying a script category.

Trying to assign code points to every script code is also
problematical because the results are rather non-deterministic.
The fact is that letters get borrowed back and forth from one
script to another, and while you might at some point decide
that a letter has become adapted into the borrowing script and
become a new element in it, historically it is very difficult
sometimes to make such determinations during the period of
borrowing and adaptation. Graphologists will disagree.
Character encoders will disagree.

Recent example: the Limbu language is written in Devanagari,
as well as in the Limbu script (and the Latin script). When
written in Devanagari, Limbu borrowed in a glottal stop
from IPA (Latin script). That glottal stop was then adapted
somewhat, so that it would harmonize with a Devanagari font,
but still, it basically is a glottal stop.

So how do you cut the pie on that one? You could say that
the glottal stop is just Latin script, and got borrowed in
and used with the Devanagari script for Limbu. Or you could
say that the adapted form of the glottal stop had become a
new letter of the Devanagari script, thus distinguished from
the Latin script, despite its recent and obvious origin.
The UTC chose the latter as the encoding solution, in part
because it makes Limbu text more likely to appear correct
if it uses a character in a font specifically designed for
the Devanagari script. It also makes span matching for
Limbu in Devanagari a little easier.
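
For anyone who wants to look at the two characters side by side,
here is a small sketch using Python's standard unicodedata module.
I am assuming the code points at issue are U+0294 and U+097D; the
module does not expose the Script property, so the character names
stand in for it here.

```python
import unicodedata

# Assumed code points: the IPA (Latin) glottal stop and the separately
# encoded Devanagari letter.
GLOTTAL_STOPS = (0x0294, 0x097D)

names = {cp: unicodedata.name(chr(cp)) for cp in GLOTTAL_STOPS}

for cp, name in names.items():
    print(f"U+{cp:04X} {name}")
```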

But this determination was disliked by *India*, which considers
itself the principal stakeholder for Devanagari, and which
doesn't see the glottal stop as a part of the Devanagari
script at all, for historic reasons.

There are a large number of such edge cases in the history
of writing systems and the separate history of the encoding
of characters for the representation of writing systems.
You can't expect to get a well-formed, black-and-white
answer that simply partitions things for all characters in
all writing systems. You may even end up with muddled
situations where within the same *script*, you get different
determinations for the status of certain characters, depending
on which writing system that script is being used for.

--Ken



More information about the Idna-update mailing list