Mixing scripts (Re: Unicode versions (Re: Criteria for exceptional characters))

Kenneth Whistler kenw at sybase.com
Tue Dec 19 21:45:39 CET 2006


Harald wrote:

> We still have to sort out a definition of "script" that makes the 
> statement "Don't mix scripts" an actionable statement.

Mark pointed to relevant documents, but...

> 
> If we use the Unicode script names from the Unicode database's 
> "Scripts.txt", 0-9 aren't in "Latin", they're in "Common".
> The combining accents are mostly in "Inherited".

Correct.

> 
> If, by "don't mix scripts", you mean "scripts Common, Inherited and X 
> from unicode/Scripts.txt can be mixed in one string, for any value of X, 
> but no other mixing is allowed", we can discuss that statement.

For the purposes of matching spans of characters, which was the
main original focus of Scripts.txt and its accompanying
specification document, UAX #24:

http://www.unicode.org/reports/tr24/

Characters with the Inherited script property inherit the script
of the base character they are applied to.

Characters with the Common script property are resolved based
on the context of the surrounding characters, a little like
the bidi property value ON is resolved to either R or L, depending
on context in the Unicode Bidirectional Algorithm.
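
To make that resolution concrete, here is a minimal Python sketch. It assumes each character has already been tagged with its script property value from Scripts.txt. The function name and the nearest-neighbor rule for Common are assumptions made for illustration only; UAX #24 does not prescribe this exact procedure.

```python
# Script property values that carry no script identity of their own.
WEAK = {"Common", "Inherited"}

def resolve_weak_scripts(scripts):
    """Resolve weak values in a list of per-character script values.

    Heuristic (an assumption, not taken from UAX #24): a weak value takes
    the script of the nearest preceding strong character; weak values at
    the start of the string take the first strong script that follows.
    """
    resolved = list(scripts)

    # Forward pass: propagate the last strong script onto weak values.
    last_strong = None
    for i, script in enumerate(scripts):
        if script in WEAK:
            if last_strong is not None:
                resolved[i] = last_strong
        else:
            last_strong = script

    # Backward pass: only leading weak values are still unresolved here.
    next_strong = None
    for i in range(len(resolved) - 1, -1, -1):
        if resolved[i] in WEAK:
            if next_strong is not None:
                resolved[i] = next_strong
        else:
            next_strong = resolved[i]

    return resolved
```

For a digit, a Latin letter and a combining accent, resolve_weak_scripts(["Common", "Latin", "Inherited"]) resolves all three to "Latin". A string with no strong characters at all, such as one made only of digits, stays "Common".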

So roughly, you could view the other, explicit script property
values, like Latin and Cyrillic, as *strong* script values,
and Common and Inherited as *weak* script values. An implementation
that was looking for "mixed script" in a string would be
checking for the coexistence of more than one strong script
value in that string.
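
A rough Python sketch of such a check follows. The range table is a tiny hand-picked excerpt standing in for the real Scripts.txt, and the function names are hypothetical. A real implementation would load the full file from the Unicode Character Database. Code points missing from the excerpt are reported as "Unknown" and conservatively treated as strong.

```python
# Illustrative excerpt only, NOT the full Scripts.txt: (first, last, script).
SCRIPT_RANGES = [
    (0x0020, 0x0040, "Common"),     # space, ASCII punctuation, digits 0-9
    (0x0041, 0x005A, "Latin"),      # A-Z
    (0x0061, 0x007A, "Latin"),      # a-z
    (0x00C0, 0x00D6, "Latin"),
    (0x00D8, 0x00F6, "Latin"),
    (0x00F8, 0x024F, "Latin"),
    (0x0300, 0x0341, "Inherited"),  # combining accents (partial range)
    (0x0391, 0x03A1, "Greek"),
    (0x03A3, 0x03C9, "Greek"),
    (0x0410, 0x044F, "Cyrillic"),
]

# Weak values never count toward script mixing.
WEAK = {"Common", "Inherited"}

def script_of(ch):
    """Look up the script of one character in the excerpt table."""
    cp = ord(ch)
    for first, last, script in SCRIPT_RANGES:
        if first <= cp <= last:
            return script
    return "Unknown"  # not covered by this excerpt

def strong_scripts(text):
    """Return the set of strong script values that occur in text."""
    return {s for s in map(script_of, text) if s not in WEAK}

def is_mixed_script(text):
    """True if more than one strong script value coexists in text."""
    return len(strong_scripts(text)) > 1
```

With this table, is_mixed_script("cafe\u0301-123") is False, since only Latin is strong there, while is_mixed_script("p\u0430ypal") is True because U+0430 is a Cyrillic letter sitting among Latin ones.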

> But I'm 
> not at all sure we're all talking about the same thing when we discuss 
> the statement.

O.k., now are we talking about the same thing? See the discussion
on using script names in regular expressions in UAX #24 for a
more formal description of the intent.

> Is there a list of the Unicode codepoints known to be used in each of 
> the ISO 15924 script codes?

That is an ill-formed question. ISO 15924 defines script codes.
It does not define repertoires or associate code points with
those script codes. So you can't have sets of Unicode code points
"in each ISO 15924 script code".

The closest you are going to get to a repertoire partitioning
of Unicode into scripts is Scripts.txt, the very file we have
been talking about and using for the development of the
inclusions file.

--Ken



