Table issues (Part 2)

Wed Dec 5 21:54:21 CET 2007

Patrik asked a question about Mark's comment:

> On 5 dec 2007, at 12.51, Mark Davis wrote:
> 
> > This category consists of code points in Unicode that were assigned to
> > ALWAYS by applying the calculation rules in Section 3 to any previous
> > version of Unicode extending back to Unicode version 5.1. That is, if the
> > current version of Unicode were 7.0, and character X had been determined to
> > be ALWAYS according to an application of the rules in, say, Unicode 6.1,
> > then it has a backward-compatibility value of ALWAYS.
> 
> 
> Ok, then it was as I thought.
> 
> Are all old property values part of the Unicode Distribution? 

The answer to that particular question, phrased that way, is no.

A "Unicode Distribution" consists of the current text of
the standard, any delta documents, the Annexes, and -- most
importantly for this discussion -- the Unicode Character
Database for that version.

The Unicode Character Database for a particular version
contains the listing of *current* property values for all properties
for *that* version -- not the accumulated history of all
values of all properties for all versions.

All past releases are, of course, archived -- and not in
hidden archives -- all the files for each past version
are available online from http://www.unicode.org/Public/
so in principle it *is* possible to build a complete
delta history for all properties. And in fact, Eric Muller
has done something along those lines in building
XML format files for all versions of the standard.
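
As a rough illustration of what building such a delta history involves,
here is a minimal Python sketch. It assumes input text in the
"range ; Property_Name # comment" layout used by files such as
PropList.txt. The function names and the sample data (including
Example_Prop) are made up for illustration; a real run would read the
archived files for two versions instead.

```python
def parse_prop_file(text):
    """Map each property name to the set of code points listed for it."""
    props = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        rng, name = [field.strip() for field in line.split(";")[:2]]
        first, _, last = rng.partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo
        props.setdefault(name, set()).update(range(lo, hi + 1))
    return props


def prop_delta(old, new):
    """Return {property: (added, removed)} for properties that changed."""
    delta = {}
    for name in old.keys() | new.keys():
        added = new.get(name, set()) - old.get(name, set())
        removed = old.get(name, set()) - new.get(name, set())
        if added or removed:
            delta[name] = (added, removed)
    return delta


# Made-up sample data standing in for two versions of one file.
OLD_TEXT = "0041..0043 ; Example_Prop # hypothetical entry\n"
NEW_TEXT = "0041..0042 ; Example_Prop\n0044 ; Example_Prop\n"

delta = prop_delta(parse_prop_file(OLD_TEXT), parse_prop_file(NEW_TEXT))
```

Repeating the comparison for each adjacent pair of archived versions
would give the accumulated history that no single version's files
contain.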

> So this derivation of the property value can be done on "any normal
> installation of Unicode data files"?

As stated, though, I assume "any normal installation of
Unicode data files" means the Unicode data files for a
particular version only -- and no, those do not contain
the complete accumulated property history.

However, the question itself indicates hidden assumptions
that are inaccurate.

The Unicode Character Database defines approximately 100
character properties now -- not counting the many
provisional properties defined just for Han characters.
Of those properties, only a minority are subject to
strict stability requirements that involve the kind
of backwards-compatible derivation mechanism that
Mark has been talking about.

Some of the strictly stabilized properties are simply
*immutable*. An example is the character name itself.
No change is *ever* allowed for those. So no derivation
is required to enforce backwards compatibility.

But the kinds of properties that Mark has in mind include
the identifier-related properties. The stability
guarantee, generally stated, is that once a character
is allowed in an identifier-related property, it can
never be removed. But since some of the other
properties (such as the General_Category) which are
used in the derivation of identifier-related properties,
may themselves be subject to change between versions,
the derivations include a contributory property (or two
contributory properties) whose only function is to
guarantee stability of the derived property.

In the case of identifier-related character properties,
those contributory properties are:

Other_ID_Start
Other_ID_Continue

The characters assigned to those properties in a version
of the standard are given those properties so that a
derivation of the sort:

# Derived Property: ID_Start
#  Characters that can start an identifier.
#  Generated from Lu+Ll+Lt+Lm+Lo+Nl+Other_ID_Start

can guarantee that ID_Start meets the stability guarantee
just mentioned, *even if* a particular character has
its General_Category value modified between versions
of the standard.
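
To make the mechanism concrete, here is a minimal Python sketch of the
derivation in the comment above. It is not the actual UCD tooling. The
Other_ID_Start set shown is only an illustrative subset (a real
implementation would read the full list from PropList.txt), and the
function name and its parameters are assumptions made for this example.

```python
import unicodedata

# General_Category values named in the derivation comment above.
ID_START_CATEGORIES = frozenset({"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"})

# Illustrative subset of Other_ID_Start, not the complete property.
OTHER_ID_START = frozenset({0x2118, 0x212E, 0x309B, 0x309C})


def is_id_start(cp, category=unicodedata.category,
                other_id_start=OTHER_ID_START):
    """Return True if code point cp falls in the sketched ID_Start set.

    `category` is injectable so that a different version's
    General_Category data can be substituted for Python's built-in table.
    """
    return category(chr(cp)) in ID_START_CATEGORIES or cp in other_id_start


def symbol_category(ch):
    # Hypothetical lookup: pretend every character is a math symbol (Sm),
    # standing in for a version where the General_Category is not a letter.
    return "Sm"


# Listed in Other_ID_Start, so it stays in ID_Start regardless of category.
kept = is_id_start(0x2118, category=symbol_category)

# Without the contributory property, the same character would drop out.
dropped = is_id_start(0x2118, category=symbol_category,
                      other_id_start=frozenset())
```

The point of the sketch is the second operand of the `or`: membership in
the contributory set is what keeps a character in the derived property
when its General_Category no longer qualifies it.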

To complicate matters, not all contributory properties
used in property derivation in the UCD exist to
stabilize properties. Other_Math is
an example. It is used in the derivation:

# Derived Property: Math
#  Generated from: Sm + Other_Math

But it is simply a convenience property for getting the
right results for the Math property, because there is no
other simple way to define the appropriate set of characters.
The existence of the Other_Math property doesn't imply
that there is a stability guarantee for the Math property
per se. And in fact it was just adjusted substantially
for Unicode 5.1 to include a new group of characters the
mathematical community wanted in it.

As it stands right now, as of the imminent Unicode 5.1
(and unchanged from Unicode 5.0), the *only* contributory
character properties involved in these kinds of
stabilization guarantees provided by derivation are:

Other_ID_Start
Other_ID_Continue

(both used in guaranteeing stability of identifier properties)

Other_Uppercase
Other_Lowercase

(both used in guaranteeing stability of the casefolding spec)

Composition_Exclusion

(used in guaranteeing stability of normalization)

Then there is one other data file, NormalizationCorrections.txt,
which doesn't define a character property for a version
of the standard, but which *does* recapitulate the history
of the two normalization corrections in such a way as
to allow implementations to provide bug-compatible
backwards stability, if they need to.
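
As a hedged sketch of how an implementation might use that file, the
Python below picks a decomposition depending on which version's
behavior is being emulated. The field layout (code point; original
decomposition; corrected decomposition; version in which the correction
was made) and the sample line are assumptions to be checked against the
actual NormalizationCorrections.txt. The function names are made up for
this example.

```python
# Sample line modeled on NormalizationCorrections.txt; verify the layout
# and values against the real file before relying on them.
SAMPLE = "F951;96FB;964B;3.2.0 # Corrigendum 3\n"


def parse_version(text):
    """Turn '3.2.0' into (3, 2, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def parse_corrections(text):
    """Map code point -> (original, corrected, version of correction)."""
    corrections = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        cp, original, corrected, version = (
            field.strip() for field in line.split(";"))
        corrections[int(cp, 16)] = (
            int(original, 16), int(corrected, 16), parse_version(version))
    return corrections


def decomposition_for(cp, target_version, corrections):
    """Pick the mapping that was in effect for target_version."""
    original, corrected, fixed_in = corrections[cp]
    if parse_version(target_version) < fixed_in:
        return original  # bug-compatible: the mapping before the fix
    return corrected


corrections = parse_corrections(SAMPLE)
```

An implementation that only ever targets the current version can ignore
the file entirely; the lookup matters only when reproducing the results
of an older version.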

So, getting back to the original question, it should be
posed as follows:

Are all of the contributory properties required to correctly
derive stable versions of properties subject to
stability guarantees between versions of the Unicode
Standard provided for each "Unicode Distribution"?

And the answer to *that* question would be yes.

--Ken

P.S. Notably, the Script property never has been subject
to a strict stability guarantee, in part for reasons that
Mark has separately outlined.