Table issues (Part 2)
patrik at frobbit.se
Wed Dec 5 23:42:14 CET 2007
Ok, then I would like you to have a very, very close look at the
properties I use in the tables document that lead to the value ALWAYS
for a codepoint. I should probably list explicitly which properties
these are. Script is one.
My point is that if a change is made that moves codepoints out of
ALWAYS or NEVER, then that change is one that causes these
"exceptions because of backward compatibility" tables to grow.
Specifically, I see on the unicore list a suggestion to move a
codepoint from Cyrillic to Inherited. That would, if the algorithm
stays exactly as it is today, move (if my calculations are correct)
the codepoint from ALWAYS to MAYBE YES.
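A minimal sketch of this concern in Python. It assumes two hypothetical tables that map code points to derived values: one already published for the previous version, and one freshly calculated for the new version. The function name, the placeholder code points and their values are invented for illustration and are not taken from the tables document.

```python
# Values that, per the argument above, should not change once published.
STICKY = {"ALWAYS", "NEVER"}

def backcompat_exceptions(published, calculated):
    """Return {code point: (old, new)} for every code point that the fresh
    calculation would move out of ALWAYS or NEVER."""
    return {
        cp: (old, calculated[cp])
        for cp, old in published.items()
        if old in STICKY and cp in calculated and calculated[cp] != old
    }

# Hypothetical data: 0xF0001 stands in for a code point whose Script value
# changed, so the unchanged algorithm now yields MAYBE YES instead of ALWAYS.
published = {0xF0000: "ALWAYS", 0xF0001: "ALWAYS", 0xF0002: "NEVER"}
calculated = {0xF0000: "ALWAYS", 0xF0001: "MAYBE YES", 0xF0002: "NEVER"}

exceptions = backcompat_exceptions(published, calculated)
print(exceptions)
```

Every entry in the returned mapping is one more row in the exceptions table, which is why each such property change matters.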
On 5 dec 2007, at 16.29, Mark Davis wrote:
> One other addition to what Ken said. Patrik asked:
> Are all old property values part of the Unicode Distribution? So this
> derivation of the property value can be done on "any normal
> installation of Unicode data files"?
> Since the backwards compatibility relationship is transitive, for
> version X one only needs to have the values for version X-1. So once
> the IDN values for Unicode version X-1 are published, one can use those
> with Version X Unicode properties to derive the new IDN property values
> for X.
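The transitivity argument can be sketched as follows. This is only an illustration under stated assumptions: `derive_idn_values`, the table layout and the sample values are hypothetical, the integer keys are placeholders for code points, and only the ALWAYS rule quoted further down in this thread is modelled.

```python
def derive_idn_values(previous_published, calculated, sticky=("ALWAYS",)):
    """Combine the published values for version X-1 with the values freshly
    calculated from version X properties. A sticky value from X-1 wins."""
    result = dict(calculated)
    for cp, old in previous_published.items():
        if old in sticky:
            result[cp] = old
    return result

# Hypothetical fresh calculations for three successive versions.
calc_v1 = {1: "ALWAYS", 2: "MAYBE YES"}
calc_v2 = {1: "MAYBE YES", 2: "ALWAYS", 3: "NEVER"}
calc_v3 = {1: "MAYBE YES", 2: "MAYBE YES", 3: "NEVER"}

pub_v1 = dict(calc_v1)                        # first version: nothing to inherit
pub_v2 = derive_idn_values(pub_v1, calc_v2)
pub_v3 = derive_idn_values(pub_v2, calc_v3)   # only X-1 is consulted

print(pub_v3)
```

Placeholder 1 was calculated as ALWAYS only in the first version, yet it stays ALWAYS in the third because the published table for the second version already carries it forward.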
> This may point to a slightly different perspective on these tables. In
> our experience, the derivation is important to document clearly, so that
> people can verify the tables with each release, but what programmers
> really need are the tables of values. Nobody is going to want to compute
> the values from scratch when they are used. What they really want to do
> is go someplace to get an approved, machine-readable table of values
> that they can update to.
> Note: The tables are not only defined by the Unicode properties, they
> are also defined by categories that are not currently available as
> Unicode properties, but just lists in the document, like
> - Category E - Historical Scripts,
> or the predominant position given to European scripts in
> - 3.1.1. Scripts derived from Common European Scripts
> With each new release of Unicode, those tables will need to be updated,
> and the process for doing so needs to be described and justified, very
> On Dec 5, 2007 12:54 PM, Kenneth Whistler <kenw at sybase.com> wrote:
>> Patrik asked a question about Mark's comment:
>>> On 5 dec 2007, at 12.51, Mark Davis wrote:
>>>> This category consists of code points in Unicode that were assigned
>>>> to ALWAYS by applying the calculation rules in Section 3 to any
>>>> version of Unicode extending back to Unicode version 5.1. That is,
>>>> if the current version of Unicode were 7.0, and character X had been
>>>> determined to be ALWAYS according to an application of the rules in,
>>>> say, Unicode
>>>> then it has a backward-compatibility value of ALWAYS.
>>> Ok, then it was as I thought.
>>> Are all old property values part of the Unicode Distribution?
>> The answer to that particular question, phrased that way, is no.
>> A "Unicode Distribution" consists of the current text of
>> the standard, any delta documents, the Annexes, and -- most
>> importantly for this discussion -- the Unicode Character
>> Database for that version.
>> The Unicode Character Database for a particular version
>> contains the listing of *current* property values for all properties
>> for *that* version -- not the accumulated history of all
>> values of all properties for all versions.
>> All past releases are, of course, archived -- and not in
>> hidden archives -- all the files for each past version
>> are available online from http://www.unicode.org/Public/
>> so in principle it *is* possible to build a complete
>> delta history for all properties. And in fact, Eric Muller
>> has done something along those lines in building
>> XML format files for all versions of the standard.
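A rough sketch of how such a delta could be computed for one property from two archived data files. It relies only on the general UCD file layout (semicolon-separated fields, `#` comments, `XXXX..YYYY` ranges); the function names are hypothetical and the sample lines are invented, not copied from any real release.

```python
def parse_property_file(text):
    """Parse lines of the form 'XXXX ; Value # comment' or
    'XXXX..YYYY ; Value # comment' into {code point: value}."""
    values = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blanks
        if not line:
            continue
        code_range, value = (field.strip() for field in line.split(";")[:2])
        first, _, last = code_range.partition("..")
        for cp in range(int(first, 16), int(last or first, 16) + 1):
            values[cp] = value
    return values

def property_delta(old, new):
    """Return {code point: (old value, new value)} for changed or new entries."""
    return {cp: (old.get(cp), val) for cp, val in new.items() if old.get(cp) != val}

# Invented sample data in the UCD layout, using placeholder code points.
old_text = "# sample for version X-1\nF0000..F0002 ; Cyrillic # invented\n"
new_text = "F0000 ; Cyrillic\nF0001 ; Inherited\nF0002 ; Cyrillic\n"

delta = property_delta(parse_property_file(old_text), parse_property_file(new_text))
print(delta)
```

Running the same comparison over each pair of adjacent versions would give the full history for that property.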
>>> So this
>>> derivation of the property value can be done on "any normal
>>> installation of Unicode data files"?
>> But as stated, I assume "any normal installation of
>> Unicode data files" is intended to mean Unicode data files for a
>> particular version only -- and no, those do not contain
>> the complete accumulated property history in them.
>> However, the question itself indicates hidden assumptions
>> that are inaccurate.
>> The Unicode Character Database defines approximately 100
>> character properties now -- not counting the many
>> provisional properties defined just for Han characters.
>> Of those properties, only a minority are subject to
>> strict stability requirements that involve the kind
>> of backwards-compatible derivation mechanism that
>> Mark has been talking about.
>> Some of the strictly stabilized properties are simply
>> *immutable*. An example is the character name itself.
>> No change is *ever* allowed for those. So no derivation
>> is required to enforce backwards compatibility.
>> But the kinds of properties that Mark has in mind include
>> the identifier-related properties. The stability
>> guarantee, generally stated, is that once a character
>> is allowed in an identifier-related property, it can
>> never be removed. But since some of the other
>> properties (such as the General_Category) which are
>> used in the derivation of identifier-related properties,
>> may themselves be subject to change between versions,
>> the derivations include a contributory property (or two
>> contributory properties) whose only function is to
>> guarantee stability of the derived property.
>> In the case of identifier-related character properties,
>> those contributory properties are:
>> The characters assigned to those properties in a version
>> of the standard are given those properties so that a
>> derivation of the sort:
>> # Derived Property: ID_Start
>> # Characters that can start an identifier.
>> # Generated from Lu+Ll+Lt+Lm+Lo+Nl+Other_ID_Start
>> can guarantee that ID_Start meets the stability guarantee
>> just mentioned, *even if* a particular character has
>> its General_Category value modified between versions
>> of the standard.
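The derivation comment quoted above can be sketched in Python. This is an illustration only: it follows just the formula in that comment, `is_id_start` is a hypothetical name, and the Other_ID_Start set is a small sample that should be checked against PropList.txt for the version in use, since Python's `unicodedata` module does not expose contributory properties.

```python
import unicodedata

# General_Category values named in the quoted derivation comment.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

# Sample of Other_ID_Start (an assumption to verify against PropList.txt).
OTHER_ID_START_SAMPLE = frozenset({0x2118, 0x212E, 0x309B, 0x309C})

def is_id_start(ch, other_id_start=OTHER_ID_START_SAMPLE):
    """ID_Start = Lu+Ll+Lt+Lm+Lo+Nl+Other_ID_Start for a single character."""
    return (unicodedata.category(ch) in ID_START_CATEGORIES
            or ord(ch) in other_id_start)

print(is_id_start("A"))                    # letter, allowed by category
print(is_id_start("\u2118"))               # allowed only via the extra set
print(is_id_start("\u2118", frozenset()))  # category alone is not enough
```

The last two calls show the role of the contributory set: U+2118 has a symbol General_Category, so only its membership in the extra set keeps it in the derived property.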
>> To complicate matters, not all contributory properties in the
>> UCD that are used in property derivation are provided for
>> stabilization of properties. Other_Math is an example. It is
>> used in the derivation:
>> # Derived Property: Math
>> # Generated from: Sm + Other_Math
>> But it is simply a convenience property for getting the
>> right results for the Math property, because there is no
>> other simple way to define the appropriate set of characters.
>> The existence of the Other_Math property doesn't imply
>> that there is a stability guarantee for the Math property
>> per se. And in fact it was just adjusted substantially
>> for Unicode 5.1 to include a new group of characters the
>> mathematical community wanted in it.
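For comparison, the Math derivation has the same shape, even though no stability guarantee is attached to it. A short sketch, assuming a hypothetical `is_math` helper and a one-element sample of Other_Math (to be checked against PropList.txt):

```python
import unicodedata

def is_math(ch, other_math):
    """Math = Sm + Other_Math, following the quoted derivation comment."""
    return unicodedata.category(ch) == "Sm" or ord(ch) in other_math

# Sample only: CIRCUMFLEX ACCENT, whose General_Category is Sk, not Sm.
other_math_sample = {0x005E}

print(is_math("+", set()))              # covered by Sm alone
print(is_math("^", set()))              # missed without the extra set
print(is_math("^", other_math_sample))  # picked up via Other_Math
```

Here the extra set only extends the result and can be adjusted freely between versions.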
>> As it stands right now, as of the imminent Unicode 5.1
>> (and unchanged from Unicode 5.0), the *only* contributory
>> character properties involved in these kinds of
>> stabilization guarantees provided by derivation are:
>> (both used in guaranteeing stability of identifier properties)
>> (both used in guaranteeing stability of the casefolding spec)
>> (used in guaranteeing stability of normalization)
>> Then there is one other data file, NormalizationCorrections.txt,
>> which doesn't define a character property for a version
>> of the standard, but which *does* recapitulate the history
>> of the two normalization corrections in such a way as
>> to allow implementations to provide bug-compatible
>> backwards stability, if they need to.
>> So getting back to the original question. It should be
>> posed as follows:
>> Are all of the contributory properties required to correctly
>> derive stable versions of properties subject to
>> stability guarantees between versions of the Unicode
>> Standard provided for each "Unicode Distribution"?
>> And the answer to *that* question would be yes.
>> P.S. Notably, the Script property never has been subject
>> to a strict stability guarantee, in part for reasons that
>> Mark has separately outlined.