Normative data files (Re: Normalization of Hangul)

Thu Feb 21 21:02:50 CET 2008

Harald noted:

> And I'm very far from sure I know which are which - since Ken (if I 
> understood him rightly) insists that even "DerivedCoreProperties.txt" is 
> normative, and the algorithms used to produce it aren't.

Correct.

> 
> Ken: The "5.0.0/ucd" directory at the Unicode FTP server contains the 
> following files:

> Can you please identify which ones of these are normative, and which 
> ones (if any) are not?
> 
> (It would also be convenient if you could identify a resource for 
> getting that answer without asking you....)

http://www.unicode.org/Public/UNIDATA/UCD.html

See the section in that documentation on UCD Property files.
The third field in the table identifies the status of the
various properties specified in those data files:

Normative (N)
Informative (I)
Provisional (P)

"Normative" in the sense used there reflects the definitions
given in the standard in Section 3.5, pp. 85-87:

D33 Normative property: A Unicode character property used in
    the specification of the standard.

D35 Informative property: A Unicode character property whose
    values are provided for information only.

D36 Provisional property: A Unicode character property whose
    values are unapproved and tentative, and which may be incomplete
    or otherwise not in a usable state.

So Joining_Type and Joining_Group are both normative properties,
because they are used in the specification of Arabic shaping
in the standard. Composition_Exclusion is a normative property,
because it is used in the specification of Unicode
Normalization, and so forth.

For the large number of CJK-related character properties defined
in Unihan.txt, there is a parallel documentation file, which
also specifies for each property whether it is normative,
informative, or provisional:

http://www.unicode.org/Public/UNIDATA/Unihan.html

The UTC gave up some time ago trying to specify for each
data file whether it was "normative" or "informative", in
part because there are data files that contain both
normative and informative property definitions, and even
some of the documentation files give information "used in
the specification of the standard", so should also be
considered normative. The current general take, I think,
is that all files listed in the component listing for
any version of the standard, e.g.:

http://www.unicode.org/versions/components-5.0.0.html

should be considered normative parts of the standard.
Certainly including all of them isn't "optional" when
releasing the standard.

Then there is the separate issue of which source is
*definitive*, when the same information about a property
is represented in more than one way. The answer to
that can also be found in UCD.html, but the
short answer is:

Core Data

   These data files define simple properties, and
   the lists are definitive.

Derived Data

   These data files define derived properties, and
   contain explanations about how the derivations are
   done. The lists are definitive, rather than the
   explanations of the derivations.

Extracted Data

   These are re-presentations of property data by
   value, extracted from some of the core data files.
   In all cases, the lists in the core data files
   themselves are definitive.

Auxiliary Data

   These are a mix of definitions of derived
   properties and some test data. (The reason they are
   sitting in a separate auxiliary directory is that they
   are associated with algorithms in Unicode Standard
   Annexes.) For the data files that define properties
   (GraphemeBreakProperty.txt, WordBreakProperty.txt,
   SentenceBreakProperty.txt), their status is just like
   that of Core Data or Derived Data: the lists are
   definitive, rather than any explanations of the
   derivations involved.

XML Data

   Starting with Unicode 5.1, the UTC will also be releasing
   an XML version of the entire Unicode Character Database.
   The XML version is considered another instance of
   extraction of property data. If it differs in any
   way from the property listings in the Core and Derived
   Data, the lists for the Core and Derived Data prevail.
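
To make the "lists are definitive" rule concrete, here is a minimal
Python sketch. It is not part of the UCD or of any UTC tooling. It
assumes the simple "code point or range ; property name # comment"
line layout used by list files such as DerivedCoreProperties.txt.
The names parse_property_list, resolve, and SAMPLE are hypothetical,
and SAMPLE is an illustrative fragment rather than a copy of any
released data file.

```python
import re

# One list entry: a code point or a "first..last" range, a semicolon,
# then the property name. Comments are stripped before matching.
_ENTRY = re.compile(
    r"^([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*;\s*([^#;\s]+)"
)


def parse_property_list(text):
    """Return {property name: set of code points} for list-style text."""
    props = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comment
        if not line:
            continue  # blank line or comment-only line
        match = _ENTRY.match(line)
        if match is None:
            continue  # not a simple list entry; ignored in this sketch
        first = int(match.group(1), 16)
        last = int(match.group(2), 16) if match.group(2) else first
        props.setdefault(match.group(3), set()).update(range(first, last + 1))
    return props


def resolve(definitive, other):
    """Apply "the definitive list prevails".

    Returns the definitive set unchanged, plus the code points on which
    the other source (a re-derivation, an extracted file, an XML dump)
    disagrees with it.
    """
    return definitive, definitive ^ other


# Illustrative fragment in the list-file layout (hypothetical sample).
SAMPLE = """\
# comment-only line
0041..005A    ; Alphabetic # example range entry
00AA          ; Alphabetic # example single code point entry
"""

listed = parse_property_list(SAMPLE)["Alphabetic"]
rederived = set(range(0x41, 0x5B))  # pretend a derivation missed U+00AA
prevailing, discrepancies = resolve(listed, rederived)
```

In this sketch a disagreement never changes the result: the set read
from the list is what a consumer uses, and the discrepancies are only
something to report.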

The UTC is in the process of transitioning from maintaining
this kind of information in documentation files that
are somewhat "hidden" in the mass of files in the
Unicode Character Database (UCD.html and Unihan.html) to
more structured, visible, and referenceable Unicode
Standard Annexes:

UAX #38, Unicode Han Database (Unihan)
UAX #42, Unicode Character Database in XML
UAX #44, Unicode Character Database

For Unicode 5.1, UAX #44 is just a general framework, and
still refers to UCD.html for the bulk of its information.
But the goal is to elevate all such information into UAX #44
itself in the future.

Given the conversation ongoing here about use of Unicode
property data files, it seems clear to me that one area
where UAX #44 should be expanded and clarified in the future
is specifically to spell out the "what prevails" rules for
lists and derivations when it comes to derived properties
in the UCD.

--Ken