Inclusion Table Update

Wed Jan 31 03:16:43 CET 2007

Following up on further discussion of this topic, I
have updated the draft inclusion table
that I posted for review.

I cleaned out the older drafts of the table from my
public access directory, so it should be less confusing to look there.
The files posted there now are:

http://www.unicode.org/~whistler/SPInclusionList061219.txt
http://www.unicode.org/~whistler/SPInclusionAdd070130.txt
http://www.unicode.org/~whistler/SPInclusion.txt
http://www.unicode.org/~whistler/IDNPermitted.txt

SPInclusionList061219.txt is the same as we had discussed
before. It is the list of all characters that met the
various criteria:
  GeneralCategory(cp) in {Ll, Lo, Lm, Mn, Mc, Nd}
  cp == NFKC(cp)
  not one of the primarily historic scripts
  various other specific exclusions of combining marks, etc.
It omits all the Han characters and Hangul syllables, because
in this format that would be overwhelmingly verbose and
would contribute nothing to reviewing the rest of the content.
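
As a rough illustration of the first two criteria only, here is a
minimal Python sketch. It assumes the Unicode data bundled with the
Python interpreter (not necessarily the version the list was built
from), and the function name passes_basic_criteria is hypothetical.
The historic-script and specific combining-mark exclusions are not
modeled, so this does not reproduce the posted list.

```python
import unicodedata

# General_Category values named in the criteria above.
ALLOWED_GC = {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}


def passes_basic_criteria(cp: int) -> bool:
    """Check only the General_Category and NFKC-stability criteria.

    The script-based and character-specific exclusions are deliberately
    left out; they would need additional data not shown here.
    """
    ch = chr(cp)
    if unicodedata.category(ch) not in ALLOWED_GC:
        return False
    # The code point must be unchanged by NFKC normalization.
    return unicodedata.normalize("NFKC", ch) == ch
```

For example, U+00E0 passes both checks, while "A" (gc=Lu), U+00B7
(gc=Po), and U+FB01 (not stable under NFKC) do not, which is why the
LDH characters and MIDDLE DOT have to be added back separately.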

SPInclusionAdd070130.txt is the small list of specific
exceptions to the *exclusions* -- in other words, the list
of characters that have to be added back in to make the
overall list workable. That consists of:

  The LDH characters: "-" and "A"-"Z"
  Hebrew geresh and gershayim
  ZWJ and ZWNJ
  And, for argument's sake for the moment, U+00B7 MIDDLE DOT.

SPInclusion.txt is the result of concatenating the two files
and sorting by code point, i.e.:

cat SPInclusionList061219.txt SPInclusionAdd070130.txt | sort > SPInclusion.txt

So *that* file is the file to review if you just want to see the
complete list of all characters being proposed for the
inclusion table (other than the Han and Hangul, of course),
in an easily reviewable format, with the General_Category and Script
property values listed for each character, along with their names.
So typical entries are:

000B7 gc=Po sc=Zyyy MIDDLE DOT
000E0 gc=Ll sc=Latn LATIN SMALL LETTER A WITH GRAVE
000E1 gc=Ll sc=Latn LATIN SMALL LETTER A WITH ACUTE
...
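
For anyone who wants to process these entries mechanically, a small
Python sketch follows. The field layout (hex code point, gc=, sc=,
then the name, separated by whitespace) is inferred from the three
sample lines above only; the Entry type and parse_entry function are
hypothetical names, not part of any posted tool.

```python
import re
from typing import NamedTuple


class Entry(NamedTuple):
    cp: int    # code point as an integer
    gc: str    # General_Category value, e.g. "Ll"
    sc: str    # Script value, e.g. "Latn"
    name: str  # character name


# Layout assumed from the sample entries: "000E0 gc=Ll sc=Latn NAME".
ENTRY_RE = re.compile(r"^([0-9A-Fa-f]{4,6})\s+gc=(\w{2})\s+sc=(\w{4})\s+(.+)$")


def parse_entry(line: str) -> Entry:
    """Parse one reviewable-format line into an Entry."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized entry: {line!r}")
    return Entry(int(m.group(1), 16), m.group(2), m.group(3), m.group(4))
```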

O.k., now the *new* thing to note is IDNPermitted.txt.
That is a processed form of SPInclusion.txt that reformats
all the data into a standard format Unicode character property
data file, using code point ranges where possible to compress
the expression of the data. It has a standard format header,
which will be familiar to users of other Unicode character
property data files, and it also includes all the relevant
Han and Hangul ranges, so it is *complete*. Typical
entries look like:

00B7         ; IDN_Permitted # Po         MIDDLE DOT
00E0..00F6   ; IDN_Permitted # Ll    [23] LATIN SMALL LETTER A WITH GRAVE..LATIN SMALL LETTER O WITH DIAERESIS

Here the first field is a code point (or code point range), the
second field is a property name, and the rest is a comment, including
General Category, count, and names for ease of interpretation
of the code points.
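
A minimal Python sketch of a parser for lines in this format is shown
here. It assumes only what is described above: "#" starts a comment,
";" separates the fields, and ".." separates the two ends of a range.
The function name parse_property_file is hypothetical, and the default
property name is the draft one, which may change.

```python
from typing import Iterable, Iterator, Tuple


def parse_property_file(
    lines: Iterable[str], prop: str = "IDN_Permitted"
) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (first, last) code point ranges that carry prop."""
    for raw in lines:
        # Everything after '#' is a comment, including header lines.
        data = raw.split("#", 1)[0].strip()
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) != 2:
            raise ValueError(f"malformed line: {raw!r}")
        if fields[1] != prop:
            continue
        # A single code point has no '..', so 'last' comes back empty.
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        yield start, end
```

Fed the two sample entries above, this yields (0x00B7, 0x00B7) and
(0x00E0, 0x00F6).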

Nothing is set in stone here yet -- not the name of the file or
the name of the property or anything else.

However, I want people to look at IDNPermitted.txt to see the
concrete example of exactly what I have been talking about --
a file that simply gives the complete values for a binary
IDN_Permitted property. All code points in the file have the
property; all code points not listed in the file do not.
End of story.

This should be very, very easy to parse. And this kind of
property information can then be compacted into quite small data
structures for runtime table implementations. It is up
to implementers to decide which techniques they want to
use for that, but the *expression* of the table and the *parsing*
of the table should be absolutely clear, unambiguous,
and easy to test.
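
As one possible illustration of a compact runtime structure (just one
technique among many, not a recommendation), the ranges can be kept in
sorted arrays and searched by bisection. The RangeSet class below is a
hypothetical Python sketch; the ranges in the usage note are the two
sample entries, not the full table.

```python
from bisect import bisect_right
from typing import Iterable, List, Tuple


class RangeSet:
    """Binary property stored as sorted, non-overlapping inclusive ranges."""

    def __init__(self, ranges: Iterable[Tuple[int, int]]) -> None:
        merged: List[List[int]] = []
        for start, end in sorted(ranges):
            # Merge ranges that overlap or are directly adjacent.
            if merged and start <= merged[-1][1] + 1:
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        self._starts = [s for s, _ in merged]
        self._ends = [e for _, e in merged]

    def __contains__(self, cp: int) -> bool:
        # Find the last range whose start is <= cp, then check its end.
        i = bisect_right(self._starts, cp) - 1
        return i >= 0 and cp <= self._ends[i]
```

With permitted = RangeSet([(0x00B7, 0x00B7), (0x00E0, 0x00F6)]), the
test 0x00E1 in permitted is true and 0x00F7 in permitted is false.
Lookup cost grows with the logarithm of the number of ranges, not the
number of code points.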

--Ken