Table Updates

Fri Jun 8 00:27:46 CEST 2007

Folks,

After some discussion about the inadvisability of including
too many historic scripts of marginal current usage in
the IDNA Permitted category, I've regenerated the tables
to remove several more scripts from the suggested list of
IDN permitted characters.

My drafts are posted at:

http://www.unicode.org/~whistler/

In particular, the full list of characters permitted by generic
rule, minus those excluded by the list of scripts, is updated to:

http://www.unicode.org/~whistler/SPInclusionList070607.txt

The exception list of additions (Latin uppercase, middle
dot, etc.) is unchanged at:

http://www.unicode.org/~whistler/SPInclusionAdd070308.txt

The result of concatenating those two lists and sorting
by code point is:

http://www.unicode.org/~whistler/SPInclusion.txt
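
As a rough illustration of that concatenate-and-sort step, here is a
minimal Python sketch. The line format it parses (an uppercase hex code
point or an XXXX..YYYY range at the start of each line, followed by
anything else) is an assumption and has not been checked against the
posted files; parse_entries and merge_sorted are hypothetical names.

```python
import re

# Assumed (hypothetical) line format: "0061..007A ; comment" or "00B7 ; comment".
# Lines that do not start with an uppercase hex code point are ignored.
_ENTRY = re.compile(r"^\s*([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\b")

def parse_entries(lines):
    """Return (first, last) code point pairs read from the start of each line."""
    entries = []
    for line in lines:
        m = _ENTRY.match(line)
        if m:
            first = int(m.group(1), 16)
            last = int(m.group(2), 16) if m.group(2) else first
            entries.append((first, last))
    return entries

def merge_sorted(generic_lines, exception_lines):
    """Concatenate both lists and sort the entries by code point."""
    return sorted(parse_entries(generic_lines) + parse_entries(exception_lines))
```

Feeding it the lines of the generic list and of the exception list
yields one list of ranges ordered by starting code point.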

The same information, expressed as a draft of a property
definition file, is:

http://www.unicode.org/~whistler/IDNPermitted.txt

The same property information, reformatted as a bit
array, to show how the binary property might be implemented
(and hinting at how a property table could be significantly
compressed) is:

http://www.unicode.org/~whistler/BitArray.txt
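
To make the bit array idea concrete, here is a small Python sketch of
one way a binary property could be stored at one bit per code point and
then compressed by sharing identical 256-code-point blocks. This is an
illustration only: the block size, the function names, and the
two-stage layout are assumptions, not a description of what
BitArray.txt actually contains.

```python
MAX_CP = 0x10FFFF
BLOCK_BYTES = 256 // 8  # one block covers 256 code points (assumed size)

def build_bit_array(ranges):
    """Set one bit per permitted code point, given (first, last) ranges."""
    bits = bytearray((MAX_CP + 1) // 8)
    for first, last in ranges:
        for cp in range(first, last + 1):
            bits[cp >> 3] |= 1 << (cp & 7)
    return bits

def is_permitted(bits, cp):
    """Test the bit for a single code point in the flat array."""
    return bool(bits[cp >> 3] & (1 << (cp & 7)))

def compress(bits):
    """Two-stage table: an index into the list of distinct blocks."""
    blocks, index, seen = [], [], {}
    for off in range(0, len(bits), BLOCK_BYTES):
        block = bytes(bits[off:off + BLOCK_BYTES])
        if block not in seen:
            seen[block] = len(blocks)
            blocks.append(block)
        index.append(seen[block])
    return index, blocks

def is_permitted_compressed(index, blocks, cp):
    """Same lookup as is_permitted, but through the two-stage table."""
    block = blocks[index[cp >> 8]]
    return bool(block[(cp & 0xFF) >> 3] & (1 << (cp & 7)))
```

Because most blocks of the code space are entirely zero, they all
collapse to a single shared block, which is where the compression
comes from.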

The IDNNever.txt draft property definition file is unchanged
from the last time I posted an update.

The difference between this update and the previous one is
that more scripts have been removed from the SPInclusion list
(and hence also from the IDNPermitted property). The
scripts to remove were chosen based on an assessment that
significant current usage is unlikely outside specialized
contexts such as liturgical texts, historic literature,
academic studies, and the like.

The exact list of scripts removed for this draft is:

  Buginese, Coptic, Hanunoo, Runic, Syloti Nagri, Syriac,
  Tagbanwa, and Tagalog.

Note that this is the Tagalog *script*, not the Tagalog language,
which is almost exclusively written in the Latin script.

I would also suggest removing the Khutsuri variants of
the Georgian script, since they are also entirely limited
to liturgical use. Modern Georgian makes almost exclusive
use of the caseless Mkhedruli variant of the Georgian
script. However, I haven't made that change yet, since
it would require splitting the script by ranges:

Khutsuri: 2D00..2D25 (note: 10A0..10C5 is already excluded
                      from the list as uppercase forms)

Mkhedruli: 10D0..10FA

Both of those are sc=Geor, so an exclusion based on a script
value would be insufficient to distinguish these two.
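
The point about script values versus ranges can be sketched in Python
as follows. The script_of function is a toy stand-in for a Script
property lookup that knows only the two ranges quoted above, and all
function names are hypothetical.

```python
# Ranges quoted in the message; both carry sc=Geor.
KHUTSURI = (0x2D00, 0x2D25)
MKHEDRULI = (0x10D0, 0x10FA)

def script_of(cp):
    """Toy Script property lookup (assumption: covers only these two ranges)."""
    for first, last in (KHUTSURI, MKHEDRULI):
        if first <= cp <= last:
            return "Geor"
    return "Zzzz"  # ISO 15924 code for an unknown script

def permitted_by_script(cp, excluded_scripts):
    """Exclusion keyed on the script value only."""
    return script_of(cp) not in excluded_scripts

def permitted_by_range(cp, excluded_ranges):
    """Exclusion keyed on explicit code point ranges."""
    return not any(first <= cp <= last for first, last in excluded_ranges)
```

Excluding "Geor" drops Mkhedruli along with Khutsuri, while excluding
only the KHUTSURI range keeps 10D0..10FA permitted.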

--Ken