What rules have been used for the current list of codepoints?

Kenneth Whistler kenw at sybase.com
Thu Dec 14 01:14:59 CET 2006


> --On 13 December 2006 08:45 -0800 Kenneth Whistler <kenw at sybase.com> wrote:
> 
> > I have no idea how consensus on this list is measured, but
> > *I* am absolutely sure that Lm and Nd need to be added. In
> > fact, using the formulation you are using here for rules,
> > the whole list of rules should be reconstructed as:
> >
> > 1. If class is [Ll, Lm, Lo, Mn, Mc, Nd], the code point is ok
> > 2. If NFKC(cp) != cp, the code point is not ok
> > 3. If lowercase(cp) != cp, the code point is not ok
> >
> > And that is pretty much exactly what I stated in the November 30
> > contribution.
> >
> 
> Since a character can match rule 1 and also match rule 2 or 3, you have to 
> apply rule 1 last.

O.k., but I don't think a cascading set of matching rules is
the way to do this. 

If you want more formalism, adopt the scheme that Mark suggested,
which is more definitive and unambiguously builds up a set
definition.

> 
> Your list includes LATIN SMALL LETTER SHARP S - I thought that was unstable 
> under NFKC+casemap?

NFKC ( U+00DF ) = U+00DF

lowercase ( U+00DF ) = U+00DF

What *Mark* suggested is stability under casefold(cp), which is
slightly different and a stronger constraint.

casefold ( U+00DF ) = <U+0073, U+0073>

So under *that* criterion, U+00DF would be omitted from the list.
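
For anyone who wants to verify this, here is a quick sketch in
Python (assuming Python 3, where str.casefold() implements Unicode
full case folding):

    import unicodedata

    ch = "\u00DF"                                   # LATIN SMALL LETTER SHARP S
    print(unicodedata.normalize("NFKC", ch) == ch)  # True: stable under NFKC
    print(ch.lower() == ch)                         # True: stable under lowercase
    print(ch.casefold())                            # "ss": NOT stable under casefold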

The difference results in the omission of the following additional
characters:

00DF sharp s

5 Latin characters that are decomposable but have no matching
precomposed uppercase characters: 01F0, 1E96..1E99.

2 lowercase Greek vowels with diaeresis and acute accents: 0390, 03B0

The combining Greek subscript iota: 0345

And a slew of precomposed polytonic Greek letters that have
anomalous casing issues: 1F50, 1F52, ... 1FF6, 1FF7.

Assuming that we should start from Mark's suggestion, which I
concur with, the initial list gets shorter. I have
posted it as:

http://www.unicode.org/~whistler/SPLlLoLmMnMcNdStableCasefoldNFKC.txt

so you can compare. 59 fewer characters are eligible by those
criteria.


> And personally, I think a rule that permits COMBINING ALMOST EQUAL TO ABOVE 
> and MUSICAL SYMBOL COMBINING TREMOLO-1 is useless for this exercise.

By such considerations, a rule (your #1) that omits schwa could also
be characterized as "useless" for this exercise.

I did not say that I thought SPLlLoLmMnMcNdStableCaseNFKC.txt
(or now SPLlLoLmMnMcNdStableCasefoldNFKC.txt) was the definitive
final list. I said it was the set of candidates that you get
by applying the General_Category (class), NFKC stability,
and lowercase (now casefolding) stability criteria.

My November 30 contribution then went on to specify the list of
historic scripts (meaning *archaic* scripts, or old scripts with
no current usage or relevance outside scholarly contexts) that
could safely be omitted, to wit:

<quote>
The list could easily be
pared down, for example, by dropping cuneiform scripts:

sc=Xsux  (Sumero-Akkadian cuneiform)
sc=Ugar  (Ugaritic cuneiform)
sc=Xpeo  (Old Persian cuneiform)

other archaic alphabets and syllabaries:

sc=Goth  (Gothic alphabet)
sc=Ital  (Old Italic alphabet)
sc=Cprt  (Cypriot syllabary)
sc=Linb  (Linear-B syllabary)
sc=Phnx  (Phoenician alphabet)
sc=Khar  (Kharoshthi abjad)
sc=Phag  (Phags-pa alphabet)
sc=Glag  (Glagolitic alphabet)

and conscripts with no current usage:

sc=Shaw  (Shavian conscript alphabet)
sc=Dsrt  (Deseret conscript alphabet)
</quote>

Since "could be" seems to get nowhere in this discussion, I'll
change my rhetoric to MUST be, and post the resulting list
precalculated:

http://www.unicode.org/~whistler/SPetcetcHistoricRemoved.txt

That pares down the candidates by another 1532 characters.

Next, I happen to *agree* with you that since we aren't
suggesting that any symbol characters be included in this
list, it doesn't make any sense to include combining marks
whose only felicitous use is on symbols and which never occur
in orthographies for languages. In *this* case one could
specify the omission by block, but it is just as clear to
simply omit these by ranges. In particular:

Omit combining marks in the ranges:

20D0..20EF
1D100..1D1FF
1D200..1D24F

That would catch any possible future additions to those
ranges, which we can be sure wouldn't be combining marks
intended for orthographies.
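
To make the range test concrete, a minimal Python sketch (the
function name is mine, purely for illustration):

    def in_symbol_combining_ranges(cp):
        # Combining marks for symbols (20D0..20EF), plus the
        # Musical Symbols and Ancient Greek Musical Notation ranges.
        return (0x20D0 <= cp <= 0x20EF
                or 0x1D100 <= cp <= 0x1D1FF
                or 0x1D200 <= cp <= 0x1D24F)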

So those combining marks MUST be omitted, and the continuing
rolled-up summary is posted at:

http://www.unicode.org/~whistler/SPetcetcSymCombiningRemoved.txt

That removes another 56 characters.

O.k., I'll pause at this point to roll up where I think this stands.
Using Mark's formalism, the rules applied so far are:

0. Start with the empty set.
1. If generalCategory(cp) is [Ll, Lo, Lm, Mn, Mc, Nd], add cp
2. If NFKC(cp) != cp, remove cp
3. If casefold(cp) != cp, remove cp
4. If script(cp) is [Xsux, Ugar, Xpeo, Goth, Ital, Cprt,
       Linb, Phnx, Khar, Phag, Glag, Shaw, Dsrt], remove cp
5. If cp in {20D0..20EF} or cp in {1D100..1D1FF} or
       cp in {1D200..1D24F}, remove cp
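
Purely as an illustrative sketch (nothing normative), here is
roughly what that set construction looks like in Python. It
assumes the third-party regex module for the Script property
(stdlib unicodedata has no script lookup); the long script names
are the UCD aliases for the short codes in rule 4, and the
HISTORIC and eligible names are mine:

    import sys
    import unicodedata
    import regex  # third-party module; supports \p{Script=...}

    # UCD long names for Xsux, Ugar, Xpeo, Goth, Ital, Cprt,
    # Linb, Phnx, Khar, Phag, Glag, Shaw, Dsrt.
    HISTORIC = ("Cuneiform", "Ugaritic", "Old_Persian", "Gothic",
                "Old_Italic", "Cypriot", "Linear_B", "Phoenician",
                "Kharoshthi", "Phags_Pa", "Glagolitic", "Shavian",
                "Deseret")
    HISTORIC_RE = regex.compile(
        "[" + "".join(r"\p{Script=%s}" % s for s in HISTORIC) + "]")

    def eligible(cp):
        ch = chr(cp)
        # Rule 1: keep only the listed General_Category values.
        if unicodedata.category(ch) not in ("Ll", "Lo", "Lm",
                                            "Mn", "Mc", "Nd"):
            return False
        # Rule 2: must be stable under NFKC.
        if unicodedata.normalize("NFKC", ch) != ch:
            return False
        # Rule 3: must be stable under full casefolding.
        if ch.casefold() != ch:
            return False
        # Rule 4: drop the historic scripts.
        if HISTORIC_RE.match(ch):
            return False
        # Rule 5: drop combining marks for symbols and musical notation.
        if (0x20D0 <= cp <= 0x20EF or 0x1D100 <= cp <= 0x1D1FF
                or 0x1D200 <= cp <= 0x1D24F):
            return False
        return True

    # Rule 0: start from the empty set and build it up.
    candidates = {cp for cp in range(sys.maxunicode + 1) if eligible(cp)}

Which UCD version you get depends on the Python build, so the
counts may differ slightly from the lists posted above, but the
construction is the same.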

Next, I expect people to examine SPetcetcSymCombiningRemoved.txt
to suggest a few more removal criteria. In my original note,
for example, I suggested that it would be a good idea to
remove Hebrew and Arabic annotation marks, which are only
for marking religious texts for chanting and such, and are not
part of the orthographies. But fairly soon we get to the point
where the diminishing utility of dropping more characters for
particular reasons starts to meet the rising curve of end users'
expectations about what characters they can use from "their"
scripts. So the more ad hoc decisions we make to strike
characters by function after this point, the harder it gets to
defend those decisions as based on clearly statable,
easy-to-understand criteria.

Finally we get to:

n. If cp in [A-Z], include cp

--Ken



