What rules have been used for the current list of codepoints?

Kenneth Whistler kenw at sybase.com
Fri Dec 15 00:31:00 CET 2006


> At 14:32 -0800 2006-12-14, Kenneth Whistler wrote:
> >To keep up with Mark, I've updated my own table and posted
> >as:
> >
> >http://www.unicode.org/~whistler/SPInclusionList061214.txt
> 
> Lovely.
> 
> So we can be clear, on what basis did you draw up this table?

It is based on the rules that Mark just summarized for
Patrik. To be perfectly pedantic ;-), those are, once again:

0. Start with the empty set. For each code point cp from 0 to 0x10FFFF:
1. If generalCategory(cp) is in {Ll, Lu, Lo, Lm, Mn, Mc, Nd}, add cp
2. If NFKC(cp) != cp, remove cp
3. If casefold(cp) != cp, remove cp
4. If defaultIgnorableCodePoint(cp), remove cp
5. If script(cp) in {Xsux, Ugar, Xpeo, Goth, Ital, Cprt, Linb, Phnx, Khar,
Phag, Glag, Shaw, Dsrt, Runr}, remove cp
6. If block(cp) in {Combining_Diacritical_Marks_for_Symbols,
Musical_Symbols, Ancient_Greek_Musical_Notation}, remove cp
N. If cp is in [-A-Z0-9], add cp
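
The rules above can be sketched in Python as a single membership test. This is only an illustration, not the script used to generate either posted table. Python's standard unicodedata module provides the General_Category and NFKC lookups, and str.casefold() stands in for casefold(); it has no Script, Block, or Default_Ignorable_Code_Point lookups, so script_of, block_of, and is_default_ignorable are hypothetical callables that the caller has to supply. The defaults match nothing, so steps 4 to 6 are no-ops unless real lookups are passed in. Results also depend on the Unicode version bundled with the Python build, which is later than the one current in 2006.

```python
import unicodedata

# Step 1: general categories that are added to the set.
ALLOWED_CATEGORIES = {"Ll", "Lu", "Lo", "Lm", "Mn", "Mc", "Nd"}

# Step 5: scripts that are removed.
EXCLUDED_SCRIPTS = {
    "Xsux", "Ugar", "Xpeo", "Goth", "Ital", "Cprt", "Linb",
    "Phnx", "Khar", "Phag", "Glag", "Shaw", "Dsrt", "Runr",
}

# Step 6: blocks that are removed.
EXCLUDED_BLOCKS = {
    "Combining_Diacritical_Marks_for_Symbols",
    "Musical_Symbols",
    "Ancient_Greek_Musical_Notation",
}


def in_inclusion_list(
    cp,
    script_of=lambda cp: None,              # hypothetical Script lookup
    block_of=lambda cp: None,               # hypothetical Block lookup
    is_default_ignorable=lambda cp: False,  # hypothetical property lookup
):
    """Return True if code point cp survives rules 0 to N."""
    ch = chr(cp)
    # Step N: hyphen, A-Z and 0-9 are added back unconditionally.
    if ch == "-" or "A" <= ch <= "Z" or "0" <= ch <= "9":
        return True
    # Step 1: only the listed general categories are added.
    if unicodedata.category(ch) not in ALLOWED_CATEGORIES:
        return False
    # Step 2: remove anything that NFKC changes.
    if unicodedata.normalize("NFKC", ch) != ch:
        return False
    # Step 3: remove anything that case folding changes.
    if ch.casefold() != ch:
        return False
    # Step 4: remove default ignorable code points.
    if is_default_ignorable(cp):
        return False
    # Step 5: remove the excluded scripts.
    if script_of(cp) in EXCLUDED_SCRIPTS:
        return False
    # Step 6: remove the excluded blocks.
    if block_of(cp) in EXCLUDED_BLOCKS:
        return False
    return True


def build_inclusion_set(**lookups):
    """Step 0: walk every code point from 0 to 0x10FFFF."""
    return {cp for cp in range(0x110000) if in_inclusion_list(cp, **lookups)}
```

For example, in_inclusion_list(ord("a")) is true, while U+00DF (sharp s) is rejected at step 3 because it case-folds to "ss". Testing step N first gives the same result as adding those characters back at the end, since nothing is removed after step N.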

One (intended) difference is that I didn't bother adding capital A-Z
back in at the end. We all understand that they get grandfathered
in for input. The other thing to note is that I am assuming
all of Han and all Hangul syllables are in the inclusion list,
but I have omitted them from the posted table to keep it short
and focussed on the issues at hand.

I haven't done the formal diff yet between my file and Mark's
posted file, to determine if I have made any mistakes in
paring down my list to match his additional criteria.

--Ken


