What rules have been used for the current list of codepoints?
Kenneth Whistler
kenw at sybase.com
Fri Dec 15 00:31:00 CET 2006
> At 14:32 -0800 2006-12-14, Kenneth Whistler wrote:
> >To keep up with Mark, I've updated my own table and posted
> >as:
> >
> >http://www.unicode.org/~whistler/SPInclusionList061214.txt
>
> Lovely.
>
> So we can be clear, on what basis did you draw up this table?
It is based on the rules that Mark just summarized for
Patrik. To be perfectly pedantic ;-), those are, once again:
0. Start with the empty set. For each code point cp from 0 to 0x10FFFF:
1. If generalCategory(cp) is in {Ll, Lu, Lo, Lm, Mn, Mc, Nd}, add cp
2. If NFKC(cp) != cp, remove cp
3. If casefold(cp) != cp, remove cp
4. If defaultIgnorableCodePoint(cp), remove cp
5. If script(cp) in {Xsux, Ugar, Xpeo, Goth, Ital, Cprt, Linb, Phnx, Khar,
   Phag, Glag, Shaw, Dsrt, Runr}, remove cp
6. If block(cp) in {Combining_Diacritical_Marks_for_Symbols,
   Musical_Symbols, Ancient_Greek_Musical_Notation}, remove cp
N. If cp is in [-A-Z0-9], add cp
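
The steps above can be sketched in Python roughly as follows. This is only an illustration, not the script used to produce either posted table. The standard unicodedata module covers general category and NFKC, and str.casefold() stands in for casefold(cp); it does not expose script, block, or Default_Ignorable_Code_Point, so those three lookups are hypothetical caller-supplied functions (script_of, block_of, is_default_ignorable). The result also reflects whatever Unicode version the interpreter ships with, so it will not match a 2006 table exactly.

```python
import unicodedata

# Step 1: general categories that are admitted.
ALLOWED_CATEGORIES = {"Ll", "Lu", "Lo", "Lm", "Mn", "Mc", "Nd"}

# Step 5: scripts that are removed (ISO 15924 codes, as in the rules).
EXCLUDED_SCRIPTS = {
    "Xsux", "Ugar", "Xpeo", "Goth", "Ital", "Cprt", "Linb",
    "Phnx", "Khar", "Phag", "Glag", "Shaw", "Dsrt", "Runr",
}

# Step 6: blocks that are removed.
EXCLUDED_BLOCKS = {
    "Combining_Diacritical_Marks_for_Symbols",
    "Musical_Symbols",
    "Ancient_Greek_Musical_Notation",
}

# Step N: [-A-Z0-9], added back at the end.
GRANDFATHERED = set("-ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")


def inclusion_set(script_of, block_of, is_default_ignorable,
                  code_points=range(0x110000), add_grandfathered=True):
    """Return the set of included code points (as ints).

    script_of, block_of and is_default_ignorable are hypothetical
    callables taking an int code point; they are assumptions of this
    sketch and must be backed by real Unicode property data.
    """
    included = set()
    for cp in code_points:
        ch = chr(cp)
        # Step 1: only these categories are ever added.
        if unicodedata.category(ch) not in ALLOWED_CATEGORIES:
            continue
        # Step 2: drop anything NFKC would change.
        if unicodedata.normalize("NFKC", ch) != ch:
            continue
        # Step 3: drop anything case folding would change.
        if ch.casefold() != ch:
            continue
        # Step 4: drop default ignorable code points.
        if is_default_ignorable(cp):
            continue
        # Steps 5 and 6: drop excluded scripts and blocks.
        if script_of(cp) in EXCLUDED_SCRIPTS:
            continue
        if block_of(cp) in EXCLUDED_BLOCKS:
            continue
        included.add(cp)
    # Step N: grandfather hyphen, capital A-Z and digits back in.
    if add_grandfathered:
        included |= {ord(c) for c in GRANDFATHERED}
    return included
```

Filtering in a single pass gives the same set as adding in step 1 and removing in steps 2 to 6, since the removal steps only ever apply to code points that step 1 admitted. Passing add_grandfathered=False skips step N entirely.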
One (intended) difference in my table is that I didn't bother adding capital A-Z
back in at the end. We all understand that they get grandfathered
in for input. The other thing to note is that I am assuming
all of Han and all Hangul syllables are in the inclusion list,
but those are omitted to keep the list short and focussed on
the issues at hand.
I haven't done the formal diff yet between my file and Mark's
posted file, to determine if I have made any mistakes in
paring down my list to match his additional criteria.
--Ken