How to know what codepoints are unassigned

Kenneth Whistler kenw at sybase.com
Mon May 5 20:47:20 CEST 2008


John said:

> This means that
> 
> 	* Non-character and reserved code points that have
> 	nothing specifically assigned to them are UNASSIGNED.
> 	
> 	* Non-character code points that have specific
> 	non-characters assigned to them are DISALLOWED (unless
> 	they are exceptions), but by other rules.

Almost but not quite.

Reserved code points are unassigned code points, and should
be UNASSIGNED.

Noncharacter code points are "assigned" code points -- or
in the terminology I prefer, they are *designated* code points,
meaning their function has been designated by the standard
(as other than reserved). Noncharacter code points will *NEVER*
have abstract characters associated with them by the standard,
and thus will never be assigned *characters*. As Frank pointed
out, the Noncharacter_Code_Point property is *immutable*. No
existing noncharacter will ever change to anything else, nor
will any ordinary reserved code point ever be designated
as an additional noncharacter.

So, clarifying what I suggested before:

1. Reserved (unassigned) code points, i.e. gc=Cn minus the noncharacters --> UNASSIGNED.

2. U+200C (ZWNJ), U+200D (ZWJ) --> CONTEXTJ.

3. Noncharacters, surrogates (gc=Cs), controls (gc=Cc),
   private-use (gc=Co), other format (gc=Cf) --> DISALLOWED.
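
As a rough illustration only, the three rules above could be sketched in
Python along these lines. The function names are hypothetical, the
result depends on which Unicode version the interpreter's unicodedata
module ships with, and the noncharacter test assumes the usual set
(U+FDD0..U+FDEF plus the last two code points of every plane). Anything
not caught by these rules returns None, meaning "still to be
categorized".

```python
import unicodedata

def is_noncharacter(cp: int) -> bool:
    # Assumed noncharacter set: U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF
    # at the end of each of the 17 planes.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def preliminary_category(cp: int):
    """Apply the three preliminary rules; return None if none applies."""
    gc = unicodedata.category(chr(cp))

    # Rule 2: ZWNJ and ZWJ are gc=Cf, so test them before the Cf rule.
    if cp in (0x200C, 0x200D):
        return "CONTEXTJ"

    # Rule 3 (first part): noncharacters report gc=Cn, so test them
    # before the reserved/unassigned rule.
    if is_noncharacter(cp):
        return "DISALLOWED"

    # Rule 1: what remains of gc=Cn is reserved, i.e. unassigned.
    if gc == "Cn":
        return "UNASSIGNED"

    # Rule 3 (rest): surrogates, controls, private-use, format.
    if gc in ("Cs", "Cc", "Co", "Cf"):
        return "DISALLOWED"

    # Everything else is left for the later, detailed rules.
    return None
```

For example, preliminary_category(0xFFFE) gives "DISALLOWED" and
preliminary_category(0x200D) gives "CONTEXTJ", while an ordinary letter
such as U+0061 falls through and returns None.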
   
Put those few rules first and depend on them. Then you can start
figuring out how to categorize the Graphic characters that
everybody cares about, which cover all the ordinary,
visible characters.

--Ken



