Reserved general punctuation

Kenneth Whistler kenw at sybase.com
Thu May 1 20:41:09 CEST 2008


Patrik asked:
 
> > In Unicode, what we've been referring to as "unassigned" (more
> > precisely gc=Cn) means that a code point (from 0 to 10FFFF) is not
> > assigned **to a character**.
> 
> In what file of the Unicode distribution can I find every codepoint
> that has gc=Cn?

http://www.unicode.org/Public/UNIDATA/extracted/DerivedGeneralCategory.txt

Right at the top of that file, in fact.

Also, if you examine the listing carefully, you will see that
while most of the gc=Cn code points are "reserved", all
of the noncharacters are also in the list. For example:

FFEF..FFF8  ; Cn #  [10] <reserved-FFEF>..<reserved-FFF8>

but

FFFE..FFFF  ; Cn #   [2] <noncharacter-FFFE>..<noncharacter-FFFF>
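
For anyone who wants to pull the gc=Cn ranges out of that file mechanically, here is a minimal Python sketch. It assumes the "first..last ; gc # comment" line shape shown above; the names LINE_RE, parse_ranges and SAMPLE_DGC are made up for illustration, and the sample lines stand in for the real file contents.

```python
import re

# A data line is a code point or a "first..last" range, then ";" and
# the General_Category value. Anything after "#" is a comment.
LINE_RE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")

def parse_ranges(lines, wanted="Cn"):
    """Yield inclusive (first, last) ranges whose gc equals `wanted`."""
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # comment line, blank line, or unrecognized shape
        first = int(m.group(1), 16)
        last = int(m.group(2), 16) if m.group(2) else first
        if m.group(3) == wanted:
            yield first, last

# Sample input only; in practice, pass the lines of
# DerivedGeneralCategory.txt instead.
SAMPLE_DGC = [
    "# comment lines are skipped",
    "FFEF..FFF8  ; Cn #  [10] <reserved-FFEF>..<reserved-FFF8>",
    "FFFE..FFFF  ; Cn #   [2] <noncharacter-FFFE>..<noncharacter-FFFF>",
    "0041..005A    ; Lu #  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
]

cn_ranges = list(parse_ranges(SAMPLE_DGC))
cn_count = sum(last - first + 1 for first, last in cn_ranges)
```

Note that the sketch does not distinguish reserved code points from noncharacters; both come back as gc=Cn, which is the point being made above.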

The place to get the *concise* listing of all the noncharacters
is:

http://www.unicode.org/Public/UNIDATA/PropList.txt

and search down for "Noncharacter_Code_Point".
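
The set of noncharacters is fixed by the standard: U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes. So, as a rough cross-check against the Noncharacter_Code_Point listing, the set can also be computed directly. The function name below is just for illustration.

```python
def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and for any code point ending in FFFE or FFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

# Walk the whole code space, 0 to 10FFFF inclusive.
noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
```

That gives 32 code points in the FDD0 block plus 2 per plane, 66 in all.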

> 
> Is that the same as the codepoints that are missing from  
> UnicodeData.txt? (I know about the "first", "last" issues...)

Correct. No gc=Cn code points are listed in UnicodeData.txt.
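
To test whether a code point is "missing" from UnicodeData.txt, the "First"/"Last" pairs have to be expanded into ranges first. A minimal Python sketch follows, assuming the usual semicolon-separated fields (code point, name, General_Category, ...). The helper names are hypothetical and the three sample lines stand in for the real file.

```python
def listed_ranges(lines):
    """Yield (first, last, gc) per entry, joining <..., First>/<..., Last> pairs."""
    pending = None  # start of a range whose "Last" line has not been seen yet
    for line in lines:
        fields = line.split(";")
        if len(fields) < 3:
            continue  # not a data line
        cp, name, gc = int(fields[0], 16), fields[1], fields[2]
        if name.endswith(", First>"):
            pending = cp
        elif name.endswith(", Last>") and pending is not None:
            yield pending, cp, gc
            pending = None
        else:
            yield cp, cp, gc

def is_listed(cp, ranges):
    """True if cp falls inside any listed entry or First/Last range."""
    return any(first <= cp <= last for first, last, _ in ranges)

# Sample input only; in practice, pass the lines of UnicodeData.txt.
SAMPLE_UD = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;",
    "D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;",
]

ud_ranges = list(listed_ranges(SAMPLE_UD))
```

With the full file loaded, any code point from 0 to 10FFFF for which is_listed() returns False should be gc=Cn.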

--Ken


