How to know what codepoints are unassigned

Patrik Fältström patrik at frobbit.se
Sat May 3 05:22:30 CEST 2008


What confuses me is Table 2-3 on page 27 of Unicode 5.0.0
(http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf), which to me uses
slightly different terminology from what you use here. According to that
table, some gc=Cn code points are actually assigned code points.

The table makes the following definitions:

Cn+Cs = Not assigned to abstract character

Cn-Noncharacter = Undesignated (unassigned) code point

Note that the only place where "unassigned" appears is clearly NOT for Cn
as a whole, but for a subset of Cn.

In the work I do, I will set the UNASSIGNED derived property to
Cn-Noncharacter (that is, Cn minus the Noncharacters). The Noncharacters
stay DISALLOWED.
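
The rule above (UNASSIGNED = Cn minus Noncharacter, Noncharacters DISALLOWED)
could be sketched roughly as follows. This is only an illustration, not the
actual IDNA table-generation code: the function names and the "NOT_CN" label
are made up for the example, and the result for any given code point depends
on the Unicode version bundled with the Python interpreter. The noncharacter
test relies on the fixed set of 66 noncharacters (U+FDD0..U+FDEF plus the
last two code points of every plane).

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for U+nFFFE / U+nFFFF in every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def classify(cp: int) -> str:
    """Hypothetical helper: split gc=Cn into UNASSIGNED and DISALLOWED."""
    if unicodedata.category(chr(cp)) != "Cn":
        # Assigned characters, surrogates (Cs), private use (Co), etc.
        return "NOT_CN"
    if is_noncharacter(cp):
        # gc=Cn, but the code point itself is designated: a noncharacter.
        return "DISALLOWED"
    # gc=Cn and not a noncharacter: undesignated (unassigned) code point.
    return "UNASSIGNED"


if __name__ == "__main__":
    for cp in (0x0041, 0xD800, 0xFDD0, 0xFFFE, 0x10FFFF):
        print(f"U+{cp:04X} gc={unicodedata.category(chr(cp))} {classify(cp)}")
```

So a plain gc=Cn test is not enough on its own; the noncharacters have to be
subtracted before anything is called unassigned.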

    Patrik

On 3 maj 2008, at 05.11, Mark Davis wrote:

> Let me try again. (And this is not your fault -- I've never liked the way
> that this particular terminology is handled, since it leads to
> misunderstandings.)
>
> Given a code point X:
>
>   1. *gc=Cn* means that there is no *character* assigned for X. That is,
>   X is not an *assigned character*. Note that the long name for gc=Cn is
>   *General_Category=Unassigned* (see
>   http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt)
>   2. *Noncharacter=true* means that X, although *not* assigned as a
>   character, is given a special function. In that sense, it is an
>   *assigned code point*, just not assigned as a character.
>
> There are some other oddities: for example, surrogate code points (gc=Cs)
> are also not assigned characters, but they are not gc=unassigned. Those,
> however, don't seem to cause people much problem conceptually.
> Functionally, as I've said, noncharacters are best thought of as "super
> private use" characters, and they could have been incorporated into the
> general category under that rubric, but that sense evolved over time.
>
> This is all water under the bridge, mostly due to history, as the
> architecture grew in unforeseen ways, and stability policies put into place
> a long time ago prevented changes that would have made it conceptually
> simpler.
>
> Does that help any?
>
> Mark
>
> On Fri, May 2, 2008 at 7:31 PM, Patrik Fältström <patrik at frobbit.se> wrote:
>
>>
>> On 30 apr 2008, at 22.38, Mark Davis wrote:
>>
>>> The code points that are unassigned (gc=Cn) but that should be
>>> DISALLOWED are all and only the Noncharacters.
>>>
>>
>> What has confused me all along is that I interpreted what Mark says
>> as meaning that gc=Cn gives the unassigned codepoints.
>>
>> That is not true, which shows that I misunderstood what he wrote here.
>>
>> gc=Cn gives the unassigned codepoints PLUS the Noncharacter ones.
>>
>> So, one can NOT use gc=Cn as a test for unassigned codepoints. It is more
>> complicated than that.
>>
>>   Patrik
>>
>>
>
>
> -- 
> Mark
