How to know what codepoints are unassigned

Frank Ellermann hmdmhdfmhdjmzdtjmzdtzktdkztdjz at gmail.com
Sun May 4 05:01:43 CEST 2008


Paul Hoffman wrote:

>> AFAIK they are not going to change, no additions,
>> no subtractions, forever.
[...]
> While it seems very likely that all these will be
> non-characters forever, other non-characters could
> be added in the future.

Here's how table F.1 in TUS 5.0 puts it:

Applicable versions
| Unicode 3.1+
Constraints
| The Noncharacter_Code_Point property is an immutable
| code point property, which means that its property
| values for all Unicode code points will never change.

Once a non-character, forever a non-character.  Once
not a non-character, forever not a non-character.  For
a binary property that covers all code points, that
leaves no room for additions.  Or is there a trick to
add more non-characters?
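
For illustration, the fixed set can be enumerated in a few
lines.  This is only a sketch, assuming the usual definition
of the 66 non-characters (U+FDD0..U+FDEF plus the last two
code points of each of the 17 planes); the names
noncharacters() and NONCHARACTERS are made up for this
example.

```python
def noncharacters():
    """Return the sorted list of Unicode non-character code points.

    Assumed definition: U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF
    for every plane n in 0..16.
    """
    cps = list(range(0xFDD0, 0xFDF0))       # 32 code points in the BMP
    for plane in range(17):                 # planes 0 through 16
        cps.append((plane << 16) | 0xFFFE)  # second-to-last code point of the plane
        cps.append((plane << 16) | 0xFFFF)  # last code point of the plane
    return sorted(cps)


# Hypothetical constant name; an immutable set fits an immutable property.
NONCHARACTERS = frozenset(noncharacters())

print(len(NONCHARACTERS))  # 32 + 2 * 17 = 66
```

Because the property is immutable, hard-coding this set
should be safe for any Unicode version from 3.1 on.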

The magic word "immutable" is also associated with the
Pattern_Syntax and Pattern_White_Space properties since
version 4.1 in appendix F (encoding stability policies
for TUS).

Obviously you couldn't use Unicode 4.1 for IDNA-2003.

> If we use the process of identifying unassigned 
> codepoints first, then additionally prohibiting
> noncharacters, we don't need to use the logic you
> have listed here, do we?

Dunno.  If you somehow get the 2048 surrogates + 66
non-characters + private use + tag characters as
DISALLOWED, it is a start.  It could be a nice
plausibility check to design the algorithm in a way that
outputs UNASSIGNED *last* (not first), with the check
"if any UNASSIGNED isn't unassigned, throw a fatal
error".
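
A rough sketch of that plausibility check, not the actual
IDNA algorithm: UNASSIGNED is only the fall-through case
after every other rule, and each fall-through is verified
against the general category.  It assumes Python's
unicodedata module as the data source (its Unicode version
depends on the Python build); classify(), build_table(),
is_noncharacter() and the "OTHER" label are placeholders
invented for this example.

```python
import unicodedata


def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def classify(cp):
    """Hypothetical, heavily simplified classifier.

    The real derivation has many more rules; "OTHER" stands in for
    everything that is decided by them.
    """
    cat = unicodedata.category(chr(cp))
    # Surrogates (Cs), private use (Co) and non-characters.
    if cat in ("Cs", "Co") or is_noncharacter(cp):
        return "DISALLOWED"
    # Assigned tag characters in the Tags block U+E0000..U+E007F.
    if 0xE0000 <= cp <= 0xE007F and cat == "Cf":
        return "DISALLOWED"
    # Placeholder for all remaining rules on assigned code points.
    if cat[0] in "LMNPSZ" or cat in ("Cc", "Cf"):
        return "OTHER"
    # Nothing matched: UNASSIGNED is decided last, not first.
    return "UNASSIGNED"


def build_table():
    """Classify every code point and run the plausibility check."""
    table = {}
    for cp in range(0x110000):
        label = classify(cp)
        if label == "UNASSIGNED":
            # Fatal error if an UNASSIGNED code point is in fact assigned
            # (general category other than Cn) or is a non-character.
            if unicodedata.category(chr(cp)) != "Cn" or is_noncharacter(cp):
                raise RuntimeError("U+%04X is UNASSIGNED but not unassigned" % cp)
        table[cp] = label
    return table
```

Calling build_table() walks all 0x110000 code points; if a
rule above were forgotten (say, the Cf case), the affected
code points would fall through to UNASSIGNED and trigger the
fatal error instead of slipping by unnoticed.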

 Frank


