Reserved general punctuation

Kenneth Whistler kenw at sybase.com
Fri May 2 03:22:07 CEST 2008


Frank Ellerman wrote:

> > Pattern_Syntax:
> 
> 
<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B:Pattern_Syntax=True:%5D>
> 
> Good, that covers u+2705 and u+260E.  But tables-00 has
> them as different, u+260E is DISALLOWED, u+2705 is only
> UNASSIGNED.

U+260E is an *assigned* character with gc=So,
so it is DISALLOWED.

U+2705 is an *unassigned* code point with gc=Cn,
so it should be UNASSIGNED.

Incidentally, there is *no* normative relationship between the
unassigned code point U+2705 and the assigned character
U+260E. The cross-references published in the names list
are to assist people in finding dingbats they might be
looking for, based on an assumption about ordering and
repertoire of the Zapf dingbats series 100. But nothing
prevents the UTC or ISO from eventually encoding some
dingbat (or even some other similar symbol) unrelated to a black
telephone at U+2705 -- in which case the cross-reference
would be removed from the list.

>  Bullet 8 (unassigned) matches u+2705, it
> has to match before (bullets 5, 6, or 7 => DISALLOWED).
> 
> (5) 2.1.2 "unstable" is not about unassigned dingbats.
> (6) 2.1.3 "ignorable properties" could do, a definition
>           "not recommended for use in identifiers" is 
>           apparently near to the "pattern syntax" idea.
> (7) 2.1.4 "ignorable blocks" is shaky, you said that it
>           is no good idea to rely on blocks, correct ?

The list of blocks in 2.1.4 is simply a convenience
for getting a listing of characters unsuitable for IDNs
into the DISALLOWED category, without having to list
them all one-by-one. Blocks such as Musical
Symbols are clearly unsuitable in general for IDNs. Most
of them get automatically excluded by virtue of their
General_Category value (gc=So), but some of them are
combining marks that have a General_Category value that
is not excluded by rule. It is easier and more comprehensible
to just exclude the entire block, rather than listing
all the exceptions based on General_Category values.
Nothing of interest to IDNs is ever going to be
encoded in the Musical Symbols block in Unicode.
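
As a rough illustration of why the block shortcut is convenient, here is
a small sketch using Python's unicodedata module. Note the assumptions:
the module reports data for whatever Unicode version ships with the
interpreter (not necessarily 5.1), the function names are made up for
this example, and the gc-only rule is a deliberate simplification rather
than the actual rule set in the tables draft. The range
U+1D100..U+1D1FF is the Musical Symbols block.

```python
import unicodedata

# Musical Symbols block (U+1D100..U+1D1FF).
MUSICAL_SYMBOLS = range(0x1D100, 0x1D200)

def excluded_by_gc(cp):
    """Hypothetical simplified rule: exclude only symbols (gc=So)."""
    return unicodedata.category(chr(cp)) == "So"

def excluded_by_block(cp):
    """Block-based rule: exclude everything in the Musical Symbols block."""
    return cp in MUSICAL_SYMBOLS

# U+1D11E MUSICAL SYMBOL G CLEF is an ordinary symbol (gc=So);
# U+1D167 MUSICAL SYMBOL COMBINING TREMOLO-1 is a combining mark (gc=Mn).
for cp in (0x1D11E, 0x1D167):
    print(f"U+{cp:04X} gc={unicodedata.category(chr(cp))} "
          f"by_gc={excluded_by_gc(cp)} by_block={excluded_by_block(cp)}")
```

The combining mark slips past the gc-only rule but is caught by the
block rule, which is the case the block list is meant to cover.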

> 
> At the moment u+1D127 is DISALLOWED matching rule 7, and
> u+266D is also DISALLOWED, unfortunately I don't see why.

This is the same ordering bug as for U+2065..U+2069.

IMO, all unassigned code points in Unicode should be
UNASSIGNED in the table. That is simply clearer and less
confusing to people than having some unassigned code
points UNASSIGNED and some DISALLOWED based on obscure
criteria that they will have to parse out of the
statement of the algorithm for deriving the table values.

Since neither UNASSIGNED nor DISALLOWED code points can
be in valid IDNs, there isn't a difference in terms of
what IDNs are valid for a given version of the table. The
difference is in clarity of expression of the table for implementers.
 
> Your "pattern syntax" magic catches u+266D, the output is
> apparently limited to the BMP, I don't see u+1D127.  Okay,
> one of the excessively stupid questions, could the rules
> for the tables use "pattern syntax" as a simplification ?

I think adding in the Pattern_Syntax property in the IDN
table definition would just add to the confusion.

View the Pattern_Syntax property and its related
stability guarantee (it is an *immutable* property)
as a guarantee by the Unicode Consortium that no letter
or digit of interest to users of IDNs will ever be encoded
in those ranges in the future. The ranges which have
Pattern_Syntax=True will only get more symbols or
punctuation characters encoded in them in the future.
Stability guarantees like that don't have to be baked
into the IDN table derivation -- at some point you simply
have to trust that the character encoding committees
aren't interested in spraying characters at random into
ranges that make no sense.

Here is the summary:

2065..2069  unassigned in Unicode 5.1 (gc=Cn)
            Default_Ignorable_Code_Point=True
            Pattern_Syntax=False
            
2705        unassigned in Unicode 5.1 (gc=Cn)
            Default_Ignorable_Code_Point=False
            Pattern_Syntax=True
            
1D127       unassigned in Unicode 5.1 (gc=Cn)
            Default_Ignorable_Code_Point=False
            Pattern_Syntax=False
            
The cleanest solution, *by far*, is to simply make all
of these UNASSIGNED in the IDN table.
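
To make the ordering point concrete, here is a minimal sketch, not the
derivation algorithm from the tables draft. The property values are
copied from the summary above. The function names, the single-block
stand-in for the 2.1.4 block list, and the assumption that the
"ignorable properties" rule fires on Default_Ignorable_Code_Point=True
are all mine, for illustration only.

```python
# Property values from the summary above (Unicode 5.1).
# Tuple: (General_Category, Default_Ignorable_Code_Point, Pattern_Syntax)
SUMMARY = {
    0x2065: ("Cn", True, False),
    0x2705: ("Cn", False, True),
    0x1D127: ("Cn", False, False),
}

def in_ignorable_block(cp):
    # Stand-in for the 2.1.4 block list; only Musical Symbols is modeled.
    return 0x1D100 <= cp <= 0x1D1FF

def derive(cp, unassigned_first):
    """Simplified derivation with a switchable position for the Cn test."""
    gc, default_ignorable, _pattern_syntax = SUMMARY[cp]
    if unassigned_first and gc == "Cn":
        return "UNASSIGNED"
    if default_ignorable or in_ignorable_block(cp):
        return "DISALLOWED"
    if gc == "Cn":
        return "UNASSIGNED"
    return "NOT COVERED BY THIS SKETCH"

for cp in SUMMARY:
    print(f"U+{cp:04X}  late Cn test: {derive(cp, False):<10}  "
          f"early Cn test: {derive(cp, True)}")
```

With the Cn test applied late, only U+2705 comes out UNASSIGNED; with
it applied first, all three do. Pattern_Syntax is carried in the data
but never consulted.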

The only transitions you will ever see for these in the
future -- assuming any encoding ever occurs at those
code points -- would be:

2065..2069  assigned to some format character in Unicode n.m (gc=Cf)
            Default_Ignorable_Code_Point=True
            Pattern_Syntax=False
            
2705        assigned to some dingbat-like symbol in Unicode n.m (gc=So)
            Default_Ignorable_Code_Point=False
            Pattern_Syntax=True
            
1D127       assigned to some musical symbol in Unicode n.m (gc=So,Mn,or??)
            Default_Ignorable_Code_Point=False
            Pattern_Syntax=False
            
By the current class definitions and rules, any of those transitions
would make these DISALLOWED in the IDN table.

UNASSIGNED --> DISALLOWED is an anticipated and expected
transition for the table for future versions of Unicode.
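
The transitions listed above can be sketched the same way. This is an
illustration under stated assumptions, not the draft's rules: the
function name is hypothetical, the rules are reduced to the few needed
here, and gc=Mn is picked arbitrarily for 1D127 to show that the block
rule catches it regardless of General_Category.

```python
def classify(gc, default_ignorable, in_excluded_block):
    """Simplified classification; rules reduced to those discussed here."""
    if gc == "Cn":
        return "UNASSIGNED"
    if default_ignorable or in_excluded_block:
        return "DISALLOWED"
    if gc == "So":
        return "DISALLOWED"
    return "NOT COVERED BY THIS SKETCH"

# Hypothetical post-assignment properties, following the list above.
# Tuple: (General_Category, Default_Ignorable_Code_Point, in excluded block)
FUTURE = {
    "2065..2069": ("Cf", True, False),
    "2705": ("So", False, False),
    "1D127": ("Mn", False, True),  # gc chosen arbitrarily for illustration
}

for name, props in FUTURE.items():
    print(f"{name:<11} UNASSIGNED --> {classify(*props)}")
```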

I don't see how bringing Pattern_Syntax into the equation
would make this any clearer for implementers of the table.

--Ken



            


