Reserved general punctuation

Frank Ellermann
Fri May 2 10:55:14 CEST 2008

Kenneth Whistler wrote:
> Because U+260E is an *assigned* character with gc=So,
> so it is DISALLOWED.
> And because U+2705 is an *unassigned* code point with
> gc=Cn, so should be UNASSIGNED.

That is the definition at the moment.  I see UNASSIGNED
as an invitation to abuse where it covers code points
that will never be allowed.

I hope to get cases like U+2705 into the DISALLOWED set,
where they can't attract abuse attempts.
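
The difference between the two positions can be sketched as
follows.  This is only an illustration, assuming Python's
unicodedata module, whose data reflects whatever Unicode
version the interpreter ships and not necessarily 5.1.  The
function names and the RESERVED_DISALLOWED set are
hypothetical, and only the gc=So and gc=Cn cases quoted
above are modelled:

```python
import unicodedata

# Hypothetical set: unassigned code points pinned to DISALLOWED up front.
RESERVED_DISALLOWED = {0x2705}

def status_by_gc(gc):
    """Current definition, reduced to the two cases quoted above."""
    if gc == "Cn":
        return "UNASSIGNED"
    if gc == "So":
        return "DISALLOWED"
    return "NOT MODELLED"  # all other categories are out of scope here

def status_current(cp):
    """Status derived purely from the General_Category of the code point."""
    return status_by_gc(unicodedata.category(chr(cp)))

def status_proposed(cp):
    """Same, except that reserved code points are DISALLOWED right away."""
    if cp in RESERVED_DISALLOWED:
        return "DISALLOWED"
    return status_current(cp)

print(status_current(0x260E))   # an assigned symbol, gc=So
print(status_proposed(0x2705))  # pinned, whatever its gc is in this version
```

With the pinned set the answer for U+2705 does not depend on
the Unicode version the table was built from.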

> But nothing prevents the UTC or ISO from eventually
> encoding some dingbat (or even some other similar
> symbol) unrelated to a black telephone at U+2705 --
> in which case the cross-reference would be removed
> from the list.

OTOH you already know and guarantee that whatever gets
this code point, if anything, will be DISALLOWED.  It
would be bad to lose this critical info only to get a
simpler definition of UNASSIGNED == unassigned.

>> (7) 2.1.4 "ignorable blocks" is shaky, you said that it
>>           is no good idea to rely on blocks, correct ?
> The list of blocks in 2.1.4 are simply a convenience
> for getting a listing of characters unsuitable for IDNs
> into the DISALLOWED category, without having to list
> them all one-by-one.

Clear, but if "pattern syntax" gets us in essence the same
desired output as a complex set of rules including blocks,
which as Mark said are not guaranteed to be stable, then
I'm all for the simpler approach.

Maybe "pattern syntax" can simply replace 2.1.3 + 2.1.4,
and cover other oddities which should be DISALLOWED.

The definition "can never be used in an identifier" is
apparently perfect for the IDNAbis purposes.  There are
even rules about creating profiles for fine tuning, so
why not build on this work ?

 [Musical Symbols]
> some of them are combining marks that have a 
> General_Category value that is not excluded by rule. 
> It is easier and more comprehensible to just exclude
> the entire block

Okay, under the tables-00 rules.  I'm not fluent with
set operations on TUS categories: would rules based on
"pattern syntax" catch the "musical combining marks" ?

> IMO, all unassigned code points in Unicode should be
> UNASSIGNED in the table. That is simply clearer and less
> confusing to people than having some unassigned code
> points UNASSIGNED and some DISALLOWED based on obscure
> criteria that they will have to parse out of the
> statement of the algorithm for deriving the table values.

At the moment they are somewhat obscure.  With "pattern
syntax" it could be clear enough:  never identifier =>
never IDNAbis.

And I very much prefer DISALLOWED wherever it is possible;
UNASSIGNED is a wide field and a moving target.

> there isn't a difference in terms of what IDNs are valid
> for a given version table. The difference is in clarity
> of expression of the table for implementers.

Please correct me if that's wrong, but I think DISALLOWED
means "yes, you can put this in ROM designed to work for
some decades", while UNASSIGNED merely means "please do
not use at the moment".  TLD registrars are expected to
be strict about UNASSIGNED, but everybody else is free to
try something else, and some folks will be up to no good.

I have no clear idea what those folks could do with, say,
an "unassigned black telephone", but I'm sure they will
figure something out.  Maybe it gives them xn--cocacola.
After I found Simon's nice libidn online tool it did not
take me long to find that the U-label U+7CBA U+7B80 is "for
me", just a rather harmless abuse example.  It also did not
take me long to find the "disable IDNA display" browser
switch.
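
For anyone who wants to play with such labels offline, here
is a minimal sketch using the "punycode" codec from the
Python standard library (RFC 3492) instead of the libidn
online tool.  The helper names are hypothetical, and the
sketch does no Nameprep and no table lookup at all, so it
says nothing about whether a label is valid:

```python
ACE_PREFIX = "xn--"

def to_a_label(u_label):
    """Punycode-encode a single label and add the ACE prefix."""
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")

def to_u_label(a_label):
    """Strip the ACE prefix and decode the Punycode part."""
    if not a_label.lower().startswith(ACE_PREFIX):
        raise ValueError("not an A-label: %r" % a_label)
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")

label = "\u7cba\u7b80"
a_label = to_a_label(label)
print(a_label)
print(to_u_label(a_label) == label)
```

The point is only that choosing code points to obtain a
wanted ASCII form is a matter of minutes.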

> Here is the summary:
[... thanks ...]

> 1D127       unassigned in Unicode 5.1 (gc=Cn)
>             Default_Ignorable_Code_Point=False
>             Pattern_Syntax=False

Okay, "pattern syntax" doesn't do what I want for U+1D127,
too bad, or is this something that could be fixed in 5.2 ?
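
The rule "never identifier => never IDNAbis" and its gap can
be sketched like this.  The range list is a partial excerpt
of Pattern_Syntax for the symbol area, written down from
memory as an assumption, and has to be checked against
PropList.txt before anybody relies on it.  The function
names and the "OTHER RULES APPLY" placeholder are
hypothetical:

```python
# Assumed partial excerpt of Pattern_Syntax (symbol area only);
# verify against PropList.txt before any real use.
PATTERN_SYNTAX_EXCERPT = [
    (0x2190, 0x245F),
    (0x2500, 0x2775),
    (0x2794, 0x2BFF),
]

def is_pattern_syntax(cp):
    """True if cp falls into one of the excerpted ranges."""
    return any(lo <= cp <= hi for lo, hi in PATTERN_SYNTAX_EXCERPT)

def status(cp, assigned):
    """Pattern_Syntax wins, even for code points that are still unassigned."""
    if is_pattern_syntax(cp):
        return "DISALLOWED"
    return "OTHER RULES APPLY" if assigned else "UNASSIGNED"

print(status(0x260E, assigned=True))    # assigned symbol
print(status(0x2705, assigned=False))   # unassigned in 5.1, inside the excerpt
print(status(0x1D127, assigned=False))  # unassigned in 5.1, Pattern_Syntax=False
```

Under these assumptions U+2705 lands in DISALLOWED although
it is unassigned, while U+1D127 stays UNASSIGNED because it
is outside Pattern_Syntax.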

> UNASSIGNED --> DISALLOWED is an anticipated and expected
> transition for the table for future versions of Unicode.

Yes, but it is not the only possible transition, otherwise
we could join the sets and be done with it.  

