Reserved general punctuation

Kenneth Whistler kenw at sybase.com
Fri May 2 20:22:07 CEST 2008


Frank Ellermann said:

> That is the definition at the moment.  I see UNASSIGNED
> as an invitation to abuse where it is about code points
> that will be never allowed.
> 
> I hope to get cases like u+2705 into the DISALLOWED set,
> where they can't attract abuse attempts.

Can you provide a clear example of what kind of abuse
you envision?

As I see it, U+2705, under the rules proposed here,
cannot be in an IDN in 2008, or in 2018, or in 2028,
even if it is by then encoded as some kind of symbol
dingbat in a later version of Unicode.

I just don't see the marginal value here of trying
to take some specific ranges of unassigned code points
in Unicode and explicitly designating them in IDNA 2008
as more toxic than ordinary unassigned code points. It
just seems to invite confusion about the status of
code points in the table -- and I view confusion as
the more likely cause of attracting abuse, rather than
anything specific about U+2705.


> Maybe "pattern syntax" can simply replace 2.1.3 + 2.1.4,
> and cover other oddities which should be DISALLOWED.

It doesn't cover everything you wanted to cover -- and
in particular not the blocks of 2.1.4.


> Okay, under the tables-00 rules.  I'm not fluent with
> set operations on TUS categories, would rules based on
> "pattern syntax" catch the "musical combining marks" ?

No, they would not.


> And I very much prefer DISALLOWED wherever it is possible,
> UNASSIGNED is a wide field and moving target.
> 
> > there isn't a difference in terms of what IDNs are valid
> > for a given version table. The difference is in clarity
> > of expression of the table for implementers.
> 
> Please correct me if that's wrong, but I think DISALLOWED
> means "yes, you can put this in ROM designed to work for
> some decades", while UNASSIGNED merely means "please do
> not use at the moment".  TLD registrars are expected to
> be strict about UNASSIGNED, but everybody else is free to
> try something else, and some folks will be up to no good.
> 
> I have no clear vision what those folks could do with say
> an "unassigned black telephone", but I'm sure they figure
> something out.  

There are hundreds of thousands of reserved, unassigned
code points in Unicode. If people are going to make
mischief with them by fiddling with Punycode labels,
I don't see much in the way of incremental gains to
be had in the protocol definition by worrying about
what they might do with the reserved, unassigned code
point U+2705, as opposed to the reserved, unassigned
code point U+170D in the Tagalog block or the reserved,
unassigned code point U+4DBF in the CJK Unified Ideographs
Extension A block, for example.

Oh, and I reiterate: U+2705 is not an "unassigned black
telephone" -- it is simply one more reserved, unassigned
code point.
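
For anyone who wants to check this kind of thing mechanically, here is
a minimal sketch in Python. It rests on an assumption about tooling:
the standard library does not ship the Unicode 5.1 data discussed in
this thread, so the sketch queries the frozen Unicode 3.2.0 database
that Python exposes as unicodedata.ucd_3_2_0. In that version all three
code points are reserved and unassigned (gc=Cn). The function name is
made up for illustration.

```python
import unicodedata

# The three reserved code points named in the message above.
CODE_POINTS = [0x2705, 0x170D, 0x4DBF]


def category_in_3_2(cp: int) -> str:
    """General category of a code point in the frozen Unicode 3.2.0 data.

    Assumption: 3.2.0 stands in for the 5.1 data discussed in the thread,
    because it is the only older database the standard library ships.
    """
    return unicodedata.ucd_3_2_0.category(chr(cp))


if __name__ == "__main__":
    for cp in CODE_POINTS:
        # unicodedata.category() uses whatever Unicode version this
        # interpreter was built with, so the second column may differ.
        print(f"U+{cp:04X}  3.2.0: {category_in_3_2(cp)}  "
              f"{unicodedata.unidata_version}: {unicodedata.category(chr(cp))}")
```

The second column depends on the Unicode version of the interpreter in
use, which is the "moving target" aspect of UNASSIGNED: a code point is
Cn only until some later version assigns it.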

> > 1D127       unassigned in Unicode 5.1 (gc=Cn)
> >             Default_Ignorable_Code_Point=False
> >             Pattern_Syntax=False
> 
> Okay, "pattern syntax" doesn't do what I want for u+1D127,
> too bad, or is this something that could be fixed in 5.2 ?

No, precisely because the Pattern_Syntax property is
an *immutable* property. The UTC has already specified
that the set of code points with that property can neither
be extended nor reduced. That was part of the point in
defining that property in the first place.
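
To illustrate what a fixed set means in practice, here is a rough
sketch in Python. Python's standard library does not expose
Pattern_Syntax, so the table below is a hand-copied excerpt of just two
ranges that I take to be listed for that property in PropList.txt; it
is an assumption for illustration, not the full property, and the names
are made up. Because the property is immutable, a table like this never
needs regenerating for a later Unicode version.

```python
# Excerpt only (assumed from PropList.txt): two of the ranges that carry
# Pattern_Syntax=True. The real property covers more ranges than these.
PATTERN_SYNTAX_EXCERPT = (
    (0x2190, 0x2BFF),  # arrows, math, technical, dingbats, and so on
    (0x2E00, 0x2E7F),  # Supplemental Punctuation
)


def in_pattern_syntax_excerpt(cp: int) -> bool:
    """True if cp falls in one of the excerpted Pattern_Syntax ranges.

    A False result only means "not in this excerpt"; for U+1D127 the
    message above already states Pattern_Syntax=False.
    """
    return any(lo <= cp <= hi for lo, hi in PATTERN_SYNTAX_EXCERPT)


if __name__ == "__main__":
    for cp in (0x2705, 0x1D127):
        print(f"U+{cp:04X} in excerpt: {in_pattern_syntax_excerpt(cp)}")
```

A rule keyed on Pattern_Syntax would therefore pick up U+2705 whether or
not it is ever assigned, and would never pick up U+1D127.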

--Ken




More information about the Idna-update mailing list