Reserved general punctuation

Mark Davis mark.davis at icu-project.org
Thu May 1 03:59:00 CEST 2008


I wanted to clear up a few additional items, because I think the terminology
is getting in the way.

*1. unassigned*
In Unicode, what we've been referring to as "unassigned" (more precisely
gc=Cn) means that a code point (from 0 to 10FFFF) is not assigned **to a
character**. The code point may actually have properties even though it does
not represent a character: it might have bidi properties, block properties,
or, as in this case, be default-ignorable or a noncharacter. Sometimes a
code point is called an "unassigned character" when what is meant is that it
is a code point that is not assigned to be a character.
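
As a rough illustration of the distinction, the sketch below uses Python's
standard unicodedata module to read a code point's General_Category. The
answer for most code points depends on the Unicode version bundled with the
Python runtime, so the only gc=Cn example used is U+FFFF, a noncharacter that
stays gc=Cn in every version. The helper name is_unassigned is my own label,
not a Unicode term.

```python
import unicodedata

def is_unassigned(cp: int) -> bool:
    """True if the code point has General_Category Cn (not assigned to a character)."""
    return unicodedata.category(chr(cp)) == "Cn"

print(is_unassigned(0xFFFF))  # U+FFFF is a noncharacter, so gc=Cn
print(is_unassigned(0x0041))  # LATIN CAPITAL LETTER A is assigned (gc=Lu)
```

Note that gc=Cn says nothing about the other properties the code point may
carry, which is the point of the paragraph above.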

*2. noncharacter*
The term "noncharacter" is not the same as "unassigned character". Instead,
noncharacters are a handful of special code points that can best be thought
of as "super-private-use" code points, intended for internal use but not for
interchange. They are and always will be unassigned (gc=Cn), but there are
many (hundreds of thousands of) other gc=Cn code points that are not
noncharacters.
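
To make "a handful" concrete, here is a minimal sketch that enumerates the
noncharacters from their fixed definition: FDD0..FDEF plus the last two code
points of each of the 17 planes. The function name is_noncharacter is
illustrative only.

```python
def is_noncharacter(cp: int) -> bool:
    """True for FDD0..FDEF and for the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
print(len(noncharacters))  # 32 + 2 * 17 = 66
```

All 66 are gc=Cn, but they are a tiny fraction of the gc=Cn code points.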

*3. the problem*
The original email that sparked this whole discussion was my noting a
problem "Other than the Cf issue, I found one other thing. There are
<reserved> characters (that is, General_Category=Cn) that show up as
DISALLOWED when they shouldn't.

2064..2069  ; DISALLOWED  # <reserved>..<reserved>"

I think this may have been misunderstood as meaning that these were the only
code points that had this problem. That is not true; this is just one of
many such cases in tables-05. Another one, in particular, is:

FFFC..FFFF ; DISALLOWED # OBJECT REPLACEMENT CHARACTER..<reserved>

Note that FFFF is a noncharacter.

The tag "<reserved>" is attached to code points that are not assigned, so
any line in the tables that contains DISALLOWED...<reserved> is a case where
a Unicode unassigned code point (gc=Cn) is being categorized as DISALLOWED.
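
One way to find such lines mechanically is sketched below. It assumes the
"range ; VALUE # comment" layout seen in the lines quoted in this thread; the
regular expression and the name suspicious_lines are mine, and the three
sample lines are copied from this thread rather than read from the full draft.

```python
import re

# Sample lines copied from this thread; a real run would read the whole table.
TABLE_LINES = [
    "2064..2069  ; DISALLOWED  # <reserved>..<reserved>",
    "FFFC..FFFF ; DISALLOWED # OBJECT REPLACEMENT CHARACTER..<reserved>",
    "200E..2064  ; DISALLOWED  # LEFT-TO-RIGHT MARK..INVISIBLE PLUS",
]

# range or single code point ; derived value # comment
LINE_RE = re.compile(r"^\s*([0-9A-F]+(?:\.\.[0-9A-F]+)?)\s*;\s*(\w+)\s*#\s*(.*)$")

def suspicious_lines(lines):
    """Yield (range, comment) for DISALLOWED entries whose comment names <reserved>."""
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group(2) == "DISALLOWED" and "<reserved>" in m.group(3):
            yield m.group(1), m.group(3)

for rng, comment in suspicious_lines(TABLE_LINES):
    print(rng, "->", comment)
```

This only catches ranges whose first or last code point is reserved, because
the comment names just the two endpoints. A range with unassigned code points
in its interior would need a check against the property data itself.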

*4. default ignorables*
Assigned default ignorable characters are those that are normally invisible
if not supported, and thus should be DISALLOWED. There are also default
ignorable *unassigned* code points, such as 2069 above. Those are in areas
reserved for future Default Ignorables. (Default ignorables used to contain
noncharacters, but don't in U5.1, for reasons explained earlier.)



So, where do we go from here? I'll try to set out the options.

*A. First test for unassigned:*

gc=Cn => UNASSIGNED
(default ignorable & gc!=Cn) => DISALLOWED

Advantage: it means that Unicode unassigned = UNASSIGNED, which is
conceptually simpler for people. Functionally, this works, because
noncharacters will never change from UNASSIGNED and thus never be allowed in
labels, and unassigned default ignorable characters will become DISALLOWED
as soon as they are assigned.


*B. First test for noncharacters, then unassigned*

noncharacters => DISALLOWED
(default ignorable & gc!=Cn)  => DISALLOWED
(gc=Cn - noncharacters) => UNASSIGNED

Advantage: we know that noncharacters will never be PVALID so we might as
well indicate that.


*C. First test for noncharacters and default ignorable, then for unassigned*

noncharacters => DISALLOWED
(default ignorable) => DISALLOWED
(gc=Cn - noncharacters - default_ignorable) => UNASSIGNED

Advantage: we know that noncharacters and default ignorables will never be
PVALID so we might as well indicate that.
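
To compare the three options side by side, here is a minimal sketch. The
property values for the four sample code points are entered by hand for
Unicode 5.1 (an assumption on my part, not read from the UCD), the names
Props, SAMPLES and option_a/b/c are illustrative, and None stands for "decided
by other rules not covered here".

```python
from typing import NamedTuple, Optional

class Props(NamedTuple):
    gc: str                  # General_Category
    default_ignorable: bool  # Default_Ignorable_Code_Point
    noncharacter: bool       # Noncharacter_Code_Point

# Hand-entered Unicode 5.1 values (assumed) for four sample code points.
SAMPLES = {
    0x2064: Props("Cf", True, False),   # INVISIBLE PLUS
    0x2065: Props("Cn", True, False),   # <reserved>, default ignorable
    0xFFFF: Props("Cn", False, True),   # noncharacter
    0x0378: Props("Cn", False, False),  # plain <reserved>
}

def option_a(p: Props) -> Optional[str]:
    if p.gc == "Cn":
        return "UNASSIGNED"
    if p.default_ignorable:
        return "DISALLOWED"
    return None

def option_b(p: Props) -> Optional[str]:
    if p.noncharacter:
        return "DISALLOWED"
    if p.default_ignorable and p.gc != "Cn":
        return "DISALLOWED"
    if p.gc == "Cn":
        return "UNASSIGNED"
    return None

def option_c(p: Props) -> Optional[str]:
    if p.noncharacter or p.default_ignorable:
        return "DISALLOWED"
    if p.gc == "Cn":
        return "UNASSIGNED"
    return None

for cp, props in SAMPLES.items():
    print(f"{cp:04X}", option_a(props), option_b(props), option_c(props))
```

On these samples the options disagree only about FFFF (UNASSIGNED under A,
DISALLOWED under B and C) and 2065 (UNASSIGNED under A and B, DISALLOWED under
C). 2064 is DISALLOWED and 0378 is UNASSIGNED under all three.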

I actually don't care too much which of these options we choose -
functionally it doesn't make a difference. Here I'm just trying to present a
clearer picture of the situation.

Mark

On Wed, Apr 30, 2008 at 5:35 PM, Kenneth Whistler <kenw at sybase.com> wrote:

> Let me step in and take a crack at this. I think at this point
> you two are talking past each other and confusing everybody
> on the list.
>
> > At 1:38 PM -0700 4/30/08, Mark Davis wrote:
> > >It *is* related to Noncharacters. Default_Ignorable_Code_Point is a
> > >derived property. The code points that are unassigned (gc=Cn) but
> > >that should be DISALLOWED are all and only the Noncharacters.
>
> What Mark is saying can be boiled down to the following
> observation. The draft-ietf-idna-tables-00.txt currently
> contains the following entry in the table:
>
> 200E..2071  ; DISALLOWED  # LEFT-TO-RIGHT MARK..SUPERSCRIPT LATIN SMALL
>
> That is incorrect, because there are unassigned characters
> in that range. The correct entries in the table should be (for
> Unicode 5.1):
>
> 200E..2064  ; DISALLOWED  # LEFT-TO-RIGHT MARK..INVISIBLE PLUS
> 2065..2069  ; UNASSIGNED  # <reserved>..<reserved>
> 206A..2071  ; DISALLOWED  # INHIBIT SYMMETRIC SWAPPING..SUPERSCRIPT LATI
>
> O.k. Are we all on board with that?
>
> If so, that means that there is either a bug in the statement
> of the various property-related classes or in the algorithm
> for the table derivation, or both.
>
> >
> > Then I'm really confused. From the new draft:
> >
> > 2.1.3.  IgnorableProperties (C)
> >
> >     C: property(cp) is in {Default_Ignorable_Code_Point, White_Space,
> >                            Noncharacter_Code_Point}
> >
> >     This category is used to group codepoints that are not recommended
> >     for use in identifiers.  In general, these codepoints are not
> >     suitable for use for IDN.
> >
> >     The definition for Default_Ignorable_Code_Point can be found in
> >     DerivedCoreProperties.txt [1] (and erratum of 2007-January-25 [2])
> >     and is
> >
> >     Other_Default_Ignorable_Code_Point + Cf + Cc + Cs
> >     + Noncharacter_Code_Point + Variation_Selector
> >     - White_Space - FFF9..FFFB (Annotation Characters)
>
> O.k. next problem. Mark pointed out that 2.1.3 itself has not been
> correctly updated, because it still reflects a statement as of
> Unicode 5.0 (plus an erratum notice), instead of Unicode 5.1.
>
> The definition of Default_Ignorable_Code_Point for Unicode 5.0 is:
>
> # Derived Property: Default_Ignorable_Code_Point
> #    Other_Default_Ignorable_Code_Point
> #  + Cf
> #  + Cc + Cs
> #  + Noncharacter_Code_Point
> #  + Variation_Selector
> #  - White_Space
> #  - FFF9..FFFB (Annotation Characters)
>
> The definition of Default_Ignorable_Code_Point for Unicode 5.1 is:
>
> #    Other_Default_Ignorable_Code_Point
> #  + Cf (Format characters)
> #  + Variation_Selector
> #  - White_Space
> #  - FFF9..FFFB (Annotation Characters)
> #  - 0600..0603, 06DD, 070F (exceptional Cf characters that should be visible)
>
> So, this means that the last two paragraphs of 2.1.3 need to
> be updated to provide the *correct* definition of
> Default_Ignorable_Code_Point.
>
> >
> > Why have that whole list of things for "Default_Ignorable_Code_Point"
> > if all we want is Noncharacter_Code_Point, which is already in the
> > list for C? Why not have it at all?
>
> First, I presume that is a typo for "Why have it at all?"
>
> Second, we aren't after *just* noncharacters. Those should certainly
> be disallowed, but Default_Ignorable_Code_Point picks up all
> the gc=Cf characters that should be DISALLOWED as well.
>
> So what is the problem here?
>
> The problem is an ordering problem in the application of rules
> for deriving the table. If you work through the actual derivation
> of Default_Ignorable_Code_Point, you get the results which
> are explicitly listed in DerivedCoreProperties.txt for Unicode 5.1:
>
> 2060..2064    ; Default_Ignorable_Code_Point # Cf   [5] WORD JOINER..INVISIBLE PLUS
> 2065..2069    ; Default_Ignorable_Code_Point # Cn   [5] <reserved-2065>..<reserved-2069>
> 206A..206F    ; Default_Ignorable_Code_Point # Cf   [6] INHIBIT SYMMETRIC SWAPPING..NOMINAL DIGIT SHAPES
>
> The entire range 2060..206F is Default_Ignorable_Code_Point=True,
> but, 2060..2064 and 206A..206F are assigned characters (and gc=Cf),
> whereas 2065..2069 are *NOT* assigned characters, and hence
> are gc=Cn.
>
> The intent of the table derivation for draft-ietf-idna-tables-00.txt,
> as I understand it, is that all unassigned code points should
> always be UNASSIGNED in the table, regardless of what other
> properties they might have in Unicode data files.
>
> Therefore, we have an ordering bug in the algorithm that is
> deriving the table, because it has decided that 2065..2069
> should be DISALLOWED, based on their occurring in the
> class defined by category C, when they clearly should be
> UNASSIGNED, based on their status as unassigned in Unicode 5.1.
>
> O.k., that is complicated, I realize, but I hope that *now* it
> is clear what the problems are, at least.
>
> --Ken
>
> _______________________________________________
> Idna-update mailing list
> Idna-update at alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/idna-update
>
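
The Unicode 5.1 derivation of Default_Ignorable_Code_Point quoted in Ken's
message can be followed with plain set arithmetic. The sketch below uses small
hand-entered excerpts of each input set (assumed values for Unicode 5.1, not
the full properties), so it only shows how 2060..206F ends up in the result,
including the unassigned 2065..2069. The helper span and the variable names
are mine.

```python
def span(lo: int, hi: int) -> set:
    """Inclusive range of code points as a set."""
    return set(range(lo, hi + 1))

# Hand-entered Unicode 5.1 excerpts (assumed), not the complete property sets.
other_default_ignorable = span(0x2065, 0x2069)
cf = (span(0x0600, 0x0603) | {0x06DD, 0x070F}
      | span(0x2060, 0x2064) | span(0x206A, 0x206F) | span(0xFFF9, 0xFFFB))
variation_selector = span(0xFE00, 0xFE0F)
white_space = {0x0020, 0x00A0, 0x2028, 0x2029}
annotation = span(0xFFF9, 0xFFFB)
visible_cf = span(0x0600, 0x0603) | {0x06DD, 0x070F}

# Other_Default_Ignorable_Code_Point + Cf + Variation_Selector
#   - White_Space - FFF9..FFFB - exceptional visible Cf characters
default_ignorable = (
    (other_default_ignorable | cf | variation_selector)
    - white_space - annotation - visible_cf
)

print(sorted(f"{cp:04X}" for cp in default_ignorable if 0x2060 <= cp <= 0x206F))
```

The reserved code points 2065..2069 enter through
Other_Default_Ignorable_Code_Point, which is why a rule that tests
Default_Ignorable_Code_Point before testing gc=Cn marks them DISALLOWED.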



-- 
Mark