Reserved general punctuation

Kenneth Whistler kenw at sybase.com
Thu May 1 02:35:35 CEST 2008


Let me step in and take a crack at this. I think at this point
you two are talking past each other and confusing everybody
on the list.

> At 1:38 PM -0700 4/30/08, Mark Davis wrote:
> >It *is* related to Noncharacters. Default_Ignorable_Code_Point is a 
> >derived property. The code points that are unassigned (gc=Cn) but 
> >that should be DISALLOWED are all and only the Noncharacters.

What Mark is saying can be boiled down to the following
observation. The draft-ietf-idna-tables-00.txt currently
contains the following entry in the table:

200E..2071  ; DISALLOWED  # LEFT-TO-RIGHT MARK..SUPERSCRIPT LATIN SMALL

That is incorrect, because there are unassigned code points
in that range. The correct entries in the table should be (for
Unicode 5.1):

200E..2064  ; DISALLOWED  # LEFT-TO-RIGHT MARK..INVISIBLE PLUS
2065..2069  ; UNASSIGNED  # <reserved>..<reserved>
206A..2071  ; DISALLOWED  # INHIBIT SYMMETRIC SWAPPING..SUPERSCRIPT LATI

O.k. Are we all on board with that?

If so, that means that there is either a bug in the statement
of the various property-related classes or in the algorithm
for the table derivation, or both.
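
To make the corrected entries concrete, here is a minimal Python sketch
that splits the single 200E..2071 row into runs by status. It assumes,
as stated above, that 2065..2069 are the only unassigned code points in
that range for Unicode 5.1 and that everything else in the range is
DISALLOWED. The names split_range and UNASSIGNED_5_1 are hypothetical;
this is not the draft's actual derivation code.

```python
from itertools import groupby

# Assumption: within 200E..2071, only 2065..2069 are unassigned in Unicode 5.1.
UNASSIGNED_5_1 = set(range(0x2065, 0x206A))


def split_range(first, last, unassigned=UNASSIGNED_5_1):
    """Split first..last into maximal runs of (start, end, status)."""

    def status(cp):
        # Assumption: every assigned code point in this range is DISALLOWED.
        return "UNASSIGNED" if cp in unassigned else "DISALLOWED"

    runs = []
    for st, group in groupby(range(first, last + 1), key=status):
        cps = list(group)
        runs.append((cps[0], cps[-1], st))
    return runs


for start, end, st in split_range(0x200E, 0x2071):
    print(f"{start:04X}..{end:04X}  ; {st}")
```

Run on 200E..2071, this yields the three rows listed above instead of
the single DISALLOWED row.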

> 
> Then I'm really confused. From the new draft:
> 
> 2.1.3.  IgnorableProperties (C)
> 
>     C: property(cp) is in {Default_Ignorable_Code_Point, White_Space,
>                            Noncharacter_Code_Point}
> 
>     This category is used to group codepoints that are not recommended
>     for use in identifiers.  In general, these codepoints are not
>     suitable for use for IDN.
> 
>     The definition for Default_Ignorable_Code_Point can be found in
>     DerivedCoreProperties.txt [1] (and erratum of 2007-January-25 [2])
>     and is
> 
>     Other_Default_Ignorable_Code_Point + Cf + Cc + Cs
>     + Noncharacter_Code_Point + Variation_Selector
>     - White_Space - FFF9..FFFB (Annotation Characters)

O.k., next problem. Mark pointed out that 2.1.3 itself has not been
correctly updated, because it still gives the definition as of
Unicode 5.0 (plus an erratum notice), instead of the one for Unicode 5.1.

The definition of Default_Ignorable_Code_Point for Unicode 5.0 is:

# Derived Property: Default_Ignorable_Code_Point
#    Other_Default_Ignorable_Code_Point 
#  + Cf 
#  + Cc + Cs 
#  + Noncharacter_Code_Point
#  + Variation_Selector
#  - White_Space 
#  - FFF9..FFFB (Annotation Characters)

The definition of Default_Ignorable_Code_Point for Unicode 5.1 is:

#    Other_Default_Ignorable_Code_Point
#  + Cf (Format characters)
#  + Variation_Selector
#  - White_Space
#  - FFF9..FFFB (Annotation Characters)
#  - 0600..0603, 06DD, 070F (exceptional Cf characters that should be visible)

So, this means that the last two paragraphs of 2.1.3 need to
be updated to provide the *correct* definition of
Default_Ignorable_Code_Point.
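
The Unicode 5.1 definition above is plain set arithmetic, which can be
sketched in Python as follows. The function name and parameters are
hypothetical, and the sample sets are tiny illustrative subsets, not the
real property data. In particular, 2065 is placed in
Other_Default_Ignorable_Code_Point here on the assumption that this is
how the reserved code points enter the derived set.

```python
def default_ignorable_5_1(odicp, cf, variation_selectors, white_space):
    """Derive Default_Ignorable_Code_Point per the Unicode 5.1 formula."""
    annotation = set(range(0xFFF9, 0xFFFC))                 # FFF9..FFFB
    visible_cf = set(range(0x0600, 0x0604)) | {0x06DD, 0x070F}
    return (odicp | cf | variation_selectors) - white_space - annotation - visible_cf


# Illustrative subsets only (assumptions, not full Unicode data):
sample_odicp = {0x2065}
sample_cf = {0x0600, 0x200E, 0x2060, 0xFFF9}
sample_vs = {0xFE00}
sample_ws = {0x0020}

result = default_ignorable_5_1(sample_odicp, sample_cf, sample_vs, sample_ws)
print(sorted(f"{cp:04X}" for cp in result))
```

With these samples, 0600 and FFF9 drop out through the two explicit
exclusions, while 200E, 2060, 2065 and FE00 remain.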

> 
> Why have what whole list of things for "Default_Ignorable_Code_Point" 
> if all we want is Noncharacter_Code_Point, which is already in the 
> list for C? Why not have it at all?

First, I presume that is a typo for "Why have it at all?"

Second, we aren't after *just* noncharacters. Those should certainly
be disallowed, but Default_Ignorable_Code_Point picks up all
the gc=Cf characters that should be DISALLOWED as well.

So what is the problem here?

The problem is an ordering problem in the application of rules
for deriving the table. If you work through the actual derivation
of Default_Ignorable_Code_Point, you get the results which
are explicitly listed in DerivedCoreProperties.txt for Unicode 5.1:

2060..2064    ; Default_Ignorable_Code_Point # Cf   [5] WORD JOINER..INVISIBLE PLUS
2065..2069    ; Default_Ignorable_Code_Point # Cn   [5] <reserved-2065>..<reserved-2069>
206A..206F    ; Default_Ignorable_Code_Point # Cf   [6] INHIBIT SYMMETRIC SWAPPING..NOMINAL DIGIT SHAPES

The entire range 2060..206F is Default_Ignorable_Code_Point=True,
but 2060..2064 and 206A..206F are assigned characters (gc=Cf),
whereas 2065..2069 are *NOT* assigned characters, and hence
are gc=Cn.

The intent of the table derivation for draft-ietf-idna-tables-00.txt,
as I understand it, is that all unassigned code points should
always be UNASSIGNED in the table, regardless of what other
properties they might have in Unicode data files.

Therefore, we have an ordering bug in the algorithm that is
deriving the table, because it has decided that 2065..2069
should be DISALLOWED, based on their occurring in the
class defined by category C, when they clearly should be
UNASSIGNED, based on their status as unassigned in Unicode 5.1.
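
The ordering issue can be sketched as a rule cascade in Python. This is
only an illustration under stated assumptions: classify and the "OTHER"
placeholder are hypothetical, the sample sets cover only 2060..206F as
listed above, and the noncharacter rule follows Mark's statement that
noncharacters are the only gc=Cn code points that should be DISALLOWED.
It is not the draft's actual algorithm.

```python
# Noncharacters: FDD0..FDEF plus the last two code points of each plane.
NONCHARACTERS = set(range(0xFDD0, 0xFDF0)) | {
    (plane << 16) | low for plane in range(17) for low in (0xFFFE, 0xFFFF)
}


def classify(cp, unassigned, category_c):
    """Apply the rules in an order that keeps unassigned code points UNASSIGNED."""
    if cp in NONCHARACTERS:
        return "DISALLOWED"   # the one gc=Cn exception
    if cp in unassigned:
        return "UNASSIGNED"   # must be tested before category C
    if cp in category_c:
        return "DISALLOWED"
    return "OTHER"            # hypothetical placeholder for the remaining rules


# Sample data for Unicode 5.1, taken from the listing above:
unassigned = set(range(0x2065, 0x206A))   # gc=Cn
category_c = set(range(0x2060, 0x2070))   # Default_Ignorable_Code_Point=True

for cp in (0x2060, 0x2065, 0x206A, 0xFFFF):
    print(f"{cp:04X} {classify(cp, unassigned, category_c)}")
```

Swapping the second and third tests reproduces the bug: 2065..2069 would
come out as DISALLOWED because they are in category C.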

O.k., that is complicated, I realize, but I hope that *now* it
is clear what the problems are, at least.

--Ken


