idnabis-tables-04 problem #1: Inconsistencies in category definitions

Kenneth Whistler kenw at sybase.com
Sat Dec 6 00:52:33 CET 2008


Patrik, et al.,

I'm going to provide a series of analyses of detailed
textual (and technical) problems in tables-04, but I'm
breaking them out into separate notes, organized by
micro-topic, so that (hopefully) any follow-up
discussion won't pick up one nit out of a dozen topics
and wander with it.

Problem #1: Inconsistencies in category definitions

This has to do specifically with the way each of the
category definitions in Sections 2.1 through 2.10 is
stated. To make it easier to follow, I'll cite each
of the specific wordings below, and then give the
way I think it needs to be corrected, along with the
explanation of what is wrong with the current
formulation (if not self-evident).

Note that for this particular problem, the intent
here is *not* to change the derivation in any way,
but merely to fix the imprecision and inconsistency
in the current formulation of these category definitions.

--Ken

==========================================================

2.1 LetterDigits (A)

Current:

A: generalCategory(cp) is in {Ll, Lu, Lo, Nd, Lm, Mn, Mc}

Suggested fix:

A: General_Category(cp) is in {Ll, Lu, Lo, Nd, Lm, Mn, Mc}

Rationale: 

As will be evident in more problematical
formulations in subsequent sections, I think the cleanest
and most self-evident way for a specification to
designate a functional operation which returns the
property value of a particular property for a code point
is simply to tack "(cp)" onto the formal name of the
Unicode property, rather than creating an arbitrarily
named function and then describing (or not describing) in
subsequent text what it is supposed to mean.

This then entails deleting the following sentence in Section
2.1:

"The generalCategory() operation returns the General Category
for a particular Unicode code point."

And replacing it with a generic statement as the last
paragraph of Section 2, above Section 2.1:

"In the following specification of categories, the operation
which returns the value of a particular Unicode character
property for a code point is designated by using the
formal name of that property (from PropertyAliases.txt)
followed by '(cp)'. For example, the value of the
General_Category property for a code point is indicated
by General_Category(cp)."

Once you have done that, then the rest of the formulations
can be done consistently.
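
To illustrate the convention, here is a minimal sketch in
Python of what General_Category(cp) and the category A test
would look like. It assumes that the standard unicodedata
module (whose data is tied to whatever Unicode version the
interpreter ships) is an acceptable stand-in for the property
lookup. The names general_category and is_letter_digits are
mine, not from tables-04.

```python
import unicodedata

# General_Category values listed in the category A definition.
LETTER_DIGIT_CATEGORIES = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}


def general_category(cp: int) -> str:
    """Return the General_Category value of a code point (short alias)."""
    return unicodedata.category(chr(cp))


def is_letter_digits(cp: int) -> bool:
    """A: General_Category(cp) is in {Ll, Lu, Lo, Nd, Lm, Mn, Mc}."""
    return general_category(cp) in LETTER_DIGIT_CATEGORIES


if __name__ == "__main__":
    for cp in (0x0061, 0x0041, 0x0030, 0x0301, 0x002D):
        print(f"U+{cp:04X} {general_category(cp)} {is_letter_digits(cp)}")
```

The point is only that the function name is the property
name, so no separate prose definition of the function is
needed.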

==========================================================

2.2 Unstable (B)

Current:

B: toNFKC(toCasefold(toNFKC(cp))) != cp

Suggested fix:

None needed. This is o.k. These are not property names, but
functional operations defined elsewhere.

However, in the third paragraph, for consistency:

toCaseFold...
-->
The toCasefold() operation...
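
For reference, the category B test can be sketched in Python
as follows. This assumes that unicodedata.normalize("NFKC", ...)
and str.casefold() are adequate stand-ins for toNFKC() and
toCasefold(); the results depend on the Unicode version of the
interpreter, and the helper names are mine.

```python
import unicodedata


def to_nfkc(s: str) -> str:
    """Stand-in for toNFKC(): Normalization Form KC."""
    return unicodedata.normalize("NFKC", s)


def to_casefold(s: str) -> str:
    """Stand-in for toCasefold(): Python's full case folding."""
    return s.casefold()


def is_unstable(cp: int) -> bool:
    """B: toNFKC(toCasefold(toNFKC(cp))) != cp."""
    original = chr(cp)
    return to_nfkc(to_casefold(to_nfkc(original))) != original


if __name__ == "__main__":
    for cp in (0x0061, 0x0041, 0x00DF, 0xFF41):
        print(f"U+{cp:04X} unstable={is_unstable(cp)}")
```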

==========================================================

2.3 IgnorableProperties (C)

Current:

C: property(cp) is in {Default_Ignorable_Code_Point, White_Space,
                       Noncharacter_Code_Point}
                       
Suggested:

C: Default_Ignorable_Code_Point(cp) = True
   or
   White_Space(cp) = True
   or
   Noncharacter_Code_Point(cp) = True
   
Rationale:

There is currently a false analogy being made in the formulation
of this category, in the attempt to make all of these
formulations look like equivalent types of set operations.
In this case, however, "property(cp)" is not a defined
property evaluation at all, nor is it well-defined as
a function on a code point -- Unicode code points actually
have multiple properties, and *every* code point has a
property value for White_Space, for example. So the
current formulation is erroneous, misleading, and
non-parallel to the use of General_Category(cp) in 2.1.

The current text is implicitly trying to answer the
question, "Which Unicode character properties, if true
for a code point, let us decide that the character can
be treated as 'ignorable' and is thereby appropriate
for classification as DISALLOWED by the algorithm?"
But turning *that* question into pseudo-set notation is
not the correct way to define a set of code points,
whereas my suggested reformulation is, and is consistent
then with 2.1.
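
A minimal Python sketch of the distinction: each binary
property is its own function from code point to True/False,
and category C is the disjunction of three of them. The
Noncharacter_Code_Point test below is computed from its fixed
ranges. The two TOY_ sets are hypothetical stand-ins holding
only a few sample code points, not the real property data,
and all names here are mine.

```python
from typing import Callable

BinaryProperty = Callable[[int], bool]

# Hypothetical, deliberately incomplete samples for illustration only.
TOY_WHITE_SPACE = {0x0020, 0x00A0, 0x3000}
TOY_DEFAULT_IGNORABLE = {0x00AD, 0x200B}


def white_space(cp: int) -> bool:
    """White_Space(cp), restricted to the toy sample above."""
    return cp in TOY_WHITE_SPACE


def default_ignorable_code_point(cp: int) -> bool:
    """Default_Ignorable_Code_Point(cp), restricted to the toy sample."""
    return cp in TOY_DEFAULT_IGNORABLE


def noncharacter_code_point(cp: int) -> bool:
    """Noncharacter_Code_Point(cp): U+FDD0..U+FDEF and U+nFFFE/U+nFFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_ignorable_properties(
    cp: int,
    dicp: BinaryProperty = default_ignorable_code_point,
    ws: BinaryProperty = white_space,
    nchar: BinaryProperty = noncharacter_code_point,
) -> bool:
    """C: any of the three binary properties is True for cp."""
    return dicp(cp) or ws(cp) or nchar(cp)


if __name__ == "__main__":
    for cp in (0x0020, 0x00AD, 0xFFFE, 0x0061):
        print(f"U+{cp:04X} C={is_ignorable_properties(cp)}")
```

Every one of these functions returns a value for every code
point, which is why a single "property(cp)" cannot be given
a sensible meaning.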

==========================================================

2.4 IgnorableBlocks (D)

Current:

D: block(cp) in {Combining Diacritical Marks for Symbols,
                 Musical Symbols, Ancient Greek Musical Notation}
                 
Suggested:

D: Block(cp) is in {Combining Diacritical Marks for Symbols,
                    Musical Symbols, Ancient Greek Musical Notation}
                    
Rationale:

Capitalize "Block" for consistency with other property name
usages. Add "is" for consistency with the formulation of
the other set notations for multiple property values.
Otherwise this is o.k.                                     

==========================================================

2.5 LDH (E)

Current:

E: cp is in {002D, 0030..0039, 0061..007A}

O.k. No problem.

==========================================================

2.6 Exceptions (F)

Current:

F: cp in {002D, 00B7, 00DF, 02B9, 0375, 0483, 05F3, 05F4, 06FD,
          06FE, 0F0B, 3005, 3007, 302E, 302F, 303B, 30FB}
          
Suggested:

F: cp is in {002D, 00B7, 00DF, 02B9, 0375, 0483, 05F3, 05F4, 06FD,
             06FE, 0F0B, 3005, 3007, 302E, 302F, 303B, 30FB}
          
Add "is" for consistency.

==========================================================

2.7 BackwardCompatible (G)

Current:

G: cp in {}

Suggested:

G: cp is in {}

Add "is" for consistency.

==========================================================

2.8 JoinControl (H)

Current:

H: property(cp) is in {Join_Control}

Suggested:

H: Join_Control(cp) = True

Rationale:

This is the same type of problem as for 2.3.

==========================================================

2.9 OldHangulJamo (I)

Current:

I: HangulSyllableType(cp) is in {L, V, T}

Suggested:

I: Hangul_Syllable_Type(cp) is in {L, V, T}

Rationale:

Use the formal name of the property here.

==========================================================

2.10 Unassigned (J)

Current:

J: cp is in {Cn} and property(cp) is not in {Noncharacter_Code_Point}

Suggested:

J: General_Category(cp) is in {Cn}
   and
   Noncharacter_Code_Point(cp) = False
   
Rationale:

This definition was doubly inconsistent, because for the first
part a code point is not *in* a set of values of a property,
and the second part has the same generic "property(cp)" problem
noted above for 2.3 and 2.8.   
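
The corrected category J reads directly as code. A minimal
sketch in Python, assuming the standard unicodedata module as
the source of General_Category (so the answer depends on the
interpreter's Unicode version); the function names are mine.

```python
import unicodedata


def general_category(cp: int) -> str:
    """Return the General_Category value of a code point."""
    return unicodedata.category(chr(cp))


def noncharacter_code_point(cp: int) -> bool:
    """Noncharacter_Code_Point(cp): U+FDD0..U+FDEF and U+nFFFE/U+nFFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_unassigned(cp: int) -> bool:
    """J: General_Category(cp) is in {Cn} and Noncharacter_Code_Point(cp) = False."""
    return general_category(cp) == "Cn" and not noncharacter_code_point(cp)


if __name__ == "__main__":
    # U+FFFF is a noncharacter: its General_Category is Cn, but it is not in J.
    for cp in (0x0061, 0xFDD0, 0xFFFF):
        print(f"U+{cp:04X} {general_category(cp)} J={is_unassigned(cp)}")
```

Both halves of the test are property evaluations on cp, which
is what the original "cp is in {Cn}" wording obscured.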

==========================================================



