idnabis-tables-04 problem #1: Inconsistencies in category definitions

Patrik Fältström patrik at frobbit.se
Tue Dec 9 12:41:41 CET 2008


Thanks Ken.

If no one objects, I will make these changes.

    Patrik

On 6 dec 2008, at 00.52, Kenneth Whistler wrote:

> Patrik, et al.,
>
> I'm going to provide a series of analyses of detailed
> textual (and technical) problems in tables-04, but I'm
> breaking them out into separate notes, organized by
> micro-topic, so that (hopefully) any follow-up
> discussion won't pick up one nit out of a dozen topics
> and wander with it.
>
> Problem #1: Inconsistencies in category definitions
>
> This has to do specifically with the way each of the
> category definitions in Sections 2.1 through 2.10 is
> stated. To make it easier to follow, I'll cite each
> of the specific wordings below, and then give the
> way I think it needs to be corrected, along with the
> explanation of what is wrong with the current
> formulation (if not self-evident).
>
> Note that for this particular problem, the intent
> here is *not* to change the derivation in any way,
> but merely to fix the imprecision and inconsistency
> in the current formulation of these category definitions.
>
> --Ken
>
> ==========================================================
>
> 2.1 LetterDigits (A)
>
> Current:
>
> A: generalCategory(cp) is in {Ll, Lu, Lo, Nd, Lm, Mn, Mc}
>
> Suggested fix:
>
> A: General_Category(cp) is in {Ll, Lu, Lo, Nd, Lm, Mn, Mc}
>
> Rationale:
>
> As will be evident in more problematical
> formulations in subsequent sections, I think the cleanest
> and most self-evident way for a specification to
> designate a functional operation which returns the
> property value of a particular property for a code point
> is simply to tack "(cp)" onto the formal name of the
> Unicode property, rather than creating an arbitrarily
> named function and then describing (or not describing) in
> subsequent text what it is supposed to mean.
>
> This then entails deleting the following sentence in Section
> 2.1:
>
> "The generalCategory() operation returns the General Category
> for a particular Unicode code point."
>
> And replacing it with a generic statement as the last
> paragraph of Section 2, above Section 2.1:
>
> "In the following specification of categories, the operation
> which returns the value of a particular Unicode character
> property for a code point is designated by using the
> formal name of that property (from PropertyAliases.txt)
> followed by '(cp)'. For example, the value of the
> General_Category property for a code point is indicated
> by General_Category(cp)."
>
> Once you have done that, then the rest of the formulations
> can be done consistently.
>
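
A minimal Python sketch of the naming convention proposed above, with the function named after the formal property and applied to a code point. The helper names and the use of Python's unicodedata module (whose data reflects whatever Unicode version the interpreter ships) are assumptions for illustration only, not part of tables-04.

```python
import unicodedata

# Hypothetical helper: named after the formal property name, per the
# "PropertyName(cp)" convention suggested above.
def General_Category(cp: int) -> str:
    """Return the General_Category value of code point cp."""
    return unicodedata.category(chr(cp))

# The set of General_Category values listed for category A in 2.1.
LETTER_DIGITS_VALUES = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}

def is_letter_digits(cp: int) -> bool:
    """A: General_Category(cp) is in {Ll, Lu, Lo, Nd, Lm, Mn, Mc}"""
    return General_Category(cp) in LETTER_DIGITS_VALUES
```

For example, is_letter_digits(0x0061) holds (U+0061 has General_Category Ll), while is_letter_digits(0x0020) does not (U+0020 has General_Category Zs).
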
> ==========================================================
>
> 2.2 Unstable (B)
>
> Current:
>
> B: toNFKC(toCasefold(toNFKC(cp))) != cp
>
> Suggested fix:
>
> None needed. This is o.k. These are not property names, but
> functional operations defined elsewhere.
>
> However, in the third paragraph, for consistency:
>
> toCaseFold...
> -->
> The toCasefold() operation...
>
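
The category B test can be sketched in Python as follows. This assumes that str.casefold() and unicodedata.normalize("NFKC", ...) are acceptable stand-ins for toCasefold() and toNFKC(); the function names are hypothetical, and results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

def to_nfkc(s: str) -> str:
    # Stand-in for toNFKC(): Normalization Form KC.
    return unicodedata.normalize("NFKC", s)

def to_casefold(s: str) -> str:
    # Stand-in for toCasefold(): Python's full case folding.
    return s.casefold()

def is_unstable(cp: int) -> bool:
    """B: toNFKC(toCasefold(toNFKC(cp))) != cp"""
    s = chr(cp)
    return to_nfkc(to_casefold(to_nfkc(s))) != s
```

Under these assumptions U+0041 is unstable (it folds to U+0061), U+0061 is stable, and U+00DF is unstable (it folds to "ss").
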
> ==========================================================
>
> 2.3 IgnorableProperties (C)
>
> Current:
>
> C: property(cp) is in {Default_Ignorable_Code_Point, White_Space,
>                       Noncharacter_Code_Point}
>
> Suggested:
>
> C: Default_Ignorable_Code_Point(cp) = True
>   or
>   White_Space(cp) = True
>   or
>   Noncharacter_Code_Point(cp) = True
>
> Rationale:
>
> There is currently a false analogy being made in the formulation
> of this category, in the attempt to make all of these
> formulations look like equivalent types of set operations.
> In this case, however, "property(cp)" is not a defined
> property evaluation at all, nor is it well-defined as
> a function on a code point -- Unicode code points actually
> have multiple properties, and *every* code point has a
> property value for White_Space, for example. So the
> current formulation is erroneous, misleading, and
> non-parallel to the use of General_Category(cp) in 2.1.
>
> The current text is implicitly trying to answer the
> question, "Which Unicode character properties, if
> true for a code point, let us decide that the character
> can be treated as 'ignorable' and is thereby appropriate
> for classification as DISALLOWED by the algorithm?"
> But turning *that* question into pseudo-set notation is
> not the correct way to define a set of code points,
> whereas my suggested reformulation is, and is consistent
> then with 2.1.
>
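
To illustrate why each binary property is better treated as its own predicate on a code point, here is a rough Python sketch. Python's standard library does not expose these three properties, so the two SAMPLE_ tables below are tiny hand-picked excerpts (an assumption for illustration, not the full property data); only Noncharacter_Code_Point is computed in full, from the fixed set of 66 noncharacters.

```python
def Noncharacter_Code_Point(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus the last two code points of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

# Illustrative excerpts only; real data comes from the Unicode
# Character Database (DerivedCoreProperties.txt, PropList.txt).
SAMPLE_DEFAULT_IGNORABLE = frozenset({0x00AD, 0x200B})
SAMPLE_WHITE_SPACE = frozenset({0x0020, 0x3000})

def Default_Ignorable_Code_Point(cp: int) -> bool:
    return cp in SAMPLE_DEFAULT_IGNORABLE

def White_Space(cp: int) -> bool:
    return cp in SAMPLE_WHITE_SPACE

def is_ignorable_properties(cp: int) -> bool:
    """C: any of the three binary properties is True for cp."""
    return (Default_Ignorable_Code_Point(cp)
            or White_Space(cp)
            or Noncharacter_Code_Point(cp))
```

Each predicate returns a value for every code point, which is the point made above: there is no single "property(cp)" whose result could be tested for membership in a set of property names.
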
> ==========================================================
>
> 2.4 IgnorableBlocks (D)
>
> Current:
>
> D: block(cp) in {Combining Diacritical Marks for Symbols,
>                 Musical Symbols, Ancient Greek Musical Notation}
>
> Suggested:
>
> D: Block(cp) is in {Combining Diacritical Marks for Symbols,
>                    Musical Symbols, Ancient Greek Musical Notation}
>
> Rationale:
>
> Capitalize "Block" for consistency with other property name
> usages. Add "is" for consistency with the formulation of
> the other set notations for multiple property values.
> Otherwise this is o.k.
>
> ==========================================================
>
> 2.5 LDH (E)
>
> Current:
>
> E: cp is in {002D, 0030..0039, 0061..007A}
>
> O.k. No problem.
>
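
For comparison, a direct membership test on code points such as category E can be written as a plain set lookup. This is a small Python sketch; the names LDH and is_ldh are hypothetical.

```python
# E: cp is in {002D, 0030..0039, 0061..007A}
LDH = frozenset(
    {0x002D}
    | set(range(0x0030, 0x003A))   # 0030..0039 inclusive
    | set(range(0x0061, 0x007B))   # 0061..007A inclusive
)

def is_ldh(cp: int) -> bool:
    return cp in LDH
```

Here the set really is a set of code points, so "cp is in {...}" is the right formulation.
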
> ==========================================================
>
> 2.6 Exceptions (F)
>
> Current:
>
> F: cp in {002D, 00B7, 00DF, 02B9, 0375, 0483, 05F3, 05F4, 06FD,
>          06FE, 0F0B, 3005, 3007, 302E, 302F, 303B, 30FB}
>
> Suggested:
>
> F: cp is in {002D, 00B7, 00DF, 02B9, 0375, 0483, 05F3, 05F4, 06FD,
>             06FE, 0F0B, 3005, 3007, 302E, 302F, 303B, 30FB}
>
> Add "is" for consistency.
>
> ==========================================================
>
> 2.7 BackwardCompatible (G)
>
> Current:
>
> G: cp in {}
>
> Suggested:
>
> G: cp is in {}
>
> Add "is" for consistency.
>
> ==========================================================
>
> 2.8 JoinControl (H)
>
> Current:
>
> H: property(cp) is in {Join_Control}
>
> Suggested:
>
> H: Join_Control(cp) = True
>
> Rationale:
>
> This is the same type of problem as for 2.3.
>
> ==========================================================
>
> 2.9 OldHangulJamo (I)
>
> Current:
>
> I: HangulSyllableType(cp) is in {L, V, T}
>
> Suggested:
>
> I: Hangul_Syllable_Type(cp) is in {L, V, T}
>
> Rationale:
>
> Use the formal name of the property here.
>
> ==========================================================
>
> 2.10 Unassigned (J)
>
> Current:
>
> J: cp is in {Cn} and property(cp) is not in {Noncharacter_Code_Point}
>
> Suggested:
>
> J: General_Category(cp) is in {Cn}
>   and
>   Noncharacter_Code_Point(cp) = False
>
> Rationale:
>
> This definition was doubly inconsistent, because for the first
> part a code point is not *in* a set of values of a property,
> and the second part has the same generic "property(cp)" problem
> noted above for 2.3 and 2.8.
>
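
The suggested formulation for category J maps onto code fairly directly. A Python sketch, assuming unicodedata.category() as a stand-in for General_Category(cp) (it reports Cn for unassigned code points and for noncharacters alike) and a hypothetical helper for the noncharacter test:

```python
import unicodedata

def General_Category(cp: int) -> str:
    return unicodedata.category(chr(cp))

def Noncharacter_Code_Point(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus the last two code points of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def is_unassigned(cp: int) -> bool:
    """J: General_Category(cp) is in {Cn}
          and Noncharacter_Code_Point(cp) = False"""
    return General_Category(cp) in {"Cn"} and not Noncharacter_Code_Point(cp)
```

With this, U+FFFF has General_Category Cn but is excluded because it is a noncharacter, whereas a reserved code point such as U+0378 is classified as unassigned. Which code points are reserved depends on the Unicode version the interpreter ships with.
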
> ==========================================================
>
>
> _______________________________________________
> Idna-update mailing list
> Idna-update at alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/idna-update
>


