idnabis tables feedback

Tue Feb 12 04:07:57 CET 2008

http://tools.ietf.org/id/draft-faltstrom-idnabis-tables-04.txt

   o  PROTOCOL VALID: Those that are allowed to be used in IDNs.
      Codepoints with this property value are permitted for general use

We were calling this just ALLOWED, which seems a better name -- and is
what is actually used in the first line to define it: "Those that are
allowed to be used in IDNs."

And using ALLOWED means that we don't also need the abbreviation
PVALID, and it is then parallel to DISALLOWED. One could argue that
ALLOWED is not strictly accurate, since the CONTEXT characters are
also allowed (under constraints), but VALID has the same issue. Why
change to PROTOCOL VALID / PVALID? And if that change stands, why not
change DISALLOWED to INVALID?

      Once assigned to this category, a character is
      never removed from it unless it is removed from Unicode.

I believe that this sentence is a holdover: as I understand it,
characters can move from this category to CONTEXT (or DISALLOWED) in
case of disaster.

      document.  There are two subdivisions of CONTEXTUAL RULE REQUIRED,
      one for Join_controls (called CONTEXTJ) and and for other
      characters (called CONTEXTO).  These are discussed in more detail

I don't see why there is a separation between these. Don't they behave
the same in the protocol? Will have to look at the protocol doc more
closely.

   The (non-normative) table in Appendix A is derived from data in
   Unicode 5.0, rather than the earlier Unicode 3.2; this in order to
   take advantage of the expanded character repertoire and better

Unicode 5.1 is to be released in March. There are some specific
changes in it that are useful for IDN in terms of reference, so it
would be much better to use that version. I suggest adding an
editorial note to the effect that "We expect to update to Unicode 5.1
before publication."

2.1.1.  Category A - Classes of Codepoints

This is a misleading name for this section, since the other categories
are also "classes of characters". I suggest "General Categories
Allowed" or something like that.

   B: NFKC(casefold(NFKC(cp))) != cp

=>
toNFKC(toCaseFolded(toNFKC(cp))) != cp

I'll get you references.
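
For illustration only, here is a minimal Python sketch of test B. It
assumes that Python's unicodedata.normalize("NFKC", ...) and
str.casefold() are acceptable stand-ins for toNFKC and toCaseFolded;
the name is_unstable is mine, not the draft's, and the results follow
whatever Unicode version the interpreter ships with, not 5.0 or 5.1.

```python
import unicodedata


def is_unstable(cp: int) -> bool:
    """Sketch of test B: toNFKC(toCaseFolded(toNFKC(cp))) != cp.

    Assumption: str.casefold() stands in for toCaseFolded and
    unicodedata.normalize("NFKC", ...) stands in for toNFKC.
    """
    ch = chr(cp)
    inner = unicodedata.normalize("NFKC", ch)
    folded = inner.casefold()
    outer = unicodedata.normalize("NFKC", folded)
    return outer != ch


if __name__ == "__main__":
    for cp in (0x0041, 0x0061, 0x00DF, 0xFF21):
        print(f"U+{cp:04X} unstable={is_unstable(cp)}")
```

A code point for which this returns True is changed by the mapping, so
it falls into the category that test B defines.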

   C: property(cp) is in {Default_Ignorable_Code_Point, White_Space}
=>
   C: property(cp) is in {Default_Ignorable_Code_Point, White_Space,
      Noncharacter_Code_Point}

Default_Ignorable_Code_Point is actually defined as those code points
that should be ignored in rendering if they are not supported by an
implementation. They are basically the invisible characters (and
there are some fixes in U5.1 to make the property hew more precisely
to that definition). The main purpose of adding
Default_Ignorable_Code_Point is to exclude variation selectors, since
the other characters with that property are already covered by other
rules.

The addition of White_Space is new, and not needed. All White_Space
characters have General Category C or Z, both of which are excluded by
Category A. Now, if you want to keep it, it doesn't hurt anything, but
it isn't necessary.
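
A quick way to check that claim is sketched below in Python. Python's
unicodedata module does not expose the White_Space property, so the
code point list is hard-coded; that list is an assumption taken from
current Unicode data and may differ slightly from the 5.0 or 5.1 set.
The names WHITE_SPACE and white_space_categories are mine.

```python
import unicodedata

# Assumption: White_Space code points as listed in current Unicode data
# (hard-coded, because unicodedata has no White_Space lookup).
WHITE_SPACE = (
    list(range(0x0009, 0x000E))      # TAB, LF, VT, FF, CR
    + [0x0020, 0x0085, 0x00A0, 0x1680]
    + list(range(0x2000, 0x200B))    # EN QUAD .. HAIR SPACE
    + [0x2028, 0x2029, 0x202F, 0x205F, 0x3000]
)


def white_space_categories() -> set:
    """Return the set of General Category values seen in WHITE_SPACE."""
    return {unicodedata.category(chr(cp)) for cp in WHITE_SPACE}


if __name__ == "__main__":
    # Every value should start with "C" or "Z".
    print(sorted(white_space_categories()))
```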

As Eric said, we should add Noncharacter_Code_Point explicitly.
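
For reference, Noncharacter_Code_Point is a small fixed set:
U+FDD0..U+FDEF plus the last two code points of each of the 17 planes.
A minimal Python sketch of the membership test follows; the function
name is_noncharacter is hypothetical.

```python
def is_noncharacter(cp: int) -> bool:
    """True if cp has the Noncharacter_Code_Point property.

    The set is U+FDD0..U+FDEF plus every code point whose low 16 bits
    are FFFE or FFFF (U+FFFE, U+FFFF, U+1FFFE, ... U+10FFFF).
    """
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


if __name__ == "__main__":
    total = sum(is_noncharacter(cp) for cp in range(0x110000))
    print(total)  # 32 + 17 * 2 code points
```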

   While three of the characters (02B9, 0483 and 0375), plus Geresh and
   Gershayim, appear to be special rules based on picking characters one
   at a time, they actually reflect a character property that is not
   (yet) defined for Unicode.  That character property might be
   described as "indicates a numeric use in a script for which numbers
   are represented by treating the letters (in collation order) as
   digits".  Were that property to be created, these characters could be
   removed from Category F and assigned to a separate category based on
   the property.

This paragraph seems to imply that the consortium is on a path to
defining a property like "indicates a numeric use..." and that such a
property would encompass those characters.

There's been no proposal for any such property, nor do I think it
would be of particular interest. As far as I know, the usage for
marking numbers is archaic, and certainly not needed for identifiers.
The "collation" order is misleading, since it is not modern collation
order -- and at least for Greek, the number system requires the use of
archaic characters. Ken might comment more.

Now, I don't think these characters really belong, but I'll comment on
that elsewhere. Just in terms of cleaning up the paragraph so as to
remove the problematic parts, you could replace the above by something
like:

=>
The following characters are used in different scripts to indicate
that an adjacent letter is being used with a numeric value.

U+02B9 ( ʹ ) MODIFIER LETTER PRIME
U+0375 ( ͵ ) GREEK LOWER NUMERAL SIGN
U+0483 ( ҃ ) COMBINING CYRILLIC TITLO

2.2.5.  Category I - Require special treatment in Lookup and extended
        special treatment in Resolution

   I: generalCategory(cp) is in {Cf}

You should remove this. There is no need for any other Cf characters
than the Join controls, as we discussed in January.

Section 3.
Right now, you have single-letter names like Category A, and use them
in Section 3 for calculation. So we see text like:

   o  If the codepoint is in Category F (Section 2.2.2), the value is
      according to the table in Section 2.2.2.
   o  If the codepoint is in Category G (Section 2.2.3), the value is
      according to the table in Section 2.2.3.
...
   o  If the codepoint is not in Category A (Section 2.1.1), the value
      is DISALLOWED.

In this day and age, we don't have to save on letters in a
specification ;-). I've remarked on this several times, as have
others. We haven't seen any reasons given for hewing to Fortran-style
variable names!

Please rewrite the category names to use meaningful names instead of
single letters. There is also some old language like "First the
special cases.  If there is a match, do not go to the second phase.
...according to the table", which is no longer applicable given the
restructuring after the January meeting.

Thus the above would become:

   o  If the codepoint is in Category Exceptions (Section 2.2.2), the
      value is PVALID.
   o  If the codepoint is in Category Backward_Compatibility (Section
      2.2.3), the value is PVALID.
...

and so on.
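
To show how named categories read in the calculation, here is a rough
Python sketch of an ordered, first-match-wins rule list. Everything in
it is an assumption made for illustration: the set contents are
placeholders (the Exceptions set merely reuses the five characters
mentioned earlier, and the allowed General Category list is not the
draft's), the rule names other than Exceptions and
Backward_Compatibility are mine, and the rules elided by "..." above
are not modelled.

```python
import unicodedata

# Placeholder memberships for illustration only; the real sets are
# whatever the draft defines in Sections 2.1.1, 2.2.2 and 2.2.3.
EXCEPTIONS = {0x02B9, 0x0375, 0x0483, 0x05F3, 0x05F4}
BACKWARD_COMPATIBILITY = set()
ALLOWED_GENERAL_CATEGORIES = {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}

# Ordered rules as (category name, membership test, derived value).
RULES = [
    ("Exceptions", lambda cp: cp in EXCEPTIONS, "PVALID"),
    ("Backward_Compatibility",
     lambda cp: cp in BACKWARD_COMPATIBILITY, "PVALID"),
    ("Not_Allowed_General_Category",
     lambda cp: unicodedata.category(chr(cp))
     not in ALLOWED_GENERAL_CATEGORIES,
     "DISALLOWED"),
]


def derive(cp: int):
    """Return (category name, value) for the first matching rule.

    Returns None when no rule shown here matches, i.e. the code point
    falls through to rules this sketch does not model.
    """
    for name, matches, value in RULES:
        if matches(cp):
            return name, value
    return None


if __name__ == "__main__":
    for cp in (0x02B9, 0x0020, 0x0061):
        print(f"U+{cp:04X}", derive(cp))
```

With names in the rule table, the result carries its own explanation
("Exceptions", not "F"), which is the point of the request above.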
-- 
Mark