idnabis tables feedback

Patrik Fältström patrik at frobbit.se
Sat Feb 16 23:40:54 CET 2008


On 12 feb 2008, at 04.07, Mark Davis wrote:

> http://tools.ietf.org/id/draft-faltstrom-idnabis-tables-04.txt
>
>  o  PROTOCOL VALID: Those that are allowed to be used in IDNs.
>     Codepoints with this property value are permitted for general use

The description in the tables document has now been made as brief as
possible, following Harald's suggestion in a later email.

>     document.  There are two subdivisions of CONTEXTUAL RULE REQUIRED,
>     one for Join_controls (called CONTEXTJ) and one for other
>     characters (called CONTEXTO).  These are discussed in more detail
>
> I don't see why there is a separation between these. Don't they behave
> the same in the protocol? Will have to look at the protocol doc more
> closely.

This is mentioned at least in the appendix of the issues/rationale
document, in the part on the contextual rule registry. The difference
is whether the check has to be done at lookup time or not:

> 3. An indication as to whether the code point requires the rule be  
> processed at lookup time (this indication is equivalent to the  
> difference between "CONTEXTJ" and "CONTEXTO" in the tables document  
> [IDNA200X-Tables]).

John will have to explain this in more detail.

>  The (non-normative) table in Appendix A is derived from data in
>  Unicode 5.0, rather than the earlier Unicode 3.2; this in order to
>  take advantage of the expanded character repertoire and better
>
> Unicode 5.1 is to be released in March. There are some specific
> changes in it that are useful for IDN in terms of reference, so it
> would be much better to use that version. I suggest adding an
> editorial note to the effect that "We expect to update to Unicode 5.1
> before publication."

Added.

> 2.1.1.  Category A - Classes of Codepoints
>
> This is a misleading name for this section. Other things are "Classes
> of characters". I suggest "General Categories Allowed" or something
> like that.

Fixed.

>  B: NFKC(casefold(NFKC(cp))) != cp
>
> =>
> toNFKC(toCaseFolded(toNFKC(cp))) != cp
>
> I'll get you references.

Fixed.
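
As an illustration only (not part of the draft), the check
toNFKC(toCaseFolded(toNFKC(cp))) != cp can be sketched with Python's
standard library. The function names are hypothetical, and the result
depends on the Unicode version bundled with the Python interpreter,
which is an assumption here and is not Unicode 5.0 or 5.1.

```python
import unicodedata


def to_nfkc(s: str) -> str:
    """Apply Unicode Normalization Form KC."""
    return unicodedata.normalize("NFKC", s)


def is_unstable(cp: int) -> bool:
    """Rule B sketch: True if toNFKC(toCaseFolded(toNFKC(cp))) != cp.

    str.casefold() is used as the case-folding step (an assumption:
    it applies the full case folding of Python's own Unicode data).
    """
    ch = chr(cp)
    return to_nfkc(to_nfkc(ch).casefold()) != ch
```

For example, is_unstable(ord("A")) is true because "A" folds to "a",
while is_unstable(ord("a")) is false.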

>  C: property(cp) is in {Default_Ignorable_Code_Point, White_Space}
> =>
>  C: property(cp) is in {Default_Ignorable_Code_Point, White_Space,
> Noncharacter_Code_Point}
>
> Default_Ignorable_Code_Point is actually defined as those code
> points that should be ignored in rendering if they are not supported
> by an implementation. These are basically invisible characters (and
> there are some fixes in U5.1 to make it hew more precisely to that
> definition). The main purpose of adding Default_Ignorable_Code_Point
> is to exclude variation selectors, since it is otherwise covered by
> other rules.
>
> The addition of White_Space is new, and not needed. White_Space
> characters are in either General Category C or Z, both of which are
> excluded by Category A. Now, if you want to keep it, it doesn't hurt
> anything, but it isn't necessary.
>
> As Eric said, we should add Noncharacter_Code_Point explicitly.

Fixed, but White_Space is still there.

I would rather keep it there for clarity than "optimize" it away just
because some rules overlap.
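
As an illustration only, the overlap Mark describes can be checked
with a small sketch. Python's standard library does not expose the
White_Space property, so the list below is copied by hand from a
recent PropList.txt (an assumption; it may differ from Unicode 5.0 or
5.1). The noncharacter test uses the fixed set of 66 code points:
U+FDD0..U+FDEF plus the last two code points of each plane. The
function names are hypothetical.

```python
import unicodedata

# Assumption: White_Space code points, hand-copied from a recent PropList.txt.
WHITE_SPACE = (
    list(range(0x0009, 0x000E))
    + [0x0020, 0x0085, 0x00A0, 0x1680]
    + list(range(0x2000, 0x200B))
    + [0x2028, 0x2029, 0x202F, 0x205F, 0x3000]
)


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for U+xxFFFE / U+xxFFFF in any plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def white_space_outside_c_or_z():
    """List White_Space code points whose General Category is not C* or Z*."""
    return [
        cp for cp in WHITE_SPACE
        if unicodedata.category(chr(cp))[0] not in "CZ"
    ]
```

If white_space_outside_c_or_z() returns an empty list, every listed
White_Space code point is already excluded by General Category, so
keeping White_Space in the rule is redundant but harmless.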

>  While three of the characters (02B9, 0483 and 0375), plus Geresh and
>  Gershayim, appear to be special rules based on picking characters one
>  at a time, they actually reflect a character property that is not
>  (yet) defined for Unicode.  That character property might be
>  described as "indicates a numeric use in a script for which numbers
>  are represented by treating the letters (in collation order) as
>  digits".  Were that property to be created, these characters could be
>  removed from Category F and assigned to a separate category based on
>  the property.
>
> This paragraph seems to imply that the consortium is on the path to
> defining a property like "indicates a numeric use..." and that that
> property encompasses those characters.
>
> There's been no proposal for any such property, nor do I think it
> would be of particular interest. As far as I know, the usage for
> marking numbers is archaic, and certainly not needed for identifiers.
> The "collation" order is misleading, since it is not modern collation
> order -- and at least for Greek, the number system requires the use of
> archaic characters. Ken might comment more.
>
> Now, I don't think these characters really belong, but I'll comment on
> that elsewhere. Just in terms of cleaning up the paragraph so as to
> remove the problematic parts, you could replace the above by something
> like:
>
> =>
> The following characters are used in different scripts to indicate
> that an adjacent letter is being used with a numeric value.
>
> U+02B9 ( ʹ ) MODIFIER LETTER PRIME
> U+0375 ( ͵ ) GREEK LOWER NUMERAL SIGN
> U+0483 ( ҃ ) COMBINING CYRILLIC TITLO

The paragraph now reads:

> The characters 02B9, 0375 and 0483 are used in different scripts
> to indicate that an adjacent letter is being used with a numeric
> value.
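
As an illustration only, the three code points can be looked up with
Python's standard library to confirm their names and General
Categories. The helper name is hypothetical, and the data comes from
the Unicode version bundled with the interpreter (an assumption).

```python
import unicodedata

# The three code points named in the paragraph above.
NUMERIC_MARKS = (0x02B9, 0x0375, 0x0483)


def describe(cp: int) -> str:
    """Return 'U+XXXX NAME (General Category)' for one code point."""
    ch = chr(cp)
    return f"U+{cp:04X} {unicodedata.name(ch)} ({unicodedata.category(ch)})"


if __name__ == "__main__":
    for cp in NUMERIC_MARKS:
        print(describe(cp))
```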

> 2.2.5.  Category I - Require special treatment in Lookup and extended
>       special treatment in Resolution
>
>  I: generalCategory(cp) is in {Cf}
>
> You should remove this. There is no need for any other Cf characters
> than the Join controls, as we discussed in January.

I was not at the meeting, and I do not recall a clear conclusion on
this issue.

What do others think?

This rule is still in -05a, the version I am working on.
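
As an illustration only, the size of the question can be sketched by
listing the Cf code points other than the two join controls (U+200C
ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER). The function
names are hypothetical, and the General Category data is that of the
Python interpreter's bundled Unicode version (an assumption).

```python
import unicodedata

# The Join_Control property covers exactly these two code points.
JOIN_CONTROLS = frozenset({0x200C, 0x200D})


def in_category_i(cp: int) -> bool:
    """Rule I as quoted: generalCategory(cp) is in {Cf}."""
    return unicodedata.category(chr(cp)) == "Cf"


def is_join_control(cp: int) -> bool:
    return cp in JOIN_CONTROLS


def other_cf(limit: int = 0x110000):
    """Cf code points below `limit` that are not join controls."""
    return [
        cp for cp in range(limit)
        if in_category_i(cp) and not is_join_control(cp)
    ]
```

Everything returned by other_cf() is what would drop out of Category I
if the rule were narrowed to the join controls; U+00AD SOFT HYPHEN is
one such code point.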

> Section 3.
> Right now, you have single-letter names like Category A, and use them
> in Section 3 for calculation. So we see text like:
>
>  o  If the codepoint is in Category F (Section 2.2.2), the value is
>     according to the table in Section 2.2.2.
>  o  If the codepoint is in Category G (Section 2.2.3), the value is
>     according to the table in Section 2.2.3.
> ...
>  o  If the codepoint is not in Category A (Section 2.1.1), the value
>     is DISALLOWED.
>
> In this day and age, we don't have to save on letters in a
> specification ;-). I've remarked on this several times, as have
> others. We haven't seen any reasons given for hewing to Fortran-style
> variable names!
>
> Please rewrite the category names to use meaningful names instead of
> single letters. There is also some old language like "   First the
> special cases.  If there is a match, do not go to the second phase.
> ...according to the table", which is no longer applicable given the
> restructuring after the January meeting.
>
> Thus the above would become:
>
>  o  If the codepoint is in the Category Exceptions (Section 2.2.2),
> the value is PVALID.
>  o  If the codepoint is in Category Backward_Compatibility (Section
> 2.2.3), the value is PVALID.
> ...
>
> and so on.

The "old language" is now gone. The categories still have single-letter
names only, as I am (obviously) an old fart who does not always like
generic names: by themselves they might add a value judgement that is
not really needed. Single letters also make it easier to mark in the
tables (with single characters) which rules match.
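
As an illustration only, the ordered evaluation quoted from Section 3
can be sketched as below. The tables for Category F and Category G are
passed in as parameters and are empty by default, since their contents
are not reproduced in this message. The set of General Categories used
for Category A is an assumption made for the example (Section 2.1.1 of
the draft is authoritative), and the rules elided by "..." in the
quoted text are not filled in: the function returns None where those
rules would decide.

```python
import unicodedata

# Assumption: General Categories admitted by Category A. Illustrative only.
CATEGORY_A = frozenset({"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"})


def derived_value(cp, category_f=None, category_g=None):
    """Sketch of the quoted rule order; earlier rules take precedence."""
    category_f = category_f or {}  # hypothetical table for Section 2.2.2
    category_g = category_g or {}  # hypothetical table for Section 2.2.3
    if cp in category_f:
        return category_f[cp]
    if cp in category_g:
        return category_g[cp]
    # The rules elided by "..." in the quoted text would be tested here.
    if unicodedata.category(chr(cp)) not in CATEGORY_A:
        return "DISALLOWED"
    return None  # left to the rules not shown in this sketch
```

With empty tables, derived_value(0x00B7) gives "DISALLOWED" because
the General Category of U+00B7 (Po) is not in the assumed set. A
hypothetical Category F entry such as {0x00B7: "PVALID"} takes
precedence because it is tested first.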

Do more people want these names changed?

    Patrik
