New version, draft-faltstrom-idnabis-tables-02.txt, available

Kenneth Whistler kenw at sybase.com
Fri Jun 8 02:29:52 CEST 2007


Patrik,

> I want you to specifically comment on the definitions of the rules  
> used to select codepoints in section 2, 

Section 2.2, Rule B - Normalization, contains some incorrect
statements about the data files. In particular:

"Normalization rules are found in UnicodeData.txt in the sixth column."

That is not true. Those are not rules. What is in the sixth
column is the decomposition *mapping* for a character (if
not trivially mapped to itself). And that mapping is then
used in the set of rules in UAX #15 that define the various
normalization forms.

So the following statement is also not true:

"The data (sixth column) include both the normalization and, ..."

The decomposition mapping for a character is not necessarily
the same as its "normalization", for any of the 4 defined
normalization forms. That is because normalization is
defined in terms of the recursive application of all
decomposition mappings.
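
To make the distinction concrete, here is a minimal sketch in Python. It assumes the standard unicodedata module, whose data version depends on the Python build and is not necessarily the version the draft targets. The helper names are hypothetical, and the recursion deliberately omits canonical reordering and the algorithmic Hangul decomposition that UAX #15 also requires.

```python
import unicodedata


def code_points(s: str) -> str:
    """Render a string as space-separated hex code points."""
    return " ".join(f"{ord(c):04X}" for c in s)


def recursive_canonical_decomposition(ch: str) -> str:
    """Recursively apply canonical decomposition mappings.

    Illustration only: no canonical reordering, no Hangul handling.
    """
    mapping = unicodedata.decomposition(ch)
    # An empty mapping means the character maps to itself; a leading
    # "<tag>" marks a compatibility mapping, which is not canonical.
    if not mapping or mapping.startswith("<"):
        return ch
    return "".join(
        recursive_canonical_decomposition(chr(int(cp, 16)))
        for cp in mapping.split()
    )


# LATIN CAPITAL LETTER U WITH DIAERESIS AND MACRON
ch = "\u01D5"
single_step = unicodedata.decomposition(ch)          # the sixth-column value
recursive = recursive_canonical_decomposition(ch)    # mappings applied recursively
nfd = unicodedata.normalize("NFD", ch)               # the library's own NFD

print("decomposition mapping:", single_step)
print("recursive application:", code_points(recursive))
print("NFD:                  ", code_points(nfd))
```

For this character the sixth-column value stops at U+00DC U+0304, while NFD continues down to U+0055 U+0308 U+0304, which is why the column cannot be described as containing "the normalization".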

A similar confusion is in the wording of the paragraph
discussing LATIN SMALL LETTER L WITH MIDDLE DOT:

"... while the normalized data is U+006C U+00B7 ..."

Actually, <U+006C, U+00B7> is the (compatibility) decomposition mapping.
The normalized form of LATIN SMALL LETTER L WITH MIDDLE DOT
varies, depending on the normalization form chosen.
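
A short check along the same lines, again only a sketch assuming Python's standard unicodedata module (the variable names are hypothetical), shows the decomposition mapping next to each of the four normalization forms for this character:

```python
import unicodedata

# LATIN SMALL LETTER L WITH MIDDLE DOT
L_WITH_MIDDLE_DOT = "\u0140"

# The sixth-column value, including the compatibility tag.
mapping = unicodedata.decomposition(L_WITH_MIDDLE_DOT)

# The normalized form under each of the four normalization forms.
forms = {
    name: unicodedata.normalize(name, L_WITH_MIDDLE_DOT)
    for name in ("NFC", "NFD", "NFKC", "NFKD")
}

print("decomposition mapping:", mapping)
for name, value in forms.items():
    print(name, " ".join(f"U+{ord(c):04X}" for c in value))
```

Because the mapping is tagged as a compatibility mapping, only the compatibility forms (NFKC and NFKD) yield <U+006C, U+00B7>; NFC and NFD leave U+0140 unchanged.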

Section 2.3. Rule C - Casefolding.

The data that specifies whether a rule in SpecialCasing.txt
is conditional is actually in the *5th* column of SpecialCasing.txt,
not the *6th* column.
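
As a rough illustration of the field layout, the following Python sketch splits an entry on semicolons and reads the condition from the fifth field. The function name and the returned dictionary keys are hypothetical, and the two sample entries are reproduced here for illustration only; they should be checked against the actual data file.

```python
# Two sample entries in the SpecialCasing.txt layout:
#   code; lower; title; upper; (condition_list;)? # comment
SAMPLE_UNCONDITIONAL = (
    "00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S"
)
SAMPLE_CONDITIONAL = (
    "03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA"
)


def parse_special_casing_line(line):
    """Parse one entry; return None for blank or comment-only lines."""
    data = line.split("#", 1)[0]               # drop the trailing comment
    fields = [f.strip() for f in data.split(";")]
    if len(fields) < 5:                        # not a data entry
        return None
    code, lower, title, upper = fields[:4]
    condition = fields[4]                      # 5th field; empty if unconditional
    return {
        "code": code,
        "lower": lower.split(),
        "title": title.split(),
        "upper": upper.split(),
        "condition": condition,
    }


for sample in (SAMPLE_UNCONDITIONAL, SAMPLE_CONDITIONAL):
    entry = parse_special_casing_line(sample)
    kind = "conditional" if entry["condition"] else "unconditional"
    print(entry["code"], kind, entry["condition"])
```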

Section 2.4. Rule D - Ignorables

For the statement of the derivation of Default_Ignorable_Code_Point,
please see the erratum of 2007-January-25 posted at:

http://www.unicode.org/errata/

The correct statement is:

Other_Default_Ignorable_Code_Point + Cf + Cc + Cs
+ Noncharacter_Code_Point + Variation_Selector - White_Space
- FFF9..FFFB (Annotation Characters)

(Variation_Selector was inadvertently left out of the description
of the derivation, although it is correctly part of the actual
derivation of the list in the data file.)

And the statement "Noncharacters is a property only existing in the
NamesList.txt." is unnecessary. The normative listing of the
Noncharacter_Code_Point property is in PropList.txt. And that
is the property used for derivation.
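
The corrected derivation can be written directly as set arithmetic. The Python sketch below is only an illustration of the formula as stated above: the function name and its parameters are hypothetical, each argument is assumed to be a set of integer code points already extracted from the matching version of the data files (PropList.txt for the binary properties, UnicodeData.txt for the general categories), and the tiny input sets at the end are toy values, not real property data.

```python
# FFF9..FFFB, the annotation characters excluded by the derivation.
ANNOTATION_CHARACTERS = set(range(0xFFF9, 0xFFFB + 1))


def derive_default_ignorable(other_default_ignorable, cf, cc, cs,
                             noncharacter, variation_selector, white_space):
    """Apply the derivation formula to sets of integer code points."""
    included = (
        other_default_ignorable
        | cf
        | cc
        | cs
        | noncharacter
        | variation_selector   # the term restored by the erratum
    )
    return included - white_space - ANNOTATION_CHARACTERS


# Toy inputs, chosen only to exercise each term of the formula.
result = derive_default_ignorable(
    other_default_ignorable={0x034F},
    cf={0x00AD, 0xFFF9},
    cc={0x0000, 0x0009},
    cs={0xD800},
    noncharacter={0xFFFE},
    variation_selector={0xFE00},
    white_space={0x0009, 0x0020},
)
print(sorted(f"{cp:04X}" for cp in result))
```

In the toy run, U+0009 drops out because it is White_Space and U+FFF9 drops out as an annotation character, while U+FE00 is included only because of the Variation_Selector term.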


> and the algorithm of how to  
> calculate the value of the derived property in section 3.

I'll provide feedback on that separately. I don't think the
specification of Rule H (distinguishing Latin, Greek, and
Cyrillic as "Stable scripts" in contradistinction to all
other scripts) makes sense -- and so the algorithm, which
makes distinctions based very prominently on Rule H, is,
in my opinion, overly complex and unclear.

> 
> On top of that of course also a comparison of your results when doing  
> the same calculations with the result I got with my code in section 4.

I can, of course, grab the I-D text and do a bunch of editing,
in an attempt to turn the listing in Section 4.1 into
something that is machine-readable. But it would be really
helpful for these kinds of evaluations if the machine readable
form of these tables were simply posted in a specified location
for use in comparisons of results. That would avoid the
extra work of repeated manual editing to extract the data, as well
as the probable introduction of extraneous errors
from the manual editing itself.

Regards,

--Ken


