Table Analysis for draft-faltstrom-idnabis-tables-02.txt

Tue Jun 12 22:32:37 CEST 2007

Patrik,

> On top of that of course also a comparison of your results when doing  
> the same calculations with the result I got with my code in section 4.

O.k., today I provide the detailed comparison of the ALWAYS and
NEVER values in draft-faltstrom-idnabis-tables-02.txt, Section 4.1,
with the values I have proposed in my most recent postings of
IDNPermitted.txt and IDNNever.txt.

This comparison goes beyond a simple reverification of the
calculations specified in the current draft,
draft-faltstrom-idnabis-tables-02.txt. In the course of
the comparisons below, I also point out a number of what I think
are mistakes in the handling of either normalization or
casefolding, where the values in the table in
Section 4.1 depart even from the intent of the algorithm
as stated.

Based on the wording in the I-D that the "derived property
[identifies] groups of characters:"

o Those that should clearly be included in IDNs
o Those that should clearly not be included in IDNs
o Those where no final determination can be made at this time

I make below the comparison between the I-D value ALWAYS
and my draft of the IDN_Permitted property, and the I-D
value NEVER and my draft of the IDN_Never property.
That is because the intent of IDN_Permitted is to identify
those characters "that should clearly be included in IDNs"
and the intent of IDN_Never is to identify those
characters "that should clearly not be included in IDNs."

I'll start first with ALWAYS.

===================================================================

ALWAYS

By rule G, you get the ASCII legacy exceptional inclusions:

002D HYPHEN-MINUS
0030 DIGIT ZERO
0041 LATIN CAPITAL LETTER A
etc.

By rule H & A & ~X you get LGC lowercase letters.

0061 LATIN SMALL LETTER A
etc.

For Latin, Greek, and Cyrillic, this overlaps extensively
with IDN_Permitted. The differences are specified in the
following two sections.

------------------------------------------------------------------

IDN_Permitted and not ALWAYS:

A. IDN_Permitted exception list:

00B7 MIDDLE DOT
200C ZERO WIDTH NON-JOINER
200D ZERO WIDTH JOINER
30FB KATAKANA MIDDLE DOT

(Also omitted from ALWAYS are 05F3 HEBREW PUNCTUATION GERESH,
05F4 HEBREW PUNCTUATION GERSHAYIM, and 3007 IDEOGRAPHIC DIGIT ZERO,
but those are excluded from ALWAYS by their script in any case.)

Discussion. These are the exceptions (along with ASCII) noted in
SPInclusionAdd070308.txt, for which there are good reasons to
allow them in IDNs, despite the fact that they aren't covered
by the general category criterion.

B. Mistakes in normalization in draft-faltstrom-idnabis-tables-02.txt:

01D6 LATIN SMALL LETTER U WITH DIAERESIS AND MACRON
01D8 LATIN SMALL LETTER U WITH DIAERESIS AND ACUTE
01DA LATIN SMALL LETTER U WITH DIAERESIS AND CARON
01DC LATIN SMALL LETTER U WITH DIAERESIS AND GRAVE
01DF LATIN SMALL LETTER A WITH DIAERESIS AND MACRON
01E1 LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON
01ED etc.
01FB
022B
022D
0231
1E09 etc. in Latin Extended Additional and Greek Extended

Discussion. These seem to be the result of a bug in the way
the normalization instability criterion, Rule B, NFKC(cp) != cp,
was implemented in producing the table. All of these
two-accent letters are stable under NFKC(cp), and by the
stated criteria should have the value ALWAYS. Instead, they
erroneously ended up with NEVER.
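
As a quick way to reproduce this check, here is a minimal sketch using
Python's standard unicodedata module. The helper name is_nfkc_stable is
mine, not something from the I-D, and the result reflects whatever
Unicode version the Python build bundles rather than the version used
to generate the table in Section 4.1.

```python
import unicodedata

def is_nfkc_stable(cp: int) -> bool:
    """Return True when NFKC(cp) == cp, i.e. Rule B does not fire."""
    ch = chr(cp)
    return unicodedata.normalize("NFKC", ch) == ch

# The code points listed explicitly in item B above.
two_accent_letters = [0x01D6, 0x01D8, 0x01DA, 0x01DC, 0x01DF, 0x01E1,
                      0x01ED, 0x01FB, 0x022B, 0x022D, 0x0231, 0x1E09]

for cp in two_accent_letters:
    name = unicodedata.name(chr(cp))
    print(f"{cp:04X} {name}: stable={is_nfkc_stable(cp)}")
```

Each of these letters decomposes canonically and then recomposes to
itself, so the check reports them as stable.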

C. Modifier letters:

02B9..02C1
02C6..02D1
02EE

Discussion. These were omitted from ALWAYS because they
are script=Common, instead of script=Latin. The reason they
are script=Common is that they may be used with some scripts
other than Latin and aren't themselves based on Latin letterforms,
but some of them at least are definitely used with significant
Latin orthographies. For example, U+02BB is the "okina" (glottal
stop) used in Hawai'ian and a number of other Pacific Island
orthographies.

D. Combining marks:

0300..033F
0342
0346..0362
1DC2
1DC4..1DCA
1DFE..1DFF

Discussion. These also ended up omitted from ALWAYS because they
are script=Common, instead of script=Latin. Again, the reason
is that they may be (and often are) used with scripts other than
Latin. And in fact a number of these combining marks are
used with Latin, Greek, *and* Cyrillic.

There is a deeper problem with omitting these combining marks
from ALWAYS, however, and that is that it results in canonically
equivalent sequences of characters being assessed differently
as to the derived property status. A simple case in point:
U+00E0 LATIN SMALL LETTER A WITH GRAVE is given the value
ALWAYS, but its canonically equivalent sequence of
<U+0061 LATIN SMALL LETTER A, U+0300 COMBINING GRAVE ACCENT>
would be assessed as <ALWAYS, MAYBE YES>. If this property
is to be used to determine the status of strings as valid
for IDNs in the context of the larger IDNAbis definition,
then having canonically equivalent sequences evaluated
differently is very bad for the algorithm.
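
The canonical equivalence in this example can be confirmed with a
short sketch in Python. The helper canonically_equivalent is a name I
made up for illustration; it just compares the NFD forms of the two
strings.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent when their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "\u00E0"        # LATIN SMALL LETTER A WITH GRAVE
decomposed = "\u0061\u0300"   # LATIN SMALL LETTER A + COMBINING GRAVE ACCENT

print(precomposed == decomposed)                        # different code point sequences
print(canonically_equivalent(precomposed, decomposed))  # same under normalization
```

A per-code-point property lookup sees the first string as one character
and the second as two, which is how the two equivalent spellings come
to be assessed differently.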

E. Various scripts:

Armenian, Hebrew, Arabic, Thaana, Nko, Devanagari, Bengali,
Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam,
Sinhala, Thai, Lao, Tibetan, Myanmar, Georgian, Hangul,
Ethiopic, Cherokee, Canadian Syllabics, Buhid, Khmer,
Mongolian, Limbu, Tai Le, New Tai Lue, Balinese, Tifinagh,
Hiragana, Katakana, Bopomofo, Han, Yi

Discussion. These omissions are by intent, of course, by
application of Rule H. The list is provided here for
documentation, for completeness in noting the differences
between IDN_Permitted and the ALWAYS value from the I-D.

------------------------------------------------------------------

ALWAYS and not IDN_Permitted:

F. Mistake in casefolding in draft-faltstrom-idnabis-tables-02.txt:

0130 LATIN CAPITAL LETTER I WITH DOT ABOVE

Discussion. The value of ALWAYS for 0130 appears to be a mistake
in the table generation. 0130 case folds to <0069, 0307>, so
should be excluded by Rule C.
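
For anyone who wants to reproduce the folding, a minimal sketch in
Python follows. It assumes that Python's str.casefold(), which applies
Unicode full case folding, matches the casefold(cp) operation intended
by Rule C; the I-D's generating code may differ.

```python
# Full case folding of U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE.
folded = "\u0130".casefold()

# Show the result as a list of code points.
code_points = [f"{ord(c):04X}" for c in folded]
print(code_points)

# The character is not stable under case folding, so Rule C should fire.
is_casefold_stable = (folded == "\u0130")
print(is_casefold_stable)
```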

G. Mistake in normalization in draft-faltstrom-idnabis-tables-02.txt:

037A GREEK YPOGEGRAMMENI --> IDN_Never

Discussion. This appears to be another mistake in the table
generation -- this time involving normalization. U+037A is
not stable under NFKC: NFKC(037A) = <0020, 0345>. As for all
characters not stable under NFKC, it appears in IDN_Never,
and NEVER is the appropriate assignment for the I-D.
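
The same kind of spot check works here. This is a small sketch using
Python's unicodedata module, subject to the Unicode version bundled
with the interpreter.

```python
import unicodedata

ypogegrammeni = "\u037A"  # GREEK YPOGEGRAMMENI
nfkc = unicodedata.normalize("NFKC", ypogegrammeni)

# List the code points of the normalized form.
nfkc_code_points = [f"{ord(c):04X}" for c in nfkc]
print(nfkc_code_points)

# Rule B condition: NFKC(cp) != cp.
print(nfkc != ypogegrammeni)
```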

===================================================================

O.k., now on to the assessment of the comparison between
IDN_Never and the value of NEVER in draft-faltstrom-idnabis-tables-02.txt.

===================================================================

NEVER

By rule H & (~A | X) you get various exclusions.

Wrong general category:

0482 CYRILLIC THOUSANDS SIGN   (gc=So)
0488 COMBINING CYRILLIC HUNDRED THOUSANDS SIGN  (gc=Me)
etc.

Unstable under NFKC(cp):

00AA FEMININE ORDINAL INDICATOR
etc.

Unstable under casefold(cp):

00C0 LATIN CAPITAL LETTER A WITH GRAVE
etc.

Ignorable:

{empty set}

Discussion. There are no characters which can meet the
criterion of being Latin, Greek, or Cyrillic script and
also being Default_Ignorable_Code_Point, so the set of
characters which would be specified by the application
of the rules H & D is necessarily empty.
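
The three non-empty criteria above can be spot-checked against the
example characters with a sketch like the following. It uses only
Python's standard library, which does not expose the Script or
Default_Ignorable_Code_Point properties, so Rule H and the Ignorable
criterion are not modeled here. I also assume str.casefold() stands in
for casefold(cp).

```python
import unicodedata

# Wrong general category: neither character is a letter, digit, or
# nonspacing/spacing mark.
gc_thousands = unicodedata.category("\u0482")          # CYRILLIC THOUSANDS SIGN
gc_hundred_thousands = unicodedata.category("\u0488")  # COMBINING CYRILLIC HUNDRED THOUSANDS SIGN
print(gc_thousands, gc_hundred_thousands)

# Unstable under NFKC(cp).
ordinal = "\u00AA"  # FEMININE ORDINAL INDICATOR
nfkc_unstable = unicodedata.normalize("NFKC", ordinal) != ordinal
print(nfkc_unstable)

# Unstable under casefold(cp).
a_grave = "\u00C0"  # LATIN CAPITAL LETTER A WITH GRAVE
casefold_unstable = a_grave.casefold() != a_grave
print(casefold_unstable)
```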

For Latin, Greek, and Cyrillic, there is some overlap
with IDN_Never. Here are the differences.

------------------------------------------------------------------

IDN_Never and NEVER:

A. Latin letters and modifier letters unstable under NFKC(cp)

Discussion. This is the intersection, where the two tables
agree.

------------------------------------------------------------------

IDN_Never and not NEVER:

B. Symbols, Punctuation, Numerics (non-digits), Format Controls, 
ISO Controls, Noncharacters

Discussion. IDN_Never takes a stronger stance on these, viewing
all of these categories as permanently inappropriate for IDNs.
The I-D ends up placing these in MAYBE NOT, even though
some, such as ISO Controls and Noncharacters are clearly
inappropriate for IDNs by anyone's assessment.

C. Combining marks unstable under NFKC(cp)

0341 COMBINING ACUTE TONE MARK (the deprecated duplicate for Vietnamese)
etc.

Discussion. These should end up in the NEVER status, because
they are unstable under NFKC(cp). Instead they end up MAYBE NOT
because they are not specifically Latin, Greek, or Cyrillic script.

D. Miscellaneous:

00B5 MICRO SIGN

Discussion. This is similar to the combining marks issue. This
is gc=Ll and script=Common, so escapes getting assigned NEVER,
even though it is unstable under NFKC(cp).
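
The facts behind items C and D can be checked with a short sketch in
Python. The general category and NFKC results come from the standard
unicodedata module; the script values (Common for 00B5, not
Latin/Greek/Cyrillic for 0341) are taken from the discussion above,
since the standard library has no Script lookup.

```python
import unicodedata

micro = "\u00B5"      # MICRO SIGN
tone_mark = "\u0341"  # COMBINING ACUTE TONE MARK

# General categories: a lowercase letter and a nonspacing mark.
micro_gc = unicodedata.category(micro)
tone_gc = unicodedata.category(tone_mark)
print(micro_gc, tone_gc)

# Both change under NFKC, so both satisfy NFKC(cp) != cp.
micro_nfkc = unicodedata.normalize("NFKC", micro)
tone_nfkc = unicodedata.normalize("NFKC", tone_mark)
print(f"{ord(micro_nfkc):04X}", f"{ord(tone_nfkc):04X}")
```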

------------------------------------------------------------------

NEVER and not IDN_Never:

E. Latin, Greek, and Cyrillic uppercase letters

Discussion. IDN_Never does not include lists of characters unstable
under casefolding, because the status of casefolding as part
of the protocol definition or recommended as preprocessing
outside the context of the protocol was still in question,
and it seems inadvisable to include casing issues in a
property (IDN_Never) designed to be a rock bottom, immutable
guarantee of characters that could never, ever be appropriate
for IDNs.

F. Cyrillic combining enclosing marks

Discussion. These end up NEVER because they are script=Cyrillic,
but the general category of Me (combining enclosing mark) is
not criterial for adding characters to IDN_Never.

G. 00DF LATIN SMALL LETTER SHARP S

Discussion. This is related to E above. The I-D puts this character in NEVER,
because it is not stable under casefolding, but that is not
a criterion for IDN_Never.

H. Mistake in normalization in draft-faltstrom-idnabis-tables-02.txt:

Latin and Greek lowercase letters with two diacritics

Discussion. These were already discussed above. These should
be ALWAYS, not NEVER.

===================================================================

If you have followed to this point, it should be apparent
how differential application order of various criteria -- in
effect weighting some more highly than others -- has a rather
large impact on the difference in the tables.

In particular, IDN_Never takes Rule B, NFKC(cp) != cp, as
absolutely criterial. All characters with that status are
given IDN_Never, regardless of their script or general
category status.

By contrast, the way the rules and algorithms are set up
in draft-faltstrom-idnabis-tables-02.txt, characters that
are NFKC(cp) != cp can end up either NEVER or MAYBE NOT,
depending on whether they are Latin/Greek/Cyrillic or
anything else.
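
To make the contrast concrete, here is a rough sketch in Python of the
two treatments of NFKC-unstable characters as I have described them
above. It is not the I-D's algorithm: the function names are mine, only
the Rule B outcome is modeled, and the is_lgc flag has to be supplied
by hand because Python's standard library has no Script property.

```python
import unicodedata

def nfkc_unstable(ch: str) -> bool:
    """Rule B as stated in the text: NFKC(cp) != cp."""
    return unicodedata.normalize("NFKC", ch) != ch

def in_idn_never_by_rule_b(ch: str) -> bool:
    # IDN_Never: Rule B alone decides; script and general category are ignored.
    return nfkc_unstable(ch)

def draft_status_for_unstable(ch: str, is_lgc: bool):
    # I-D outcome as described above, for NFKC-unstable characters only.
    # is_lgc is a hypothetical caller-supplied flag meaning
    # "script is Latin, Greek, or Cyrillic".
    if not nfkc_unstable(ch):
        return None  # other rules decide; not modeled in this sketch
    return "NEVER" if is_lgc else "MAYBE NOT"

# 00AA is script=Latin; 00B5 is script=Common (values entered by hand).
examples = [("\u00AA", True), ("\u00B5", False)]
for ch, is_lgc in examples:
    print(f"{ord(ch):04X}",
          in_idn_never_by_rule_b(ch),
          draft_status_for_unstable(ch, is_lgc))
```

Both example characters land in IDN_Never, while the draft's rules
split them between NEVER and MAYBE NOT purely on script.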

--Ken