New version, draft-faltstrom-idnabis-tables-02.txt, available

Kenneth Whistler kenw at sybase.com
Tue Jun 12 03:50:51 CEST 2007


Patrik,

I've now done further analysis on the rules and algorithm
in your document, and have compared the resulting derived
table with the results I calculated and posted (see other
threads for full description) as the IDN_Permitted and
IDN_Never properties (in IDNPermitted.txt and IDNNever.txt,
respectively).

First, the high-level analysis.

The draft doesn't actually define clearly *what* the derived 
property is, but implies that its values are for deciding
what characters are to be used in IDNs. In particular,
in the introduction:

"This document reviews the collections of codepoints in 
Unicode from by looking at various properties of the codepoints,
        ^^^^^^^
        [typo to fix] 
and defines a dervied property that identify groups of
                                    ^^^^^^^^
                                    identifies
                                    [another typo]
characters.

o  Those that should clearly be included in IDNs
o  Those that should clearly not be included in IDNs
o  Those where no final determination can be made at this time"

The algorithm for the calculation of the derived property
in Section 3, on the other hand, doesn't relate specifically
back to those bullets, but instead defines four values:

o  ALWAYS
o  MAYBE YES
o  MAYBE NOT
o  NEVER

From the discussion we had about the last version of this draft,
I can assume that the intent here is as follows:

ALWAYS --> Those that should clearly be included in IDNs
NEVER  --> Those that should clearly not be included in IDNs
MAYBE  --> Those where no final determination can be made at this time

and that "MAYBE YES" versus "MAYBE NOT" is a qualification of the
"MAYBE" status.

Section 2 ("The rules used") provides a third incompatible
description of the derived property:

"For each rule, it is specified whether it is a rule that increase
[sic, --> "increases"] or decrease [sic, --> "decreases"] the
value of the property (regarding likelihood to be included in
a U-label), ..."

This makes it sound as if the property is "likelihood to be included
in a U-label", which could be interpreted as a stochastic property
of some sort, and also implies that we are dealing with a numeric
scale of some sort, rather than a discrete 3-valued (or 4-valued)
property.

So I think it is fair to say that the biggest problem with the
document at the moment is that it isn't clear about exactly
what is being defined and derived. That needs to be addressed
at the points I've identified above.

O.k., on to the rules and algorithm.

The rules themselves are a significant improvement over the first 
draft. I think we have substantial agreement that Rules A through G
as stated are relevant to the derivation of a property that
would be useful for defining the set of characters to be
included in IDNs. (Although see my other responses for clarification
in the statement of Rule D about Ignorables.)

I disagree about the usefulness of Rule H "Stable scripts", which
attempts to give IDN primacy to Latin, Greek, and Cyrillic on
the grounds that they "have encodings [that] are stable enough
for use in IDNs." On the contrary, many of the issues addressed
by some of the other rules (instability under NFKC(cp) and
instability under casefold(cp)) are most pronounced and problematical
precisely for the *Latin* script, whereas many of the scripts
supposedly "not stable enough" have no casing issues and few
if any NFKC issues. Furthermore, I consider it silly and
a complete non-starter to go out with a draft definition of
a property for defining what characters are suitable for IDNs
that puts all of basic Japanese in the MAYBE category.

The introduction of Rule H also unnecessarily elaborates and
complicates the definition of the algorithm in Section 3, without
substantially improving the results.

I'm not sure exactly how you envision the algorithm being
implemented, but as stated, it is a sequence of matching rules
applied in a particular order, which bleed out input at
various terminal states with a value determination, but which
thereby hide the imputed significance of the terminal states
(which are supposed to define the values of the property).

Also the sequential application of matching rules stated this
way hides some of the dependencies between the rules. For
example, Rule H and Rule D cannot be simultaneously True,
because there simply are no Latin, Greek or Cyrillic script
characters that are also Default_Ignorable_Code_Point=True.
That doesn't break the algorithm, necessarily, but it does
muddy the meaning of the outcome.

Furthermore, if you look at the intent of Rules B, C, and D,
all of them are intended as exclusionary criteria, and all
of them are treated in parallel in the statement of the
algorithm. For clarity below, I'll restate them as a single
exclusionary rule:

Rule X := (B | C | D)

  = unstable under NFKC | unstable under casefolding | ignorable
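As a rough illustration of Rule X, here is a minimal Python sketch.
It assumes that "unstable" means NFKC(cp) != cp for Rule B and
casefold(cp) != cp for Rule C, and it uses Python's built-in
str.casefold(), which may not be the exact folding the draft has in
mind. Python's standard library does not expose
Default_Ignorable_Code_Point, so SAMPLE_IGNORABLES is a hypothetical,
deliberately incomplete stand-in for Rule D. The function names are
mine, not the draft's.

```python
import unicodedata

# Hypothetical, incomplete stand-in for Default_Ignorable_Code_Point=True.
# A real implementation would read the full list from the Unicode data files.
SAMPLE_IGNORABLES = {0x00AD, 0x200C, 0x200D}


def rule_b(ch: str) -> bool:
    """Assumed Rule B: the character changes under NFKC normalization."""
    return unicodedata.normalize("NFKC", ch) != ch


def rule_c(ch: str) -> bool:
    """Assumed Rule C: the character changes under case folding."""
    return ch.casefold() != ch


def rule_d(ch: str) -> bool:
    """Assumed Rule D: the character is in the (sample) ignorable set."""
    return ord(ch) in SAMPLE_IGNORABLES


def rule_x(ch: str) -> bool:
    """Rule X := (B | C | D), the combined exclusionary rule."""
    return rule_b(ch) or rule_c(ch) or rule_d(ch)
```

The results depend on the Unicode version bundled with the Python
interpreter in use, so edge cases may differ from a table derived
from a specific Unicode release.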

Now here is an attempt to take the algorithm and restate it
in terms of what I *think* it is trying to do by application
of the rules, and based also on inspection of the property
values actually given in the table in Section 4.1.

Feel free to correct me if I've misconstrued this somehow.

===================================================================

ALWAYS

Defined as: G | (H & A & ~X)

Discursively: Grandfathered in ASCII, plus any LGC letters that
       are in the right categories and are not unstable under
       normalization or casefolding.
       
[The following three have an implied ~G, namely not [-A-Z0-9],
but I won't repeat that for each of them.]

NEVER

Defined as: H & (~A | X)

Discursively: Any LGC characters that are not in the right
       categories or which are unstable under normalization
       or casefolding.

MAYBE YES

Defined as: ~E & ~F & (~H & A & ~X)

Discursively: Any characters from any non-LGC scripts not in
       the obsolete scripts list, that are also in the right
       categories and are not unstable under normalization
       or casefolding.

MAYBE NOT

Defined as: E | F | (~H & (X | ~A))

Discursively: Obsolete scripts, excluded blocks of characters
       (combining marks for symbols, musical symbols), and
       any characters from any non-LGC scripts not in the
       obsolete scripts list, that are not in the right
       categories or which are unstable under normalization
       or casefolding. Also ignorables, noncharacters, and all
       reserved code points.

===================================================================
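The four definitions above can be written out as a small Python
sketch. It is only a transcription of my restatement, not of the
algorithm in Section 3 of the draft. The function name and the
lowercase flag names are hypothetical; each flag stands for "the
rule of that letter matches this code point", with x standing for
Rule X = (B | C | D). The implied ~G on the last three values is
made explicit.

```python
from itertools import product


def matching_values(a, e, f, g, h, x):
    """Return every property value whose restated definition matches.

    a, e, f, g, h are booleans for Rules A, E, F, G, H;
    x is the combined exclusionary Rule X = (B | C | D).
    """
    values = set()
    # ALWAYS: G | (H & A & ~X)
    if g or (h and a and not x):
        values.add("ALWAYS")
    # The remaining three values carry an implied ~G.
    if not g:
        # NEVER: H & (~A | X)
        if h and (not a or x):
            values.add("NEVER")
        # MAYBE YES: ~E & ~F & (~H & A & ~X)
        if not e and not f and (not h and a and not x):
            values.add("MAYBE YES")
        # MAYBE NOT: E | F | (~H & (X | ~A))
        if e or f or (not h and (x or not a)):
            values.add("MAYBE NOT")
    return values


# Flag combinations that match none of the four definitions.
uncovered = [
    flags
    for flags in product([False, True], repeat=6)
    if not matching_values(*flags)
]
```

Returning a set rather than a single value is a choice of this
sketch: the four expressions are evaluated independently, so a flag
combination such as H together with F can satisfy more than one of
them. Whether any real code point has such a combination depends on
the draft's script and block lists, which are not reproduced here.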

Now if that is the intent of the four values defined in the
table and how they are related to the 8 rules, there are some
further issues.

The immediate technical issue with the statement of the algorithm
is that it doesn't specify what to do with reserved code points,
but they somehow end up in the table listed as "MAYBE NOT".
A rule seems to be missing, as well as a step in the algorithm.

Second, the results have principled mismatches with the
intent of IDN_Permitted and IDN_Never that result not so much
from the rules themselves as from the differential weighting of
their importance for outcomes, implied by the way the sequential
steps of the algorithm are set up. The derivation of IDN_Permitted
and IDN_Never emphasized character properties per se, whereas
the algorithm in draft-faltstrom-idnabis-tables-02.txt seems
to be weighting uncertainty regarding the stability of encoding
for scripts most highly, in terms of deciding where characters
belong.

Third, there are some actual bugs in the derivation of the
table in Section 4.1, having to do with edge cases for
normalization and casefolding.

I'll address these last two issues in a separate contribution
where I do the detailed comparison between IDN_Permitted,
IDN_Never, and the listing in Section 4.1. (Tomorrow.)

--Ken


