Moving Right Along on the Inclusions Table...

Kenneth Whistler kenw at sybase.com
Thu Dec 21 00:03:49 CET 2006


Harald,

> > Ken,
> >
> > at a more 10,000-foot level:
> > are you of the opinion that an IDNA restriction rule, in order to be
> > viable, needs to include:
> >
> > - A (short) table of rules based on existing Unicode properties
> > (including "script")
> > - An exclusion table, excluding characters that are included by the rules
> > - An inclusion table, including characters that are excluded by the rules
> > - A set of context dependent rules, saying that certain characters can
> > only be used in certain combinations?
> >
> 
> Yes, that's a good restatement.

I would put it a little differently. Formally, that is equivalent to:

An IDNA restriction rule, in order to be viable, needs to comprise:

  1. A table of allowed characters
  2. A set of context dependent rules, saying that certain
     characters can only be used in certain combinations
     
The table itself can and should be expressed simply as an inclusion
table -- which means any character in it is allowed (and implicitly,
any character not in it is *excluded* and not allowed). Doing it
that way makes implementation straightforward.
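To make the "inclusion table" model concrete, here is a minimal sketch in Python. The table contents below are placeholders for illustration, not the actual IDNA repertoire; the point is only that membership in the table is the whole test:

```python
# A character is allowed if and only if its code point is in the table;
# anything not in the table is implicitly excluded. The entries here
# (a-z, 0-9, hyphen) are illustrative placeholders only.
ALLOWED = set(range(0x0061, 0x007B)) | set(range(0x0030, 0x003A)) | {0x002D}

def label_allowed(label: str) -> bool:
    """True if every code point in the label is in the inclusion table."""
    return all(ord(ch) in ALLOWED for ch in label)

print(label_allowed("example-1"))  # True
print(label_allowed("Example"))    # False: 'E' (U+0045) is not in the table
```

Because exclusion is implicit, an implementation needs no second table and no fallback logic: one set, one membership check.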

It is completely different (and might require different intermediate
structures) to determine:

A. How the table of allowed characters should be decided on in the
first place.

That is a process question, and involves us weighing various alternative
criteria and judging generality of criteria versus benefits and
drawbacks either of inclusion or exclusion of various subsets
of the overall possible repertoire.

B. How the table of allowed characters is *explained* in the protocol,
once its contents are decided.

My suggestion for how to handle A is with two tables (for now),
which are the SPInclusionList.txt and the SPInclusionAdd.txt I
have posted. The SPInclusionList.txt rolls up all the results of
paring down the domain (set theory sense of "domain") of
characters in Unicode first by broad rules that are property-based
(including specification of script values) and then by narrower
rules (which involve, as Mark indicated, some exclusions based
on properties *not* formally carried as Unicode character properties
in the Unicode Character Database, such as "combining marks
consisting of superscripted Latin letters used in the medievalist
manuscript tradition"). Then SPInclusionAdd.txt rolls up all the
suggestions for exceptional "add backs", for characters that
were strained out by all the other rules, but which still need to
be there for one reason or another.

At the end of the deliberation process, we concatenate
SPInclusionList.txt and SPInclusionAdd.txt and you get the
single inclusion list needed for the protocol definition.
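The concatenation step can be sketched as follows. The file format assumed here (one `XXXX` or `XXXX..YYYY` code point entry per line, with `#` comments) is an illustration only, not necessarily the format of the posted files:

```python
# Hedged sketch: merge a base inclusion list with an "add backs" list
# into the single inclusion set the protocol would depend on.
def parse_entries(lines):
    """Parse hex code points / ranges, ignoring '#' comments and blanks."""
    codepoints = set()
    for line in lines:
        entry = line.split('#', 1)[0].strip()
        if not entry:
            continue
        if '..' in entry:
            lo, hi = entry.split('..')
            codepoints.update(range(int(lo, 16), int(hi, 16) + 1))
        else:
            codepoints.add(int(entry, 16))
    return codepoints

# Toy stand-ins for SPInclusionList.txt and SPInclusionAdd.txt contents:
base = parse_entries(["0061..007A  # a-z", "0030..0039  # 0-9"])
adds = parse_entries(["00DF        # LATIN SMALL LETTER SHARP S (add back)"])
inclusion = base | adds  # the single concatenated inclusion list
```

Since both inputs are sets of code points, concatenation is just set union, and the deliberation process (which rule produced which entry) leaves no trace in the final table.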

Note that from the point of view of Mark's formalism for
set definition that he uses to generate tables, that is
also equivalent to simply adding some " ... +X, +[Y..Z] }"
subrules to the end of the single overall derivation rule.

For B, I think the table should be explained by making the
rules that went into its derivation explicit. So the protocol
specification can have a section that says, in effect, the
inclusion table was defined by using the following rules
based on Unicode character properties (including script values),
normalization rules, casefolding rules, the omission
of certain problematical ranges of code points, and the
addition of a few required exceptional characters. Spell out
those rules clearly enough, and people will stop asking
why the table has X but not Y in it, and will also have
a very well-founded expectation about what happens to the
table (and what does not happen to the table) when the
Unicode Standard adds another clump of characters for
Unicode 5.1, Unicode 6.0 or beyond.
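A toy analogue of such a rule-based explanation, using only properties exposed by Python's unicodedata module: the three checks below (general category, NFC stability, casefold stability) are illustrative stand-ins, not the actual IDNA criteria, which are richer and include script values:

```python
import unicodedata

# Illustrative derivation rule: a code point is allowed if it is a
# lowercase/other letter, modifier letter, combining mark, or decimal
# digit, AND is stable under NFC normalization AND under casefolding.
def derived_allowed(cp: int) -> bool:
    ch = chr(cp)
    if unicodedata.category(ch) not in {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}:
        return False
    if unicodedata.normalize("NFC", ch) != ch:  # must be NFC-stable
        return False
    if ch.casefold() != ch:                     # must be casefold-stable
        return False
    return True

print(derived_allowed(ord("a")))  # True
print(derived_allowed(ord("A")))  # False: removed by the casefolding rule
```

Stating the rules this way also answers the stability question directly: when Unicode adds new characters, rerunning the same derivation over the enlarged repertoire shows exactly which of them enter the table and why.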

But the rest of the StringPrep algorithm (and IDNA200X)
simply depend on the inclusion table itself. Simple and
easy.

> 
> > I think this may be a correct conclusion to draw from the discussion so
> > far, but does mean that we have admitted that we need to examine the
> > codeset character by character, at some level.
> 
> 
> Ken has been making suggestions for reductions which are really also based
> on classes of characters, just ones that are not formalized as Unicode
> properties. 

Correct.

> Frankly, few of them would cause spoofing problems, and I
> wouldn't bother with them at all except that insofar as they are clearly not
> used by modern languages, they are safe to exclude. Once such cases are
> eliminated, then I don't think it is productive to continue on that line,
> since other techniques, such as mixed-script detection, are far more
> powerful.

I agree. I suggested removal of the most obvious subsets of
combining marks that would not be needed by internet identifiers,
ever, in any context, because combining marks already cause the
most heartburn among folks worrying about IDNs. And I do think
that people need to take a look at the script repertoires, to
see, as Cary has been indicating for Hebrew, whether some few
omissions need to be rectified in the list. But beyond that,
I think we quickly get into sharply diminishing returns, where
trying to identify and eliminate unused or unuseful characters
on a script-by-script, character-by-character basis becomes
a misuse of our time for essentially zero payoff. 

There are much bigger fish to fry here than worrying about
whether U+02AD LATIN LETTER BIDENTAL PERCUSSIVE should be
in the inclusion list or not. My cost/benefit analysis in
cases like that is: It benefits little to have it in, it
costs little to have it in, it benefits little to have it out,
it costs little to have it out, but it costs a *lot* to
stymie and delay completion of the IDNA200X protocol update
arguing about cases like it.

We should be focussing now, I think, on the exact way to
spell out the needed context-sensitive issues for StringPrep
(and IDNA200X), i.e. the stuff under 2 above, and more
explicitly the issue of combining marks in bidi string runs
and the allowable contexts for ZWJ and ZWNJ (PRI #96).
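As one concrete example of the kind of context-dependent rule meant here, a simplified ZWNJ check can be sketched as follows. This mirrors one part of the contextual rule under discussion in PRI #96 (ZWNJ permitted only directly after a virama, i.e. a character with canonical combining class 9), simplified for illustration; it is not the full rule:

```python
import unicodedata

ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER

def zwnj_context_ok(label: str, i: int) -> bool:
    """True if the ZWNJ at index i directly follows a virama (ccc = 9)."""
    return i > 0 and unicodedata.combining(label[i - 1]) == 9

def contexts_ok(label: str) -> bool:
    """Check every ZWNJ occurrence in the label against the context rule."""
    return all(zwnj_context_ok(label, i)
               for i, ch in enumerate(label) if ch == ZWNJ)

# U+0928 NA + U+094D VIRAMA + ZWNJ + U+0924 TA: the ZWNJ follows a virama
print(contexts_ok("\u0928\u094D\u200C\u0924"))  # True
print(contexts_ok("ab\u200Cc"))                 # False: no preceding virama
```

Note how such a rule cannot live in the inclusion table at all: ZWNJ is either in the table or not, and only a separate contextual check can say *where* it may appear.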

--Ken

> 
> Harald
