Table-building

Fri Dec 15 03:24:29 CET 2006

John said:

> To remind people of something else that was said on the call, we 
> are really looking, I think, for a three-state table consisting 
> of
>     Clearly yes now
>     Not yet: need advice, refinement, and/or an identified user 
> community
>     Clearly no, now and forever.
> 
> rather than "ok" and "not ok".

I'm going to have to strongly disagree with that. I don't think
what we are really looking for is such a three-state table
at all. Trying to cast the task that way, and to build a
table that way, is a recipe for lingering confusion.

The first confusion it perpetuates is the level confusion
between the status of an individual character as being
in or out, and the status of criteria and rules as applying
(or not) to the building of the table.

This confusion is quite evident in the way the original
draft of faltstrom-idnabis-tables was constructed, I think.
On the one hand, it purported to be a category by category
and a block by block assessment of the tristate, but on
the other hand, the assessments themselves depended implicitly
on similar tristate assessments of the characters in each
category and in each block. The way it ends up getting
laid out, the result is confusing, and it becomes unclear
how to make progress toward deciding anything.

I've been trying to offer what I consider a better
approach to converging on a conclusion here.

The *table* itself should unambiguously be defined as
the list of characters appropriate for inclusion in
IDNA: IDNAInclusions.txt (or whatever name you like).

That table should be constructed by a set of clear criteria,
as Mark has been doing.

Then you examine each of the criteria (or possible new
criteria or modifications to the criteria), and in the
accompanying explanation document (the internet draft)
make assessments regarding the status of each criterion,
and raise questions for potential feedback, based on
what the consensus is regarding how evidently correct
(or shaky, or preliminary) each criterion is.

For example, the criterion NFKD(cp) = cp should simply be
marked as fixed and decided. Nobody in their right mind
is going to suggest that we start tinkering with that
criterion to add specific exceptions, modify the
tables behind it, or whatever.
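
To make the criterion concrete, here is a minimal sketch of the
NFKD(cp) = cp test as it is written in this message. It assumes
Python's standard unicodedata module, which uses whatever Unicode
version ships with the interpreter, so results for recently added
characters may differ from the tables under discussion. The function
name is hypothetical.

```python
import unicodedata


def is_nfkd_stable(cp: int) -> bool:
    """Return True when NFKD(cp) == cp, i.e. cp is unchanged by NFKD."""
    ch = chr(cp)
    return unicodedata.normalize("NFKD", ch) == ch


print(is_nfkd_stable(0x0061))  # LATIN SMALL LETTER A -> True
print(is_nfkd_stable(0xFB01))  # LATIN SMALL LIGATURE FI -> False
print(is_nfkd_stable(0x2460))  # CIRCLED DIGIT ONE -> False
```

The point of the sketch is that the test is purely mechanical: it is
derived from the normalization data and leaves no room for
per-character judgement.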

In my opinion, the first criterion:

1. If generalCategory(cp) is in {Ll, Lu, Lo, Lm, Mn, Mc, Nd}, add cp

is also something that we should simply stand by and take as
fixed. There is nothing to be gained by giving people
the impression that the IDNA inclusions table could be
improved by debating whether entire other classes
should be added to that criterion, or whether any of those
classes should be removed as a class.
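
Criterion 1 can be sketched the same way. This is only an
illustration, again assuming Python's unicodedata module and its
bundled Unicode version; the names INCLUDED_CATEGORIES and
passes_category_rule are hypothetical.

```python
import unicodedata

# The general categories listed in criterion 1.
INCLUDED_CATEGORIES = frozenset({"Ll", "Lu", "Lo", "Lm", "Mn", "Mc", "Nd"})


def passes_category_rule(cp: int) -> bool:
    """Return True when generalCategory(cp) is one of the listed classes."""
    return unicodedata.category(chr(cp)) in INCLUDED_CATEGORIES


print(passes_category_rule(0x0061))  # 'a', category Ll -> True
print(passes_category_rule(0x0301))  # COMBINING ACUTE ACCENT, Mn -> True
print(passes_category_rule(0x002D))  # HYPHEN-MINUS, Pd -> False
```

Because the rule works on whole classes, any character that needs
different treatment has to be handled by a later, narrower rule rather
than by editing this set.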

On the other hand, the script criterion:

5. If script(cp) in {Xsux, Ugar, Xpeo, Goth, Ital, Cprt, Linb,  
Phnx, Khar, Phag, Glag, Shaw, Dsrt, Runr}, remove cp

is precisely the point where the kind of judgement about
where to set the boundaries starts to get interesting.

I think this discussion group would be well-advised to triage the
scripts into categories along the lines that John has suggested.
Amendments to this rule on a script by script basis are not only
reasonable to envision; there might be very good justifications
to omit more scripts (or include more) before the final version
of the IDNAInclusions.txt table is created.

It is clear that Latin, Greek, Cyrillic, Han, ... *MUST* be
in the table.

I think it is clear that extinct, historic scripts like Xsux
(Sumero-Akkadian cuneiform) *MUST NOT* be in the table.

And we can then meet in the middle, and possibly come up with
a short list of scripts where we just aren't sure, and where
the safest course might be to omit them now, but leave the
door open for addition later.

*That* information can be spelled out, nuanced as necessary,
in idnabis-tables, where it can be discussed and changes
made, if we come to agreement. But I *don't* think it is helpful
to try to carry it in IDNAInclusions.txt as tristate
(or multistate) variables in the table.

My first cut on what SHOULD be excluded on a per script basis
is pretty much exactly as Mark has indicated, although I think
Runic is an example of a script where we start to get into
the "it's debatable" category.
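
The split between the triage notes and the table can be sketched as
follows. Everything here is hypothetical: the Status enum, the
SCRIPT_TRIAGE dictionary, and the per-script assignments are for
illustration only. Runic is marked undecided to mirror the remark
above, and the other scripts from rule 5 are marked OUT as a first cut,
not as a settled result. The triage lives in the explanatory draft; the
table only ever sees the derived set of scripts to remove.

```python
from enum import Enum


class Status(Enum):
    IN = "in"
    UNDECIDED = "??"
    OUT = "out"


# Hypothetical working notes, keyed by ISO 15924 script code.
SCRIPT_TRIAGE = {
    "Latn": Status.IN, "Grek": Status.IN, "Cyrl": Status.IN, "Hani": Status.IN,
    "Runr": Status.UNDECIDED,
    "Xsux": Status.OUT, "Ugar": Status.OUT, "Xpeo": Status.OUT,
    "Goth": Status.OUT, "Ital": Status.OUT, "Cprt": Status.OUT,
    "Linb": Status.OUT, "Phnx": Status.OUT, "Khar": Status.OUT,
    "Phag": Status.OUT, "Glag": Status.OUT, "Shaw": Status.OUT,
    "Dsrt": Status.OUT,
}


def scripts_to_remove(triage: dict) -> set:
    """Scripts omitted from the table for now: everything not clearly IN."""
    return {code for code, status in triage.items() if status is not Status.IN}


print(sorted(scripts_to_remove(SCRIPT_TRIAGE)))
```

Under these assumed assignments the derived set is exactly the script
list in rule 5: undecided scripts are omitted now, and moving one to IN
later changes the table without the table itself ever carrying a third
state.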

> Someone suggested on the call that many additional gradations 
> are possible within the second category.  Certainly that is 
> true.  I don't know how useful it would be in practice to try to 
> tease them out, but people who are convinced that it is should 
> certainly try to do that.

I would start first by resolving the script by script
status as clearly as possible.

Then I would identify specific subsets of combining marks
to receive the same kind of triage discussion. Resolution
of those in terms of criteria will probably end up simply
as rules to exclude particular code point ranges.

Finally, we need to scan for the occasional individual
character whose inclusion is still problematic for
one reason or another, and the occasional individual
character (HEBREW GERESH, for example) whose omission is
still problematic for one reason or another. Those will all
tend to be special cases, and their status should be discussed
on a case by case basis, as necessary.
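
The layering described above (class-level criteria first, then range
exclusions, then individual exceptions) could be sketched like this.
It is a sketch under stated assumptions, not the actual table-building
procedure: the excluded range (a block of musical combining marks) is a
hypothetical example, U+05F3 HEBREW PUNCTUATION GERESH is assumed to be
the character meant by HEBREW GERESH, and the script rule is left out
because Python's standard library does not expose the Script property.

```python
import unicodedata

INCLUDED_CATEGORIES = frozenset({"Ll", "Lu", "Lo", "Lm", "Mn", "Mc", "Nd"})

# Hypothetical range exclusion for a subset of combining marks.
EXCLUDED_RANGES = [(0x1D165, 0x1D169)]

# Hypothetical per-character decisions, settled case by case.
FORCE_INCLUDE = {0x05F3}  # assumed: HEBREW PUNCTUATION GERESH
FORCE_EXCLUDE = set()


def in_table(cp: int) -> bool:
    """Apply individual exceptions first, then the general criteria."""
    if cp in FORCE_EXCLUDE:
        return False
    if cp in FORCE_INCLUDE:
        return True
    ch = chr(cp)
    if unicodedata.category(ch) not in INCLUDED_CATEGORIES:
        return False
    if unicodedata.normalize("NFKD", ch) != ch:
        return False
    if any(lo <= cp <= hi for lo, hi in EXCLUDED_RANGES):
        return False
    return True


def build_table(code_points) -> list:
    """Return the sorted list of code points that belong in the table."""
    return sorted(cp for cp in code_points if in_table(cp))


print(build_table([0x0061, 0x002D, 0x05F3, 0xFB01, 0x1D165]))  # [97, 1523]
```

The output is a plain list of included code points. The reasons for
each decision stay in the rule definitions and the accompanying draft,
not in the table.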

When we get to that point, and as long as IDNAInclusions.txt
has tracked the current best resolutions of status, the
whole process very quickly converges, in my opinion, to
the point of diminishing returns on further effort,
and we can happily complete the process.

> One of the implications of the three-state model is that we 
> don't need to settle the question of which scripts are 
> sufficiently archaic to be excluded: we put the lot of them into 
> the middle category and see if a community of users of the 
> script comes forward and makes a case for using it (or part of 
> it) in domain names.

I think we get better results much more quickly if we do the
script triage like this:

IN                 ??                OUT
XXXXXXXXXXXXXX    XXXX   XXXXXXXXXXXXXXX

than if we do it like this:

IN                 ??                OUT
XXX    XXXXXXXXXXXXXXXXXXXXXXXXXXX   XXX

In other words, I don't think it is advisable to tell people
there is a very large middle category and then wait around
for lots of different groups to come forward and try to
make the case for each one.

It is far more efficient to place a clearer stake in the ground
(and I think Mark and I have chosen a quite defensible cutoff
point to start the discussion), and then wait for people to
argue the few cases at the edge.

--Ken