Table-building

Fri Dec 15 08:45:52 CET 2006

FWIW: I understand what John wants, and I agree that it would be good  
for us to have such a table. For example, if I look at the list of  
comments from Kenneth below, the "not yet" category could be  
"whatever is included by a rule that we still discuss".

UNFORTUNATELY, the software I wrote and use at the moment has only
two states, so in the tables that I will release
today, you will only see yes and no.

This is partly because people earlier on very explicitly wanted it sorted
by script, and within each script divided into included and not included
codepoints.

(Because of this, I think Ken's lists are complementary, as they are
codepoint by codepoint...)

    Patrik

On 15 dec 2006, at 03.24, Kenneth Whistler wrote:

> John said:
>
>> To remind people of something else that was said on the call, we
>> are really looking, I think, for a three-state table consisting
>> of
>>     Clearly yes now
>>     Not yet: need advice, refinement, and/or an identified user
>> community
>>     Clearly no, now and forever.
>>
>> rather than "ok" and "not ok".
>
> I'm going to have to strongly disagree with that. I don't think
> what we are really looking for is such a three-state table
> at all. Trying to cast the task that way, and to build a
> table that way, is a recipe for lingering confusion.
>
> The first confusion it perpetuates is the level confusion
> between the status of an individual character as being
> in or out, and the status of criteria and rules as applying
> (or not) to the building of the table.
>
> This confusion is quite evident in the way the original
> draft of faltstrom-idnabis-tables was constructed, I think.
> On the one hand, it purported to be a category by category
> and a block by block assessment of the tristate, but on
> the other hand, the assessments themselves depended implicitly
> on similar tristate assessments of the characters in each
> category and in each block. Then the way it ends up getting
> laid out, the result is confusing, and it becomes unclear
> how to make progress towards deciding anything.
>
> I've been trying to offer what I consider the better
> approach to converging on a conclusion here.
>
> The *table* itself should unambiguously be defined as
> the list of characters appropriate for inclusion in
> IDNA. IDNAInclusion.txt (or whatever name you like).
>
> That table should be constructed by a set of clear criteria,
> as Mark has been doing.
>
> Then you examine each of the criteria (or possible new
> criteria or modifications to the criteria), and in the
> accompanying explanation document (the internet draft)
> make assessments regarding the status of each criterion,
> and raise questions for potential feedback, based on
> what the consensus is regarding how evidently correct
> (or shaky, or preliminary) each criterion is.
>
> For example, the criterion NFKD(cp) = cp should simply be
> marked as fixed and decided. Nobody in their right mind
> is going to suggest that we start tweedling with that
> criterion to add specific exceptions or modify the
> tables behind it, whatever.
>
> In my opinion, the first criterion:
>
> 1. If generalCategory(cp) is in {Ll, Lu, Lo, Lm, Mn, Mc, Nd}, add cp
>
> is also something that we should simply stand by and take as
> fixed. There is nothing to be gained by giving people
> the impression that the IDNA inclusions table could be
> improved by debating over whether entire other classes
> should be added to that criterion, or whether any of those classes
> should be removed as a class.
>
> On the other hand, the script criterion:
>
> 5. If script(cp) in {Xsux, Ugar, Xpeo, Goth, Ital, Cprt, Linb,
> Phnx, Khar, Phag, Glag, Shaw, Dsrt, Runr}, remove cp
>
> is precisely the point where the kind of judgement about
> where to set the boundaries starts to get interesting.
>
> I think this discussion group would be well-advised to triage the
> scripts into categories along the lines that John has suggested,
> because possible amendments to this rule on a script by
> script basis are not only reasonable to envision -- there
> might be very good justifications to omit more scripts
> (or include more) before the final version of the
> IDNAInclusions.txt table is created.
>
> It is clear that Latin, Greek, Cyrillic, Han, ... *MUST* be
> in the table.
>
> I think it is clear that extinct, historic scripts like Xsux
> (Sumero-Akkadian cuneiform) *MUST NOT* be in the table.
>
> And we can then meet in the middle, and possibly come up with
> a short list of scripts where we just aren't sure, and where
> the safest course might be to omit them now, but leave the
> door open for addition later.
>
> *That* information can be spelled out, nuanced as necessary,
> in idnabis-tables, where it can be discussed and changes
> made, if we come to agreement. But I *don't* think it is helpful
> to try to carry it in an IDNAInclusions.txt in tristate
> (or multistate) variables in the table.
>
> My first cut on what SHOULD be excluded on a per script basis
> is pretty much exactly as Mark has indicated, although I think
> Runic is an example of a script where we start to get into
> the "it's debatable" category.
>
>> Someone suggested on the call that many additional gradations
>> are possible within the second category.  Certainly that is
>> true.  I don't know how useful it would be in practice to try to
>> tease them out, but people who are convinced that it is should
>> certainly try to do that.
>
> I would start first by resolving the script by script
> status as clearly as possible.
>
> Then by specifying specific subsets of combining marks
> to receive the same kind of triage discussion. Resolution
> of those in terms of criteria will probably end up simply
> as rules to exclude particular code point ranges.
>
> Finally, we need to scan for the occasional individual
> character whose inclusion is still problematical for
> one reason or another and the occasional individual
> character (HEBREW GERESH) whose omission is still problematical
> for one reason or another. Those will all tend to be
> special cases, and their status should be discussed on
> a case by case basis, as necessary.
>
> When we get to that point, and as long as IDNAInclusions.txt
> has tracked the current best resolutions of status, the
> whole process very quickly converges, in my opinion, to
> the point of diminishing returns on further effort,
> and we can happily complete the process.
>
>> One of the implications of the three-state model is that we
>> don't need to settle the question of which scripts are
>> sufficiently archaic to be excluded: we put the lot of them into
>> the middle category and see if a community of users of the
>> script comes forward and makes a case for using it (or part of
>> it) in domain names.
>
> I think we get better results much quicker, if we do the
> script triage like this:
>
> IN                 ??                OUT
> XXXXXXXXXXXXXX    XXXX   XXXXXXXXXXXXXXX
>
> than if we do it like this:
>
> IN                 ??                OUT
> XXX    XXXXXXXXXXXXXXXXXXXXXXXXXXX   XXX
>
> In other words, I don't think it is advisable to tell people
> there is a very large middle category and then wait around
> for lots of different groups to come forward and try to
> make the case for each one.
>
> It is far more efficient to place a clearer stake in the ground
> (and I think Mark and I have chosen a quite defensible cutoff
> point to start the discussion), and then wait for people to
> argue the few cases at the edge.
>
> --Ken
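
For illustration only, the criteria Ken quotes above (rule 1, the NFKD test, and rule 5) could be sketched as a two-state yes/no check along the following lines. This is a rough sketch under stated assumptions, not the software used to generate the tables: the names is_included and script_of are made up for this example, the rules not quoted in this message are left out, and because the Python standard library does not expose the Unicode Script property, script_of is a hypothetical stand-in that only approximates two of the listed scripts by their block ranges. The NFKD test is applied literally as written, so it also rejects any codepoint that has a decomposition.

```python
import unicodedata

# General categories admitted by rule 1 as quoted in the message.
INCLUDED_CATEGORIES = {"Ll", "Lu", "Lo", "Lm", "Mn", "Mc", "Nd"}

# Script codes removed by rule 5 as quoted in the message.
EXCLUDED_SCRIPTS = {
    "Xsux", "Ugar", "Xpeo", "Goth", "Ital", "Cprt", "Linb",
    "Phnx", "Khar", "Phag", "Glag", "Shaw", "Dsrt", "Runr",
}


def script_of(cp):
    """Hypothetical stand-in for a real Script property lookup.

    Assumption: block ranges are used as a rough proxy for two scripts
    only. Real code would read the Unicode Character Database instead.
    """
    if 0x16A0 <= cp <= 0x16FF:    # Runic block
        return "Runr"
    if 0x10330 <= cp <= 0x1034F:  # Gothic block
        return "Goth"
    return None                   # unknown to this sketch


def is_included(cp):
    """Return True if codepoint cp passes the three quoted criteria."""
    ch = chr(cp)
    # Rule 1: the general category must be one of the listed classes.
    if unicodedata.category(ch) not in INCLUDED_CATEGORIES:
        return False
    # NFKD(cp) = cp: the codepoint must be unchanged by normalization.
    if unicodedata.normalize("NFKD", ch) != ch:
        return False
    # Rule 5: drop codepoints belonging to an excluded script.
    if script_of(cp) in EXCLUDED_SCRIPTS:
        return False
    return True


if __name__ == "__main__":
    for cp in (0x0061, 0x002D, 0xFB01, 0x16A0):
        verdict = "yes" if is_included(cp) else "no"
        print(f"U+{cp:04X} {verdict}")
```

Keeping the script list as plain data means that any change agreed for a borderline script (Runic, say) is a one-line edit to EXCLUDED_SCRIPTS, while the table itself stays yes/no.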