Outstanding issues (4): Contextual rule definitions and registry

Sun May 25 20:55:40 CEST 2008

In the hope of getting some discussion going that focuses on 
unresolved issues in the WG's charter, I'm about to post four 
notes that contain a list of substantive outstanding issues and 
loose ends with the documents for which I hold the pen.   This 
is the fourth of those four.  Please, for all four, if you open 
up significant new topics, change the subject line.   And, for 
this one and the second and third, please use separate threads 
for each issue so that we can discuss them, rather than 
addressing omnibus notes to the editor: if these topics were not 
at least somewhat uncertain or controversial, they would have 
been resolved and reflected in the documents by now.

The major innovation in the IDNA2008 proposals, at least IMO, is 
the idea that characters that are plausible for IDN use only in 
specific contexts can be used by defining and testing those 
contexts on a "full label" or "adjacent character(s)" basis as 
needed.  The mechanism raises a number of issues and the text in 
Rationale and Protocol (especially the "Contextual Rules" 
appendix) that describes it contains some significant 
handwaving (of which the design team is obviously aware).

This note tries to summarize the key outstanding issues in the 
hope of eliciting focused WG discussion and suggestions about 
resolution.

(1) Definition of the rules.

It turns out that, as long as there are not very many characters 
requiring contextual treatment and hence not very many of these 
rules (and we don't expect that there will be), they will 
probably be much easier to implement by rule-specific tests 
(e.g., "if the character is X, then does it follow one of the 
following...") than by a general-purpose 
rule-interpreting-and-testing engine.   The specification should 
not, IMO, take a position on whether per-rule code or a general 
engine should be used.
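
To make the per-rule option concrete, here is a minimal sketch of 
what one rule-specific test might look like.  The rule itself is 
an assumption made purely for illustration (ZERO WIDTH JOINER is 
acceptable only immediately after a character whose canonical 
combining class is Virama); it is not taken from the drafts, and 
the function names are hypothetical.

```python
import unicodedata

ZWJ = "\u200d"
VIRAMA_CCC = 9  # canonical combining class assigned to virama characters


def zwj_context_ok(label: str, pos: int) -> bool:
    """Hypothetical rule: ZWJ is acceptable only directly after a virama."""
    if pos == 0:
        return False  # nothing precedes the ZWJ, so the context cannot match
    return unicodedata.combining(label[pos - 1]) == VIRAMA_CCC


def label_passes(label: str) -> bool:
    """Apply the single per-character test to every ZWJ in the label."""
    return all(
        zwj_context_ok(label, i) for i, ch in enumerate(label) if ch == ZWJ
    )
```

Each additional contextual character would get its own small 
function of this kind, which is tolerable only as long as the 
number of such characters stays small.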

However, that pair of choices affects the definitional method. 
If a general-purpose engine is to be possible, then the rules 
must be stated in a machine-processable way that it can 
interpret. Complexity in those statements may be a very bad idea 
and we may even want to give up some precision in a given rule 
(e.g., the more relaxed "in that script" rather than "next to 
one of this list of characters").  If not, then the only 
requirement is that the rules be stated precisely enough to be 
implemented unambiguously.
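
For contrast, a general-purpose engine might be sketched as a 
table that maps each contextual character to a pair of patterns, 
one for the text before it and one for the text after it.  This 
is only an illustration under stated assumptions: the table 
layout, the names, and the single sample rule (MIDDLE DOT 
permitted only between two "l" characters) are all hypothetical, 
and the patterns use plain character matches rather than Unicode 
property syntax.

```python
import re

# Hypothetical rule table: character -> (pattern for the text before it,
# pattern for the text after it).  Both must match for the label to pass.
RULES = {
    "\u00b7": (re.compile(r"l\Z"), re.compile(r"l")),
}


def check_label(label: str, rules=RULES) -> bool:
    """Return True if every character that has a rule satisfies it."""
    for i, ch in enumerate(label):
        rule = rules.get(ch)
        if rule is None:
            continue  # no contextual rule for this character
        before, after = rule
        # 'before' must match at the end of the preceding text,
        # 'after' must match at the start of the following text.
        if not (before.search(label[:i]) and after.match(label[i + 1:])):
            return False
    return True
```

The engine stays the same as rules are added; only the table 
grows.  That is exactly why the rules would have to be written 
in a form the engine can read.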

So far, the only way we have found to state the rules in a 
machine-processable form involves the use of Unicode extended 
regular expressions.   That form is used (or at least attempted) 
in the current Contextual Rules Appendix in 
draft-ietf-idnabis-protocol-00.  It has proven somewhat clumsy, 
partially because the existing set of Unicode properties is not 
ideally matched to some of the statements that are needed.  To 
repeat the example mentioned in that Appendix, a test on "is an 
Indic script" would be much easier to state (and much more 
stable if more of those scripts are added in the future) than a 
test on an explicit (and rather long) list of scripts.  That is 
almost certainly not the only example and it is not clear 
whether it would be reasonable to ask UTC for derived properties 
of this type, much less whether they would agree to create and 
maintain them.
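
The explicit-list problem can be sketched as follows.  The 
script list here is illustrative and deliberately incomplete, 
and using the first word of the character name as a stand-in for 
the script property is a shortcut assumed only for this example 
(Python's standard unicodedata module does not expose the script 
property).

```python
import unicodedata

# Illustrative, incomplete list; a real rule would need every relevant
# script and would have to be revised whenever another one is added.
INDIC_SCRIPT_NAMES = {
    "DEVANAGARI", "BENGALI", "GURMUKHI", "GUJARATI", "ORIYA",
    "TAMIL", "TELUGU", "KANNADA", "MALAYALAM",
}


def in_listed_indic_script(ch: str) -> bool:
    """Approximate script test based on the first word of the character name."""
    name = unicodedata.name(ch, "")
    return name.split(" ", 1)[0] in INDIC_SCRIPT_NAMES
```

Any script missing from the list (Sinhala, in this sketch) fails 
the test until someone edits the list, which is the stability 
problem a derived "is an Indic script" property would avoid.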

In addition, as the note in the Appendix indicates, neither 
Patrik nor I am really happy about the use of Unicode regular 
expression syntax even as a definitional method, especially 
without more explanation than appears in the current text or 
even in UTS#18 (reference [Unicode-RegEx] in Protocol).  In 
particular, from my point of view (Patrik may have other 
reasons), use of those regular expressions for Contextual Rules 
involves some of the edge cases that remind us that there is no 
universal definition for a regular expression (RE) and that an 
expression written for the RE syntax of one language or system 
may be interpreted differently by another.  That is not a good 
basis for interoperable standardization.   But we have not been 
able to find anything better and especially not anything that 
would be compatible with a general-purpose, table-driven rule 
interpreter.
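
One small, concrete instance of the dialect problem: Python's 
standard "re" module rejects property escapes of the general 
kind that UTS#18 describes, so a rule written that way cannot 
even be compiled there without a different library.  The helper 
name below is hypothetical.

```python
import re


def stdlib_re_accepts(pattern: str) -> bool:
    """Report whether Python's standard re module can compile the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# A property-style escape is rejected by the standard module, while an
# ordinary character class is accepted.
property_ok = stdlib_re_accepts(r"\p{L}")
plain_ok = stdlib_re_accepts(r"[a-z]")
```

Other engines accept the same escape, and still others accept it 
with different semantics, so the text of a rule alone does not 
pin down its behavior.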

So we either need a suggestion about a different approach or a 
lot of help getting the rules right.

Obviously one possible different approach would be procedural 
pseudo-code statements for each rule, leaving the would-be 
authors of a general rule interpreter on their own.  I 
personally don't like that approach in general but the tradeoffs 
here may push us in that direction.

Discussion needed, obviously.

Whatever conclusions we reach must be reflected in Section 
6.1.1.2 of Rationale and in the Contextual Rules material in 
Protocol.  And, of course, they affect the registration 
description discussed below.

(2) The registry itself and rules for updating it

Section 13.2 ("IANA Considerations: IDNA Context Registry") of 
Rationale contains a discussion of the updating rules for the 
Contextual Rules registry and a description of that registry. 
The descriptive material overlaps with material in the 
"Contextual Rules appendix" of Protocol, which attempts to be 
much more precise.  In addition to the question of what the 
registry entries look like (the controversial parts of which are 
discussed above), there is a question of an updating mechanism 
for the registry itself.   Almost by definition, any new 
characters that are classified as requiring context (ContextO or 
ContextJ) are going to be problematic.  If they were not, they 
would simply be Protocol-Valid.  "Problematic", in this context, 
implies that there may be significant disagreements about the 
rules (and whether the characters should be permitted at all). 
Resolving those disagreements properly may require considerable 
expertise about specific characters and scripts and their 
appropriate use in IDN/DNS contexts, a problem that is 
aggravated by the conflict between the well-known principle that 
it is usually easier to relax a rule than to make it more 
restrictive and the observation that broad and general rules may 
be easier to describe.

A review of the IETF's list of recommended review and approval 
mechanisms for IANA registries (RFC 5226 / BCP 26) does not turn 
up anything that appears to be a good match for this situation 
(at least IMO).  So the WG needs to consider this situation 
carefully, define appropriate rules, and, if necessary, be 
prepared to sell the IESG on an exceptional case.

       john