Outstanding issues(4): Contextual rule definitions and
registry
John C Klensin
klensin at jck.com
Sun May 25 20:55:40 CEST 2008
In the hope of getting some discussion going that focuses on
unresolved issues in the WG's charter, I'm about to post four
notes that contain a list of substantive outstanding issues and
loose ends with the documents for which I hold the pen. This
is the fourth of those four. Please, for all four, if you open
up significant new topics, change the subject line. And, for
this one and the second and third, please use separate threads
for each issue so that we can discuss them, rather than
addressing omnibus notes to the editor: if these topics were not
at least somewhat uncertain or controversial, they would have
been resolved and reflected in the documents by now.
The major innovation in the IDNA2008 proposals, at least IMO, is
the idea that characters that are plausible for IDN use only in
specific contexts can be used by defining and testing those
contexts on a "full label" or "adjacent character(s)" basis as
needed. The mechanism raises a number of issues and the text in
Rationale and Protocol (especially the "Contextual Rules
appendix") that describes it contains some significant
handwaving (of which the design team is obviously aware).
This note tries to summarize the key outstanding issues in the
hope of eliciting focused WG discussion and suggestions about
resolution.
(1) Definition of the rules.
It turns out that, as long as there are not very many characters
requiring contextual treatment and hence not very many of these
rules (and we don't expect that there will be), they will
probably be much easier to implement by rule-specific tests
(e.g., "if the character is X, then does it follow one of the
following...") than by a general-purpose
rule-interpreting-and-testing engine. The specification should
not, IMO, take a position on whether per-rule code or a general
engine should be used.
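As a rough illustration of what a rule-specific test could look
like, here is a minimal Python sketch. The rule itself (ZERO WIDTH
JOINER permitted only directly after a character whose canonical
combining class is Virama) and the names zwj_context_ok and
label_ok are assumptions chosen for illustration; they are not
taken from the drafts discussed in this note.

```python
import unicodedata

ZWJ = "\u200d"       # ZERO WIDTH JOINER, used here only as an example "X"
VIRAMA_CCC = 9       # Unicode canonical combining class value for Virama


def zwj_context_ok(label: str, index: int) -> bool:
    """Hypothetical per-rule test: is the ZWJ at `index` directly
    preceded by a character with combining class Virama?"""
    if label[index] != ZWJ:
        raise ValueError("character at index is not ZWJ")
    if index == 0:
        return False  # nothing precedes it, so the context cannot hold
    return unicodedata.combining(label[index - 1]) == VIRAMA_CCC


def label_ok(label: str) -> bool:
    """Apply the single hard-coded rule to every ZWJ in the label."""
    return all(
        zwj_context_ok(label, i)
        for i, ch in enumerate(label)
        if ch == ZWJ
    )
```

Each further contextual character would get its own small function
of this kind, which stays manageable only while the number of rules
remains small.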
However, that pair of choices affects the definitional method.
If a general-purpose engine is to be possible, then the rules
must be stated in a machine-processable way that it can
interpret. Complexity in those statements may be a very bad idea
and we may even want to compromise with having a precise focus
in a given rule (e.g., the more relaxed "in that script" rather
than "next to one of this list of characters"). If not, then it
is only a requirement that the rules be stated precisely and
unambiguously enough to be implemented unambiguously.
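For contrast, a general-purpose engine needs the rules as data
rather than as code. The following Python sketch shows one possible
shape for that, assuming each rule is a pair of patterns tested
against the text before and after the character. The ContextRule
type, the RULES table, and both example rules are hypothetical and
are not the format proposed in Protocol.

```python
import re
from typing import Dict, NamedTuple


class ContextRule(NamedTuple):
    before: str  # pattern that must match at the end of the preceding text
    after: str   # pattern that must match at the start of the following text


# Hypothetical rule table, keyed by the character that needs context.
RULES: Dict[str, ContextRule] = {
    # ZERO WIDTH JOINER only after one of two explicitly listed viramas
    "\u200d": ContextRule(before=r"[\u094d\u09cd]\Z", after=r""),
    # MIDDLE DOT only between two lowercase "l" characters
    "\u00b7": ContextRule(before=r"l\Z", after=r"l"),
}


def check_label(label: str, rules: Dict[str, ContextRule] = RULES) -> bool:
    """Generic interpreter: look up each character and test its context."""
    for i, ch in enumerate(label):
        rule = rules.get(ch)
        if rule is None:
            continue  # character has no contextual rule in this table
        if re.search(rule.before, label[:i]) is None:
            return False
        if re.match(rule.after, label[i + 1:]) is None:
            return False
    return True
```

Note that the explicit character list in the first rule is exactly
the kind of statement that a broader test ("in that script") would
shorten, at the cost of precision.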
So far, the only way we have found to state the rules in a
machine-processable form involves the use of Unicode extended
regular expressions. That form is used (or at least attempted)
in the current Contextual Rules Appendix in
draft-ietf-idnabis-protocol-00. It has proven somewhat clumsy,
partially because the existing set of Unicode properties is not
ideally matched to some of the statements that are needed. To
repeat the example mentioned in that Appendix, a test on "is an
Indic script" would be much easier to state (and much more
stable were more of those scripts added in the future) than a
test on an explicit (and rather long) list of scripts. That is
almost certainly not the only example and it is not clear
whether it would be reasonable to ask UTC for derived properties
of this type, much less whether they would agree to create and
maintain them.
In addition, as the note in the Appendix indicates, neither
Patrik nor I are really happy about the use of Unicode regular
expression syntax even as a definitional method, especially
without more explanation than appears in the current text or
even in UTS#18 (reference [Unicode-RegEx] in Protocol). In
particular, from my point of view (Patrik may have other
reasons), use of those regular expressions for Contextual Rules
involves some of the edge cases that remind us that there is no
universal definition for a regular expression (RE) and that ones
defined for the RE syntax of a particular language or system may
be interpreted differently for another one. That is not a good
basis for interoperable standardization. But we have not been
able to find anything better and especially not anything that
would be compatible with a general-purpose, table-driven rule
interpreter.
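One small, concrete illustration of the dialect problem, using
Python's standard re module purely as an example (it is not
necessarily one of the edge cases meant above): the same pattern
text can accept different strings depending on the matching mode,
and, as far as I know, that module does not accept UTS#18-style
property escapes such as \p{Nd} at all. The helper names below are
made up for this sketch.

```python
import re

ARABIC_INDIC_DIGIT_THREE = "\u0663"  # a decimal digit outside ASCII


def digit_matches(flags: int = 0) -> bool:
    """Does the pattern \\d match the Arabic-Indic digit under these flags?"""
    return re.fullmatch(r"\d", ARABIC_INDIC_DIGIT_THREE, flags) is not None


def supports_property_escape() -> bool:
    """Probe whether this RE implementation accepts a \\p{...} escape."""
    try:
        re.compile(r"\p{Nd}")
    except re.error:
        return False
    return True


if __name__ == "__main__":
    print("Unicode mode:", digit_matches())
    print("ASCII mode:  ", digit_matches(re.ASCII))
    print("\\p{Nd} accepted:", supports_property_escape())
```

A rule written against one of these interpretations would silently
mean something else under the other, which is the interoperability
concern in a nutshell.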
So we either need a suggestion about a different approach or a
lot of help getting the rules right.
Obviously one possible different approach would be procedural
pseudo-code statements for each rule, leaving the would-be
authors of a general rule interpreter on their own. I
personally don't like that approach in general but the tradeoffs
here may push us in that direction.
Discussion needed, obviously.
Whatever conclusions we reach must be reflected in Section
6.1.1.2 of Rationale and in the Contextual Rules material in
Protocol. And, of course, they affect the registration
description discussed below.
(2) The registry itself and rules for updating it
Section 13.2 ("IANA Considerations: IDNA Context Registry") of
Rationale contains a discussion of the updating rules for the
Contextual Rules registry and a description of that registry.
The descriptive material overlaps with material in the
"Contextual Rules appendix" of Protocol, which attempts to be
much more precise. In addition to the question of what the
registry entries look like (the controversial parts of which are
discussed above), there is a question of an updating mechanism
for the registry itself. Almost by definition, any new
characters that are classified as requiring context (ContextO or
ContextJ) are going to be problematic. If they were not, they
would simply be Protocol-Valid. "Problematic", in this context,
implies that there may be significant disagreements about the
rules (and whether the characters should be permitted at all).
Resolving those disagreements properly may require considerable
expertise about specific characters and scripts and their
appropriate use in IDN/DNS contexts -- a problem that is
aggravated by the conflict between the well-known principle that
it is usually easier to relax a rule than to make it more
restrictive and the observation that broad and general rules may
be easier to describe.
A review of the IETF's list of recommended review and approval
mechanisms for IANA registries (RFC 5226 / BCP 26) does not turn
up anything that appears to be a good match for this situation
(at least IMO). So the WG needs to consider this situation
carefully, define appropriate rules, and, if necessary, be
prepared to sell the IESG on an exceptional case.
john