Getting onto the 'Always' List: Different Views of Rule H (was: RE: New version, draft-faltstrom-idnabis-tables-02)

Fri Jun 22 23:24:49 CEST 2007

Hello again.

As I indicated in an earlier note, I find Rule H odious and
wish it were not necessary.  It, or some variation on it,
appears to be necessary and nothing in the recent threads has
convinced me otherwise (YMMD).  The problem to be solved, from
my perspective (which may not be identical to Patrik's or
Harald's earlier notes on the subject, but is probably
compatible) looks like the following.

We have discovered some cases in which we need, not only a rule
that a character is permitted, but also rules about where it
can appear (either what characters it can appear with or where
it can appear in the sequence of characters that make up a
label).  The obvious, and oft-cited, examples of this are the
hyphen in the original LDH rules (cannot appear at the
beginning or end of a label) and ZWJ/ZWNJ (cannot appear
except in labels that are entirely in certain scripts or next
to particular characters -- see my (subsequent) note to Ken).
Some of us have suspected from time to time that there are some
potential edge cases involving combining characters, but the
nature of the problem exists whether those cases exist or not,
so let's not revisit that argument now.
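
To make the two kinds of contextual rule concrete, here is a
minimal sketch in Python.  It is an illustration only, not
proposed text for the tables draft.  The function names are
hypothetical, and the set of characters a joiner may follow is
left as a caller-supplied parameter because the actual rules for
ZWJ/ZWNJ are exactly what is still under discussion.

```python
ZWJ = "\u200d"   # ZERO WIDTH JOINER
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def hyphen_ok(label):
    """Positional rule from LDH: no hyphen at the start or end of a label."""
    return not (label.startswith("-") or label.endswith("-"))


def joiners_ok(label, allowed_before):
    """Adjacency rule (hypothetical form): every ZWJ/ZWNJ must directly
    follow a character in allowed_before, a set supplied by the caller."""
    for i, ch in enumerate(label):
        if ch in (ZWJ, ZWNJ):
            if i == 0 or label[i - 1] not in allowed_before:
                return False
    return True
```

The point of the sketch is only that "permitted" is not a
property of the character alone: both checks need the whole
label, which is why a per-character table cannot express them.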

We have also discovered, again extrapolating from our "LDH"
experience, but with some relatively solid examples elsewhere,
that a particular character may be a perfectly valid, stable,
and well-understood member of a script, but may still be unwise
or inappropriate for use in domain names.

Now, while we can make likelihood statements with great
confidence, I don't believe that any of us have sufficient
knowledge of writing systems that are as yet uncataloged and
uncoded to be positive that some future one will not come along
that needs new, context-dependent, treatment or some special
exclusion rules.

Worse, and independent of the contextual problems, we have seen
situations, even with scripts that are well-known and in wide
use, where experts may disagree about what is required.  (It
isn't a particularly good example, but the frequent claim that
English can be written in ASCII is an example here.  I gather
that some of the discussions about Hangul may involve other
"differences of opinion among experts", but don't understand
those discussions nearly well enough to know whether this is
another example.)

Rule H is, I believe, intended to keep us out of the jaws of
the dragons -- the dragon of not knowing quite enough and the
dragon of cultural and political relationships and emotion
about them -- along our path.  It is a matter of substance
(e.g., does a new character raise contextual issues that
require special rules, whether regular expression-based or
otherwise?), of appropriateness (e.g., does one want
presentation-specific or position-specific characters in IDNs
and what harm is done by including or excluding them?), and
of authority.

The authority question is ultimately about how "we" decide that
a script, or a subset of a script, is ok (move appropriate
characters into "always").  The answer, I believe, is that
neither IETF nor UTC should try to make that decision although
it is critical that we impose bounds on it: no matter how much
various of us know or think we know, we are not going to be
considered authorities on a particular script by the linguistic
experts or language authorities who specialize in the languages
that use that script.  Part of this, bluntly, is about
cultural-linguistic politics: to take a language example, if
the Académie française says "those tables are acceptable for
representing a reasonable set of mnemonics based on French" we
should presumably believe them and, if someone later objects
based on a different model, tell them to work it out with the
Académie.  If some organization based in California or Virginia
(e.g., ICANN, Unicode Consortium, IETF, or ISOC) makes the same
claim and gets into an argument, we are in big trouble because
those who are affronted won't recognize us as an appropriate
authority.  That would put us at risk of having to try to make
retroactive and potentially very disruptive changes.

Even if the model that implies is adopted (and I realize that
I'm being a little vague here), there still needs to be a
process for recognizing language authorities and, more
important for IDN purposes, aggregates of language authorities
who can take reasonably authoritative positions about the use
of scripts or their components at the level of the protocols
themselves.   I'll address that in a separate note, but
probably not today.

That, however, brings us to the example of Han characters.
Completely independent of whatever issues do or do not exist
with Unicode coding and what the Unicode property categories
can tell us, I believe that we have a worked example of the
ideal way to handle a script that supports multiple languages.
In order to solve a somewhat different problem (of registry
restrictions and confusion-avoidance), a "Joint Engineering
Team" came together.  Governments and linguistic experts in the
relevant CJK countries were consulted and worked together to
produce two things.  One was the "JET Recommendations" (RFC
3743) which is about registry procedures and irrelevant to our
protocol needs.  The other was a set of tables (the Chinese one
is described in RFC 4713 and that table and others appear at
http://www.iana.org/assignments/idn/registered.htm) that,
together, probably constitute, not a list of Han characters
that meet some Unicode or other criterion based on the basic
properties of the characters, but a list selected for
appropriateness for IDN use.  Now, one would not want to use it
that way without having a conversation with the JET team and
(directly or indirectly) their linguistic advisers, but one
model of the theory behind "Rule H" is that at least the
intersection of the "permitted" Han characters in the Chinese
(CNNIC, TWNIC, HKNIC, MONIC), Japanese (JPNIC), and Korean
(KRNIC) tables for RFC 3743 use is the "always" subset of the
Han script and that any other Han characters go into "maybe"
somewhere... until and unless a case is constructed, with
appropriate authority, for adding them.
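
As a rough sketch of that model in Python: "always" is the
intersection of the per-registry permitted sets, and every other
Han character under consideration lands in "maybe".  The function
name and the toy data are my assumptions for illustration; the
memberships shown are invented and are not taken from the actual
registered tables.

```python
def classify_han(tables, candidates):
    """Split candidates into ("always", "maybe").

    tables:     mapping of registry name -> set of permitted characters
    candidates: all Han characters under consideration
    """
    permitted = [set(chars) for chars in tables.values()]
    # "always" = characters permitted by every table (empty if no tables).
    always = set.intersection(*permitted) if permitted else set()
    always &= set(candidates)
    # Everything else stays in "maybe" until a case is made for it.
    maybe = set(candidates) - always
    return always, maybe


# Invented toy data, NOT the real registry tables.
tables = {
    "zh": {"一", "二", "三"},
    "ja": {"一", "二"},
    "ko": {"一", "三"},
}
candidates = {"一", "二", "三", "四"}

always, maybe = classify_han(tables, candidates)
```

With this toy input only the one character present in all three
tables ends up in "always"; the other three remain in "maybe".
Comparing such a result against a list built from Unicode
properties would be a plain set difference in each direction.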

It would, of course, be interesting to see how similar that
list is to the list built up from Unicode properties.  But, if
they are different, a combination of the JET-based lists would
appear to be the preferable one because it was constructed by
local experts, working to the same set of assumptions,
representing all of the languages that use the script, and
evaluating characters specifically for their appropriateness to
IDNs.  

I don't expect that we will see that level of collaboration and
cooperation with many other scripts that support more than one
language, but it is the target to which we should aspire, IMO.
I don't know whether the logical representatives of language
groups who share a script can manage to get along well enough to
develop recommendations, but we can hope that a policy of "stay
on the 'maybe' list until you can agree" will help us out.
Other suggestions welcome, of course, but trying to externally
impose rules on the users of a script and the cultures involved
does not strike me as an idea with a good future.

     john