What rules have been used for the current list of codepoints?

Thu Dec 14 23:43:04 CET 2006

Moving up a few thousand feet, I believe we are missing some 
things here.  In no particular order and with the understanding 
that I am deliberately not attempting to be precise in order to 
emphasize the principles...

(I got interrupted for several hours before finishing this; one 
part of it is partially covered by some subsequent discussion, 
but only part, so please bear with me.)

(1) As soon as someone says "X is ok because it can be used only 
in context Y, even though it would be problematic in other 
contexts", we have taken ourselves beyond rules about individual 
code points and into a world in which one must be able to 
rigorously define Y and the relationship.  This would seem to
apply to various separators that would look like prohibited
characters in other contexts, such as non-breaking spaces and
shape-approximations to apostrophes.  It would also apply to
any characters that would disappear (become invisible) outside 
some specialized contexts in which they have a well-defined and 
obvious impact on whatever surrounds them, such as ZWJ and ZWNJ. 
And, finally, this impacts any model that suggests that certain 
combining characters be permitted only in conjunction with 
particular scripts.  Look-ahead is not part of IDNA (or 
nameprep/stringprep) now.  While adding it is not infeasible, it 
isn't a small step.
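
To make the look-around requirement in (1) concrete, here is a
minimal Python sketch of a context-dependent check.  The context
chosen (a joiner is acceptable only directly after a code point
with canonical combining class 9, i.e., a virama) is purely an
assumption for illustration, and the name joiners_in_context is
hypothetical; nothing like this is part of IDNA or
nameprep/stringprep as discussed here.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER
VIRAMA_CCC = 9   # canonical combining class assigned to viramas


def joiners_in_context(label: str) -> bool:
    """Return True if every ZWJ/ZWNJ directly follows a virama.

    Hypothetical rule, for illustration only: the "context Y" is
    assumed to be "immediately after a code point whose canonical
    combining class is 9".
    """
    for i, ch in enumerate(label):
        if ch in (ZWNJ, ZWJ):
            # A joiner at the start has no preceding code point at all.
            if i == 0:
                return False
            # The decision depends on the neighbor, not on the joiner.
            if unicodedata.combining(label[i - 1]) != VIRAMA_CCC:
                return False
    return True


# KA, VIRAMA, ZWNJ, SSA: the joiner follows a virama.
devanagari = "\u0915\u094d\u200c\u0937"
# The same joiner between two ASCII letters.
latin = "a\u200cb"
```

The only point is that the verdict on U+200C cannot be reached
by examining U+200C alone; the check has to inspect a neighbor,
which a table of per-code-point rules cannot express.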

(2) One of our principles (or, if you prefer, meta-rules) is "no 
non-language characters".  The input from the user/registrant/
good-DNS-behavior side of things has been pretty clear about
that.  If we have a script that contains a number of
non-language characters, and those characters are not identified 
by some class or property that permits them to be discriminated 
from the rest of the script, then we have a problem -- either 
with the principle or with the way we have so far gone about 
this.  That is the issue with the IPA Block: clearly it contains 
some characters we need.  Clearly it contains some characters 
that violate this principle.  If we can neither keep the block
as a whole nor eliminate it as a whole, either the principle
needs to give way or we need
a different criterion or rule set.  It is not clear what those 
might be.

(3) We cannot establish a principle that strings coming into 
IDNA (or Nameprep) must already be normalized (to NFC at least). 
The rule that NFKC(cp) must equal cp is well and good, but,
taken by itself, I think it eliminates all sequences involving
combining characters for which there are precomposed forms
and may have some other ill effects.  Am I missing something in
this, or does the rule need further refinement?  (Note that this
interacts with (1) above.)
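
One way to probe the question in (3) is to compare the
per-code-point test with a test on the whole string.  The sketch
below uses Python's unicodedata module, whose Unicode version is
not the one IDNA is tied to, so it illustrates the distinction
rather than settling it.  The helper names is_nfkc_stable and
string_is_nfc are hypothetical.

```python
import unicodedata


def is_nfkc_stable(cp: str) -> bool:
    """Per-code-point rule: NFKC(cp) must equal cp."""
    return unicodedata.normalize("NFKC", cp) == cp


def string_is_nfc(s: str) -> bool:
    """Whole-string test: is s already in NFC?"""
    return unicodedata.normalize("NFC", s) == s


precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Apply the per-code-point rule to each element of the sequence.
per_cp = [is_nfkc_stable(c) for c in decomposed]

# Code points that fail the per-code-point rule outright:
nbsp_ok = is_nfkc_stable("\u00a0")      # NO-BREAK SPACE maps to U+0020
ligature_ok = is_nfkc_stable("\ufb01")  # LATIN SMALL LIGATURE FI maps to "fi"

# The sequence is a separate question from its individual members.
sequence_ok = string_is_nfc(decomposed)
```

In this sketch each element of the decomposed sequence is
examined in isolation, while the sequence as a whole is examined
separately; whether a sequence is acceptable is a property of
the string, and that is where this interacts with (1).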

     john