What rules have been used for the current list of codepoints?
John C Klensin
klensin at jck.com
Thu Dec 14 23:43:04 CET 2006
Moving up a few thousand feet, I believe we are missing some
things here. In no particular order and with the understanding
that I am deliberately not attempting to be precise in order to
emphasize the principles...
(I got interrupted for several hours before finishing this; one
part of it is partially covered by some subsequent discussion,
but only part, so please bear with me.)
(1) As soon as someone says "X is ok because it can be used only
in context Y, even though it would be problematic in other
contexts", we have taken ourselves beyond rules about individual
code points and into a world in which one must be able to
rigorously define Y and the relationship. This would seem to
apply to various separators that would look like prohibited
characters in other contexts, such as non-breaking spaces and
shape-approximations to apostrophes. It would also apply to
any characters that would disappear (become invisible) outside
some specialized contexts in which they have a well-defined and
obvious impact on whatever surrounds them, such as ZWJ and ZWNJ.
And, finally, this impacts any model that suggests that certain
combining characters be permitted only in conjunction with
particular scripts. Look-ahead is not part of IDNA (or
nameprep/stringprep) now. While adding it is not infeasible, it
isn't a small step.
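
To make the point in (1) concrete, here is a rough sketch of what
a context-dependent check might look like. The rule it encodes
(ZWJ/ZWNJ accepted only when the preceding character has canonical
combining class 9, i.e. a virama) is an assumption chosen purely
for illustration and is not something proposed in this message;
the function names are likewise hypothetical. What it shows is
that the decision cannot be made from a per-code-point table
alone: the checker has to look at a neighbouring character.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER
VIRAMA_CCC = 9   # canonical combining class of virama characters


def joiner_allowed_at(label: str, i: int) -> bool:
    """Hypothetical context rule: the joiner at index i is accepted
    only if the character immediately before it is a virama."""
    if i == 0:
        return False  # nothing precedes it, so there is no context
    return unicodedata.combining(label[i - 1]) == VIRAMA_CCC


def check_label(label: str) -> bool:
    """Return True if every ZWJ/ZWNJ in the label satisfies the
    (assumed) context rule. Other characters are not examined."""
    return all(
        joiner_allowed_at(label, i)
        for i, ch in enumerate(label)
        if ch in (ZWJ, ZWNJ)
    )


# Devanagari KA + VIRAMA + ZWJ + SSA: joiner follows a virama.
in_context = check_label("\u0915\u094d\u200d\u0937")
# Latin letters around a ZWJ: the joiner has no defined effect here.
out_of_context = check_label("ab\u200dc")
```

Even this toy version needs the whole label plus an index, which
is the step beyond "rules about individual code points" described
above.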
(2) One of our principles (or, if you prefer, meta-rules) is "no
non-language characters". The input from the user/registrant/
good-DNS-behavior side of things has been pretty clear about
that. If we have a script that contains a number of
non-language characters, and those characters are not identified
by some class or property that permits them to be discriminated
from the rest of the script, then we have a problem -- either
with the principle or with the way we have so far gone about
this. That is the issue with the IPA Block: clearly it contains
some characters we need. Clearly it contains some characters
that violate this principle. If we can neither keep it nor
eliminate it, either the principle needs to give way or we need
a different criterion or rule set. It is not clear what those
might be.
(3) We cannot establish a principle that strings coming into
IDNA (or Nameprep) must already be normalized (to NFC at least).
The rule that NFKC(cp) must equal cp is well and good, but,
taken by itself, I think it eliminates all sequences involving
combining characters for which there are precombined sequences
and may have some other ill effects. Am I missing something in
this, or does the rule need further refinement? (Note that this
interacts with (1) above.)
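
As a small aid to the question in (3), the following sketch uses
Python's standard unicodedata module to compare two readings of
"NFKC(cp) must equal cp": applied to each code point separately,
and applied to a whole string. The helper names are hypothetical,
and the example does not settle which reading is intended; it only
shows that the two readings can disagree for a base letter
followed by a combining mark that has a precomposed equivalent.

```python
import unicodedata


def cp_is_nfkc_stable(cp: str) -> bool:
    """True if a single code point is unchanged by NFKC."""
    return unicodedata.normalize("NFKC", cp) == cp


def string_is_nfkc_stable(s: str) -> bool:
    """True if the whole string is unchanged by NFKC."""
    return unicodedata.normalize("NFKC", s) == s


decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE

# Each code point of the decomposed form passes the test on its own...
per_code_point = all(cp_is_nfkc_stable(c) for c in decomposed)
# ...but the sequence as a whole is rewritten to the precomposed form.
as_sequence = string_is_nfkc_stable(decomposed)
normalized = unicodedata.normalize("NFKC", decomposed)

# A compatibility character fails even the per-code-point test.
ligature_stable = cp_is_nfkc_stable("\ufb01")  # LATIN SMALL LIGATURE FI
```

Under these assumptions the per-code-point test accepts both "e"
and U+0301, while the string-level test rejects the pair unless
the input has already been normalized.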
john