Combining characters and accents
John C Klensin
klensin at jck.com
Mon Nov 27 20:36:01 CET 2006
Just to pull some threads back together and stop an explosion of
separate threads:
As I understand it from Patrik and other discussions, the
combining characters were deemed unacceptable in the most recent
versions of the tables not because of some plan about a
requirement for precomposed characters, or some misunderstanding
of NFKC, etc., but very simply because everything with
"Script Inherited" was excluded, on the grounds that "Inherited"
is not the name of a script used in any language (or really a
script at all).
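To make the point concrete, here is a small Python sketch (my own
illustration, not anything from Patrik's tables). Combining marks like
U+0308 carry Script=Inherited in the Unicode Character Database; the
stdlib does not expose the Script property directly, but the
General_Category "Mn" and a nonzero canonical combining class identify
the same characters, and NFC shows that such a mark composes with some
bases and not others:

```python
import unicodedata

diaeresis = "\u0308"  # COMBINING DIAERESIS, Script=Inherited in the UCD
print(unicodedata.name(diaeresis))       # COMBINING DIAERESIS
print(unicodedata.category(diaeresis))   # Mn (Mark, nonspacing)
print(unicodedata.combining(diaeresis))  # 230, its canonical combining class

# Under NFC the mark composes with a Latin base into a precomposed
# character, but a Han base has no precomposed form, so the mark survives:
print(unicodedata.normalize("NFC", "u\u0308"))        # single char U+00FC
print(len(unicodedata.normalize("NFC", "\u4e2d\u0308")))  # still 2 chars
```

So a blanket exclusion of Script=Inherited removes every such mark at
once, regardless of whether a given base/mark pair is sensible.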
Ok. That wasn't a good idea -- too blunt an instrument. What
we need to do now is to try to figure out if there is some way
to separate the reasonable ones (and, ideally, the reasonable
contexts) from those that are not. I don't know nearly enough
to have suggestions although some ideas that Sam and Michael
have suggested seem to me to be worth pursuing.
For future planning, we could, IMO, have gotten to "not a good
idea, try something else" much sooner had the comments been of
the character of "you have excluded that script; that doesn't
work because..." rather than excited-sounding claims about particular
characters and languages. The latter are fine, but prevented us
from quickly getting to "can't exclude all of 'Script
Inherited'" because we were looking at individual characters.
We have also learned something else which is a little scary (at
least to me): we cannot simply look at a codepoint and say
"valid" or "not valid". For a combining-anything, we may need
to look at a greater or lesser amount of context -- or we need
to conclude that the combination of a diaeresis with a Chinese
character is strictly a registration-time problem. If we can
look enough at context, then the issues with apostrophoids
(thanks, Cary) and non-spacing breaks get much easier: if they
can appear only in places where they naturally appear, are
required, and have some clear presentation impact, then the
issues of confusion with ASCII single quotes and invisible
characters in arbitrary strings largely disappear.
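A rough sketch of what a per-codepoint-plus-context check might look
like, purely as an illustration of the idea (the `plausible_base` rule
here is a hypothetical stand-in, not a proposal; a real rule would
consult the Script property and much more careful criteria):

```python
import unicodedata

def plausible_base(base: str, mark: str) -> bool:
    """Illustrative context check, not normative: a combining mark
    needs a letter as its base, and some pairings from unrelated
    writing systems (e.g. a diaeresis on a CJK ideograph) are rejected
    outright.  The name-based script test below is a crude stand-in."""
    if unicodedata.category(base)[0] != "L":
        return False
    if unicodedata.name(base, "").startswith("CJK") and \
       "DIAERESIS" in unicodedata.name(mark, ""):
        return False
    return True

def contextually_valid(label: str) -> bool:
    """Validate each combining mark against the character before it."""
    prev = ""
    for ch in label:
        if unicodedata.combining(ch):  # ch is a combining mark
            if not prev or not plausible_base(prev, ch):
                return False
        prev = ch
    return True

print(contextually_valid("u\u0308ber"))    # diaeresis on Latin 'u': fine
print(contextually_valid("\u4e2d\u0308"))  # diaeresis on a Han character: rejected
print(contextually_valid("\u0308abc"))     # mark with no base at all: rejected
```

The point is only that the decision needs a window wider than one
codepoint, which is exactly what a flat per-codepoint table cannot
express.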
But, we haven't yet figured out how to reflect context in the
type of tabular work Patrik is doing: if it requires changes to
the IDNA algorithm, rather than "merely" the tables, it implies
a more extensive change than we had hoped to have to make.
As Vint has pointed out in other notes, this is an iterative
process to see how close we can get to some optimal (and still
not precisely known) set of IDN-permitted characters by applying
what appear to be sensible rules from the collection of Unicode
character classes, script names, and [other] properties,
including, yes, block structure. We don't assume that any of
these, especially the blocks, are going to be useful but,
because they are there, they may be worth trying to see if they
bring us any closer. If one approach doesn't work, that is
fine: we move on and try something else. This process isn't
monotone convergent: no one should expect that. And it also
isn't necessarily as efficient as it might be in a perfect world
-- too bad, but that is how it goes.
What I think we are trying to do right now is to determine
whether it is possible to build an IDN-satisfactory list of
characters purely by manipulation of ("an algorithm over")
existing Unicode structure, classes, and properties. If we
conclude that we can't do that, then we are going to have to
start working through Unicode to generate an IDN-appropriate
property based on some criteria that I hope (but don't expect at
this point) can be simple and objective.
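The "algorithm over existing Unicode structure" approach can be
sketched in a few lines. The categories and the NFC-stability
requirement below are my own illustrative guesses at what "sensible
rules" might look like, not anything the group has agreed on:

```python
import unicodedata

def candidate(cp: int) -> bool:
    """Sketch of deriving an IDN candidate set purely from existing
    Unicode properties: admit lowercase/other/modifier letters, marks,
    and decimal digits, and require that the character be stable under
    NFC (it normalizes to itself).  Illustrative rules only."""
    ch = chr(cp)
    if unicodedata.category(ch) not in ("Ll", "Lo", "Lm", "Mn", "Mc", "Nd"):
        return False
    return unicodedata.normalize("NFC", ch) == ch

# Run the rule over a slice of the codespace to see what falls out:
permitted = [cp for cp in range(0x80, 0x3000) if candidate(cp)]
print(len(permitted), "candidate codepoints in U+0080..U+2FFF")
```

If no such rule set produces a satisfactory list, the fallback the
paragraph above describes is a new, purpose-built property maintained
character by character, which is the much more expensive path.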
No one expected this to be easy.