What rules have been used for the current list of codepoints?

Kenneth Whistler kenw at sybase.com
Fri Dec 15 01:29:01 CET 2006


Trying not to repeat things that Mark has already responded
on these topics...

> (1) As soon as someone says "X is ok because it can be used only 
> in context Y, even though it would be problematic in other 
> contexts", we have taken ourselves beyond rules about individual 
> code points and into a world in which one must be able to 
> rigorously define Y and the relationship.   This would seem to 
> apply to various separators that would look like prohibited 
> characters in other contexts, such as non-breaking spaces, 
> shape-approximations to apostrophes. 

I don't think we have any particular problems for spaces,
at least. Nobody here is proposing that gc=Zs characters be included.

I'm guessing that the "shape-approximations to apostrophes" is
a reference to issues like the Hebrew geresh and gershayim,
which are formally of the punctuation class (gc=Po), but which
are needed for ordinary Hebrew orthography. I think such
cases should be argued on a script-by-script basis when we
get to them, but I agree that this is unlikely to be
fruitfully argued in a "can only be used in context Y" framework.

My take here is that if a character is in the inclusion list
for IDNA (and StringPrep more generally) it is in the list,
and IDNA itself isn't going to define script-specific contexts
of usage. Such prescriptions would have to either be up to
individual registrars, as you suggest in klensin-idnabis-issues,
or would be a matter for end user agents to deal with, along
the lines suggested in UTR #36, if lookalikes or cross-script
spoofing, or other problems raise security issues.
 
> It would also apply to 
> any characters that would disappear (become invisible) outside 
> some specialized contexts in which they have a well-defined and 
> obvious impact on whatever surrounds them, such as ZWJ and ZWNJ.

See what Mark suggested on ZWJ and ZWNJ.
 
> And, finally, this impacts any model that suggests that certain 
> combining characters be permitted only in conjunction with 
> particular scripts.

I don't think that step is desirable or necessary. It doesn't
fit the usage model of Unicode, either. We are better off simply
removing unneeded combining characters (such as those used
in religious text annotation) from the inclusion list, and stopping
there.

> Look-ahead is not part of IDNA (or 
> nameprep/stringprep) now.  While adding it is not infeasible, it 
> isn't a small step.
> 
> (2) One of our principles (or, if you prefer, meta-rules) is "no 
> non-language characters".  The input from the user/ registrant/ 
> good-DNS-behavior side of things has been pretty clear about 
> that.

I concur with this, by the way. The rub is that people may
have different notions of what counts as a "non-language character".

What *I* mean by (not (non-language characters)) are those characters
which are known to be or are likely to be used in orthographies
for the representation of words in natural languages. That
is why we start off with letters (and syllables and ideographs)
gc = [Lu, Ll, Lo, Lm] and combining marks gc = [Mc, Mn], because
those classes are set up to correspond more or less to that
notion of characters used in orthographies for the representation
of words in natural languages. That criterion omits all the
problematical junk (symbols, punctuation, controls, spaces, etc.)
in one fell swoop, and is an approximation to what is needed.

Add in digits because people *do* use them in internet identifiers.

Then take a hard look at the resulting set to pull out stuff
not needed, either because the characters are letters used only
in extinct, historic writing systems, for example, or because
they are combining marks intended only for symbols.
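
For concreteness, here is a minimal sketch of that category-based
derivation, assuming Python's unicodedata module. The category set
simply mirrors the paragraphs above, and the function name is
illustrative, not a proposed rule:

    import sys
    import unicodedata

    # General categories named above: letters (Lu, Ll, Lo, Lm) and
    # combining marks (Mc, Mn), plus decimal digits (Nd) because
    # digits are in fact used in internet identifiers.
    INCLUDED_CATEGORIES = {"Lu", "Ll", "Lo", "Lm", "Mc", "Mn", "Nd"}

    def candidate_inclusions():
        """First-cut candidate set: every code point whose general
        category is one of the included categories (unassigned code
        points come back as Cn and so drop out automatically)."""
        return {
            cp
            for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) in INCLUDED_CATEGORIES
        }

    if __name__ == "__main__":
        print(len(candidate_inclusions()), "candidate code points")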

What I am sure we shouldn't do, *particularly* for the Latin script,
is assume that "language characters" should be defined by
starting from a catalog of known alphabets of known languages, and
only adding a character to the set if we document its use
in German or Swedish or Estonian or Maltese or Igbo or ...
(iterated several thousand times).

>    If we have a script that contains a number of 
> non-language characters, and those characters are not identified 
> by some class or property that permits them to be discriminated 
> from the rest of the script, then we have a problem -- either 
> with the principle or with the way we have so far gone about 
> this.  That is the issue with the IPA Block: clearly it contains 
> some characters we need.  Clearly it contains some characters 
> that violate this principle.

I would claim that the only IPA characters for which you could
hope to justify that claim would be the 5 IPA characters
for disordered speech (U+02A9..U+02AD). And frankly I don't
think they are worth wasting our time agonizing over. Having
them in the IDNA inclusion list isn't going to hurt anyone
even if nobody ever comes up with an official orthography 
with a letter for a lisp, lip smack, or teeth gnashing.
(Even though, at this point in the discussion, I sometimes
feel like putting the teeth gnashing in. ;-) )

But if somebody is *truly* perturbed about such things, then
such letters can simply be omitted from the derivation of
the IDNA inclusion list by a range-based exclusion rule.
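
By way of illustration only, such a range-based exclusion could be
layered on top of the category sketch above; the range list and
helper below are hypothetical, and U+02A9..U+02AD appears only
because it is the example from the previous paragraph:

    # Hypothetical range-based exclusion rule applied to the
    # candidate set from the earlier sketch.
    EXCLUSION_RANGES = [
        (0x02A9, 0x02AD),  # IPA additions for disordered speech
    ]

    def apply_exclusions(candidates, ranges=EXCLUSION_RANGES):
        """Drop every code point that falls inside an exclusion range."""
        excluded = {cp for lo, hi in ranges for cp in range(lo, hi + 1)}
        return candidates - excluded

    # e.g.  inclusions = apply_exclusions(candidate_inclusions())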

> If we can neither keep it nor 
> eliminate it, either the principle needs to give way or we need 
> a different criterion or rule set.  It is not clear what those 
> might be.

Or... you seek the third alternative. The principle is a fine
one, and should guide the creation of the inclusion table.
The *problem* is believing that it is an absolute principle,
that it is well-defined (or well-definable if we just spend
enough time gathering input from the "language communities")
and that we can write an unambiguous rule based on it to
generate the IDNA inclusion table.

Principles should guide the discussion and the approaches
one takes. Then you look for *explicit* criteria and rules
when it comes to actually creating a table that can be
built and maintained algorithmically.

--Ken


