SASLprep200x

Kenneth Whistler kenw at sybase.com
Thu Jan 11 01:05:38 CET 2007


Patrik said:

> My view, that was sort of invented during a talk with Cary the other  
> week, is that we can see it like this:
> 
>   1 We have a set of theoretical codepoints. 0x0000 and up.
>   2 Unicode Character Set include for each version a subset of those  
> codepoints.
>   3 Stringprep allow a subset of the Unicode Character Set.
>   4 A profile of stringprep (like Nameprep) is a subset of stringprep  
> allowed "stuff".
>   5 Registry policy talk about a subset of Nameprep.
>   6 Registrar policy talk about a subset of the registry policy.
>   7 User interface issues is a subset of the registrar policy (might  
> at least create a subset).
> 
> I really wanted it to be "7 layers" ;-)

This is neat and no doubt has its appeal here, but isn't
actually very helpful in this discussion.

As for levels 1 and 2, that is an inaccurate representation
of the Unicode character encoding model: it mixes the notions
of code points and encoded characters. Expressed that way, a
subsetting relationship is neither accurate nor relevant.

What one has instead is a character encoding architecture
for Unicode that has 1,114,112 code points, designated by
integers 0..10FFFF. That is stable, has been the case since
Unicode 2.0, and will remain stable as long as Unicode is a
concern of anyone. So Unicode 5.0 has the same 1,114,112
code points, and there is no subsetting involved.

Technically, you could point out that Unicode has a subset
of the code points (= "code positions") allowed in the
International Standard ISO/IEC 10646, because 10646 still
provides for 2,147,483,648 code points, designated by
integers 0..7FFFFFFF. But in practice there is no difference
and no subsetting involved, because the current language
in 10646 specifies that 110000..7FFFFFFF are "permanently
reserved", which effectively removes them from the table
as far as anyone's implementations of 10646 are concerned.
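To put numbers on that, here is a rough Python sketch (mine, not part of the specifications; the constants are just the figures cited above):

```python
# Illustrative constants: the fixed Unicode code space versus the
# original ISO/IEC 10646 code space.
UNICODE_MAX = 0x10FFFF       # top of the Unicode code space, stable since 2.0
ISO10646_MAX = 0x7FFFFFFF    # original 10646 ceiling; 110000..7FFFFFFF
                             # are now permanently reserved

def is_unicode_code_point(n: int) -> bool:
    """True if n designates one of the 1,114,112 Unicode code points."""
    return 0 <= n <= UNICODE_MAX

assert UNICODE_MAX + 1 == 1_114_112          # the 1,114,112 code points
assert is_unicode_code_point(0x10FFFF)
assert not is_unicode_code_point(0x110000)   # permanently reserved in 10646
```

In practice, as noted, the permanent reservation of 110000..7FFFFFFF means the two code spaces coincide for any implementation.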

Encoded characters are another animal. They are a mapping
(or "assignment") of an abstract character *to* a code 
point. And the cross-version subsetting relation which
is pertinent and guaranteed by the stability policies
of the UTC and WG2 is that the repertoire of encoded
characters for any given version of the Unicode Standard
(starting with Unicode 1.1) is a subset of the
repertoire of encoded characters of all subsequent
versions of the standard. (The same relationship can
be described for each edition of 10646 vis-a-vis any
amendments to that edition, although with slightly
different details regarding the exact sets of characters
in the published repertoires at any given time.)

Short summary to this point: the set of code points and
the set of encoded characters in a particular version
of Unicode consist of different element types
defined on different domains. One is not a subset of
the other.
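Python's unicodedata module (which snapshots one particular Unicode version) can illustrate the distinction -- every integer in 0..10FFFF is a code point, but only the assigned ones carry an encoded character:

```python
import unicodedata

def is_encoded_character(cp: int) -> bool:
    """True if this code point has a character assigned to it in the
    Unicode version Python was built against (category other than Cn,
    'unassigned')."""
    return unicodedata.category(chr(cp)) != "Cn"

# Both of these are valid code points; only one is an encoded character.
assert is_encoded_character(0x0041)      # LATIN CAPITAL LETTER A: assigned
assert not is_encoded_character(0x0378)  # a code point, but unassigned
```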

>   3 Stringprep allow a subset of the Unicode Character Set.

This is a misleading summary in a couple of ways.

First, while this might seem like quibbling, the Unicode
Standard nowhere talks about "the Unicode Character Set".
If it meant anything, it would presumably refer to Unicode,
in the abstract, as an architecture for encoding characters,
without reference to particular versions.

In the case of Stringprep and our discussion here, I think
we need to be talking instead about:

  Stringprep specifies, for any given version of the
  Unicode Standard, a particular allowed subset of the
  repertoire (of encoded characters) of that version.
  
And the trick for defining Stringprep most usefully is to
make it so that the specification of the subset of the
repertoire is evident and easily derivable for each
subsequent version of the Unicode Standard (5.0 and onwards)
without having to change the specification in Stringprep itself.

Second, and more problematical, Stringprep doesn't actually
define a set of allowed characters, but rather has to
define a set of allowed *strings*. Part of that -- the
most obvious part, of course -- is specifying which characters
are allowed in the strings, but the other part, having to
do with placement of combining marks, bidi, and so on (including
any decisions we take about allowing ZWJ or ZWNJ in specific
contexts), must logically be taken as making the domain
of Stringprep to be *Unicode strings*, not *Unicode characters*.
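A toy sketch of that distinction -- the function and its single rule are hypothetical, invented for illustration, and are not Stringprep itself:

```python
import unicodedata

def string_acceptable(s: str) -> bool:
    """Hypothetical string-level check: even if every character in s is
    individually allowed, the *string* may still be rejected for a
    contextual reason -- here, a leading combining mark with no base
    character."""
    if not s:
        return False
    if unicodedata.combining(s[0]):  # first character is a combining mark
        return False
    return True

assert string_acceptable("a\u0301")      # 'a' + COMBINING ACUTE ACCENT: fine
assert not string_acceptable("\u0301a")  # combining mark with no base: rejected
```

The per-character membership test and the contextual test operate on different domains: characters in the first case, strings in the second.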

>   4 A profile of stringprep (like Nameprep) is a subset of stringprep  
> allowed "stuff".

As Erik pointed out, a profile of Stringprep (at least as
currently defined), could *add* characters, as well as subtract
them. And Stringprep really has to do with strings, not just
characters, and a profile could, in principle, otherwise impact
the definition of allowed strings. So once again, a *subset*
relationship is not appropriate here.

I'm not going to comment about the appropriateness of the
use of subsetting or even the terminology involved in Patrik's
levels 5, 6, and 7, because those are not the area of my
expertise... I just don't know what registries and registrars
do here and how they view the impact of their policies and
restrictions to be on the sets of strings they are concerned
with for domain names.

Despite all my criticism here, however, I do think there is
something to be taken away from Patrik's "protocol stack".
The area we should be focussing on here is precisely what
he has labelled Level 3 and 4: Stringprep and Nameprep.
(Whether they actually constitute 2 "levels" is up for
grabs -- it depends on how the specifications
are written.)

What Patrik characterized as Level 1 and 2 are out of scope.
The business of defining the repertoire of the Unicode
Standard belongs to the UTC and WG2.

What Patrik characterized as Levels 5 - 7 are also out
of scope. We clearly need to talk about them, but I think
mostly in the context of understanding what we *shouldn't*
be trying to accomplish in the idna-update scope of
work.

> What we talked about Cary and myself was further, what are global  
> policies and what are local ones. My view is that 1-4 are global  
> policies, without connection to "languages",

I don't think we're talking about "policies" here, but I
concur with the essence of this observation -- that the
definition of Stringprep and Nameprep specifications should
not be done with language-specific concerns in mind. They should
concern strings in general and should specify restrictions,
if any, in terms of inherent properties of characters --
including their script identity -- that aren't defined on
a language-by-language basis.

In other words, whatever Stringprep and Nameprep do, they
should have nothing to do with orthographies or spelling
rules, and such, but only with generic well-formedness
constraints on string formation and restrictions on the
use of various types of characters.

> while 5-7 can be "local  
> policy" and include connections to languages and whatever else. 

I agree.
 
> Further, 1-4 do not reference individual codepoints, but instead  
> "classes of characters" or similar.

See above. Levels 1-2 are out of scope here, and the
specification of the repertoire of any particular
version of the Unicode Standard absolutely *does* deal with
individual code points.

However, Stringprep and Nameprep specifications should, I agree,
be concerned with classes of characters. It would be a mistake,
though, to elevate that to an absolute principle, precisely
because there are exceptional characters. The need to talk
specifically about ZWJ and ZWNJ, for example, is a case in
point.

Furthermore, there is no guarantee, a priori, that
any given assemblage of existing Unicode character properties
will provide exactly the right subsetting of repertoire to
give best results for Stringprep and Nameprep. Rather, what
we could guarantee is that once an appropriate set of criteria
for inclusion in Stringprep and/or Nameprep is agreed upon,
a Unicode character property could be defined *a posteriori*
and be maintained in such a way as to get appropriate results
for future versions of the Unicode Standard.
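A hypothetical illustration of such an *a posteriori* derived property -- the inclusion criteria and the exception list here are invented for the example, not proposed:

```python
import unicodedata

# Invented exception list: characters excluded despite meeting the
# general criteria (e.g. OHM SIGN, a letter by General_Category).
EXCEPTIONS = {0x2126}

def derived_allowed(cp: int) -> bool:
    """Hypothetical a-posteriori property: once the inclusion criteria
    are agreed (here: letters and decimal digits, minus exceptions),
    the property is computed from existing per-version data and can be
    re-derived mechanically for each new Unicode version."""
    if cp in EXCEPTIONS:
        return False
    cat = unicodedata.category(chr(cp))
    return cat.startswith("L") or cat == "Nd"

assert derived_allowed(ord("a"))
assert derived_allowed(ord("5"))
assert not derived_allowed(0x2126)   # excluded by the exception list
```

The point is only that the property is *defined by* the agreed criteria, rather than the criteria being forced to match some pre-existing property.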

> Because of this, a rule like "combination character ring above can  
> only exists after latin letter a" can only exists in rule set 5-7,  
> while "combination character of type foo can only be after base  
> character bar" can at least theoretically be also in rules 1-4 (in  
> reality 3-4).

I think the concept is mostly correct here, but in detail, no.

Rather I'd say that rules of the sort "only letters used in
the standard orthography of Bulgarian" belong in policies
established at your levels 5 - 7. Rules of the sort "ZWJ is
only allowed in the following constrained context", on the
other hand, while referring to a specific character, rather
than a character property, properly belong in Stringprep
and/or Nameprep.

> Yes, we have seen codepoints that do have exactly the same properties  
> and all, but still have to have different rules somewhere in the  
> architecture I outlined above. This is a case where I think UTC  
> should have a more careful look at whether their definitions are  
> correct and/or whether they have to redo some classification and/or  
> add some more metadata to the codepoints.

The fallacy here is in assuming that *existing* defined character
properties are exhaustive *and* definitive for *all* conceivable
future processing of Unicode characters. In fact this has
been disproven again and again, which is why the UTC keeps defining
new character properties. The need to give a rigorous definition
of a new algorithm (e.g. linebreaking, or wordbreaking, or
normalization, etc.) often leads directly to the definition
of new character properties whose definitions are carefully
tailored to produce the right results for the algorithm.

I really see no reason to expect anything different of Stringprep
and Nameprep. The engineering requirements for IDNA are
distinct from those for normalization or case folding, for
example, so there should be every expectation that, since
IDNA involves specifying a restricted set of allowed characters,
that task may not map *exactly* onto existing Unicode
character properties.

Case in point: the Unicode Character Database nowhere defines
a formal "historic" property for characters. But we seem
to have consensus here that it would be a good idea to
restrict IDNA by omitting most (if not all) of the historic
scripts in Unicode from the get go. Unless you just want to
give up on that general idea and allow Sumero-Akkadian cuneiform
and Gothic and Phoenician and, and... into IDNA, you basically
have no alternative but to agree that (some X) are in and
(some Y) are out, and then *a posteriori* define a
property that ratifies the set definition you have put
together based on other criteria.
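For illustration only, here are three of the historic-script blocks named above with their standard Unicode block ranges (this is nowhere near a complete "historic" set -- it only sketches what an a-posteriori exclusion list looks like):

```python
# Standard Unicode block ranges for three historic scripts; the set
# itself, and the idea of keying the exclusion to blocks, are purely
# illustrative.
HISTORIC_BLOCKS = [
    (0x10330, 0x1034F),  # Gothic
    (0x10900, 0x1091F),  # Phoenician
    (0x12000, 0x123FF),  # Cuneiform (Sumero-Akkadian)
]

def in_historic_block(cp: int) -> bool:
    """True if cp falls in one of the listed historic-script blocks."""
    return any(lo <= cp <= hi for lo, hi in HISTORIC_BLOCKS)

assert in_historic_block(0x10330)       # GOTHIC LETTER AHSA
assert not in_historic_block(ord("a"))  # Basic Latin stays in
```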

> If we in the IETF process discover that codepoints are to be treated  
> differently while UTC give exactly the same classification and all to  
> them, then something is broken.

What is broken is the assumption that any one classification
applies to all processes.

--Ken



