What rules have been used for the current list of codepoints?

Kenneth Whistler kenw at sybase.com
Fri Dec 15 00:29:31 CET 2006


Michael said:

> > You can't finesse this algorithmically.
> 
> I thought this was why we had classes. What stops us from having a  
> new class that is "suitable for domain names"?

O.k. *throw caution light*

"class" in this discussion should be restricted to referring to
one of the values of the General_Category property. Thus
General_Category=Lu is the "class" of uppercase letters,
General_Category=Nd is the "class" of numeric digits, and so on.

There are something in the vicinity of 100 defined Unicode
character *properties*. General_Category is just one of those
properties -- perhaps the most prominent and important of them,
but one, nonetheless.
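(As a concrete illustration: Python's unicodedata module exposes
General_Category directly, and the two-letter codes it returns are the
same Unicode property values referred to above.)

```python
import unicodedata

# General_Category values for a few sample characters:
print(unicodedata.category('A'))  # 'Lu' -- uppercase letter
print(unicodedata.category('a'))  # 'Ll' -- lowercase letter
print(unicodedata.category('7'))  # 'Nd' -- decimal digit
```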

What Patrik is talking about is defining a new Unicode
character *property*, a binary property, by the way, that
would partition all Unicode code points into those that
are "suitable for domain names" and those that are not.

And of course, there is nothing in principle which stands in the
way of doing that. One of the points of having new properties
or particular derived properties is to make them available to
specifically define sets of Unicode characters appropriate for
particular behavior in one algorithm or another; there really
is nothing new about that in the IDNA context.

The *substantive* issue, however, is that defining a
property requires having a very, very clear definition of
the set intended to be referenced by that property.
And that is precisely the exercise that Mark and I are
engaging in here, to specify the list of criteria for
membership in the set that at the end of the process might
end up identified externally simply by means of a maintained
single property: [+/- IDNAinclusion] or some such.

The fact that the definition and derivation of an IDNAinclusion
binary Unicode property might be somewhat complex is, in
itself, o.k. Note that the FC_NFKC_Closure
Unicode character property referenced in klensin-idnabis-issues
is itself a complex, derived property. The reason for
defining it is to make it simpler to just have a table
listing all the characters that have that derived property.
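(To give the flavor of that kind of derived property -- and this is
only a rough sketch, not the exact definition in the Unicode
specification -- the idea behind a "closure" property is iterating
case folding and NFKC normalization until a fixed point is reached,
so that the resulting table of mappings is closed under both
operations.)

```python
import unicodedata

def fold_nfkc(s: str) -> str:
    # One round of case folding followed by NFKC normalization.
    return unicodedata.normalize('NFKC', s.casefold())

def closure(s: str) -> str:
    # Iterate to a fixed point -- the "closure" in the property's name.
    nxt = fold_nfkc(s)
    while nxt != s:
        s, nxt = nxt, fold_nfkc(nxt)
    return s
```

For example, closure('\u212A') (KELVIN SIGN) yields 'k', and
closure('\u1E9E') (LATIN CAPITAL LETTER SHARP S) yields 'ss'.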

But while a derivation can be complex -- and that is o.k. --
it cannot be totally ad hoc, with contents driven by
particular applications and registries. If you go that route, you can
neither maintain nor stabilize the property, because you
never know when somebody is going to require that some
arbitrary additional thing needs to be added, and what
criteria they might be bringing to the table to justify
their application.

On the other hand, if a derivation is by rule (as Mark's
suggestion has been), then extensions and maintenance are
mostly automatic. Furthermore, if the derivation depends
mostly on other stable properties, then it is much
easier to provide stability guarantees for the property
as well.
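(A minimal sketch of what derivation by rule looks like. The
INCLUDED_CATEGORIES set here is invented purely for illustration --
the actual criteria under discussion are considerably more involved
than a General_Category filter.)

```python
import sys
import unicodedata

# Hypothetical inclusion rule, for illustration only:
INCLUDED_CATEGORIES = {'Ll', 'Lo', 'Lm', 'Mn', 'Mc', 'Nd'}

def idna_included(cp: int) -> bool:
    """Derive a binary property by rule from General_Category."""
    return unicodedata.category(chr(cp)) in INCLUDED_CATEGORIES

# Regenerating the table for a new Unicode version is then automatic:
table = [cp for cp in range(sys.maxunicode + 1) if idna_included(cp)]
```

The point is that when a new Unicode version assigns new characters,
re-running the rule regenerates the table with no per-character
judgment calls.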

As discussed yesterday, *formally* the UTC can stabilize
any property by the following means.

Property X (derived from union of Y + Z)

At some future point, for whatever reason, some character
gets added to the set of characters defined by property Y
and some other character gets removed from the set
of characters defined by property Z.

UTC response, if Property X *must* be stable:

Define Other_Add_To_X property, and assign it to the character
removed from the set of characters defined by property Z.

Define Other_Subtract_From_X property, and assign it to the
character added to the set of characters defined by property Y.

Redefine Property X as derived from [Y + Z + Other_Add_To_X -
Other_Subtract_From_X]

This is messy, of course, and it ends up in the maintenance of
these weird little exception set "Other_xxx" properties, but
it is a formal means whereby any existing property can be
completely stabilized, if there are sufficient grounds for
doing so.
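(In set terms, the whole maneuver is just this -- the code point
values below are toy data, not real IDNA characters. Before the
change, X was derived from Y = {0x0061} and Z = {0x0030, 0x00DF};
then 0x0041 was added to Y and 0x00DF removed from Z.)

```python
def stabilized_x(y, z, other_add_to_x, other_subtract_from_x):
    """Re-derive X with the two exception sets applied."""
    return (y | z | other_add_to_x) - other_subtract_from_x

# Property values after the hypothetical Unicode update:
Y = {0x0061, 0x0041}   # 0x0041 newly added to Y
Z = {0x0030}           # 0x00DF removed from Z

X = stabilized_x(Y, Z,
                 other_add_to_x={0x00DF},       # keep it in X anyway
                 other_subtract_from_x={0x0041})  # keep it out of X

# X is {0x0061, 0x0030, 0x00DF} -- identical membership to before
# the update, so the property is stable despite the changes to Y and Z.
```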

Before we go there for a potential IDNAInclusion property,
however, it would be best to explore the route that Mark
is taking, and see if we can't live with the results of
a straightforward (if somewhat complicated) derivation
and convince ourselves that the results of that derivation,
when Unicode 5.1 and Unicode 6.0 and so on are eventually
published, will continue to be completely appropriate for IDNA
(and in particular StringPrep) purposes.


> I.e. my point is that the list of rules already now (before we start  
> doing individual inspections of code points) is quite complex. We  
> have already discussed how to explain the rules so that they are not  
> confusing in what to expect if more than one rule matches etc.


--Ken


