Table-building

Kenneth Whistler kenw at sybase.com
Fri Feb 2 02:28:48 CET 2007


John,

> We have an external mandate to get the symbols, drawing
> characters, punctuation, dingbats, etc., forever out of IDNs.
> "Out" as in "banned from registration, banned from lookup".
> That list, in terms of number of code points, is somewhat larger
> than the one you have suggested above.  It is also likely to
> grow if you add characters of those varieties to future versions
> of Unicode.

We all agree. I don't see why we are chasing our tails here.
Those haven't been in anybody's draft of a new inclusions
table, and nobody is proposing that they be added.
  
> If one could assume that those characters could be handled by
> simply banning their registrations, then I would agree with you
> and Ken -- that "banned" ("#2") list would not be a matter of
> great concern, especially for implementers.  But, as we have
> discussed in another context, there is no enforcement mechanism
> that permits us to assume that all registries, at all levels of
> the DNS tree, will behave reasonably, nor that some of these
> characters will not turn out to be good ways to spoof other
> things (the standard example for this has become "things that
> look like '/'", but there are others -- how many depends on how
> paranoid one is and what assumptions are made about fonts and
> glyphs).

So you write IDNAbis nameprep as you have indicated.

nameprep for registration: Such characters are not in the
  inclusion table in the first step. Strings are not valid.
  MUST NOT register.
  
nameprep for resolvers: Such characters are not in the
  inclusion table in the first step. Strings are not valid.
  MUST NOT resolve.
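
As a rough sketch of how little there is to it: the same first-step
check serves both sides. The table contents and the names
INCLUSION_TABLE, label_is_valid, may_register and may_resolve below
are hypothetical, made up for illustration; no actual IDNAbis table
is being quoted.

```python
# Hypothetical inclusion table: a-z, 0-9 and hyphen only.
# A real table would be far larger; this one exists purely for illustration.
INCLUSION_TABLE = (
    set(range(ord("a"), ord("z") + 1))
    | set(range(ord("0"), ord("9") + 1))
    | {ord("-")}
)


def label_is_valid(label, table=INCLUSION_TABLE):
    """First step: every code point of the label must be in the inclusion table."""
    return len(label) > 0 and all(ord(ch) in table for ch in label)


def may_register(label):
    """Registration side: an invalid string MUST NOT be registered."""
    return label_is_valid(label)


def may_resolve(label):
    """Resolver side: an invalid string MUST NOT be resolved."""
    return label_is_valid(label)


if __name__ == "__main__":
    # U+2713 CHECK MARK (a dingbat) and U+2044 FRACTION SLASH (a '/' lookalike)
    for candidate in ("example", "a\u2713b", "a\u2044b"):
        print(repr(candidate), may_register(candidate), may_resolve(candidate))
```

Both functions deliberately share one check, so a string rejected at
registration is also rejected at lookup.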
  
What is so difficult to agonize over here?


> As far as the middle-dot is concerned as an example of why one
> can't do this, I believe it is an example of something else --
> something that goes back to the intent of the original IETF-UTC
> agreement about stability.   To get away from that particular
> example, 

No, please go back to that particular example, instead of
leaving us hanging on tenterhooks.

Patrik intimated that we have a "problem" for Catalan.
We either resolve that one way for Catalan (and by the
way for many, many other orthographies that use a middle
dot -- this is not just a Catalan problem, although it
is only a cause célèbre in Catalunya) by adding U+00B7
MIDDLE DOT to the inclusions table or by not adding
it to the inclusions table.

Yes or no, please, and don't wiggle out of this by going
on to hypothetical Martian left wiggles. ;-)

> if you identify MARTIAN LEFT WIGGLE at U+90005 as
> punctuation in one version of Unicode, and then change your
> minds and decide it is really a letter (with or without some
> specific adjacency requirements), our expectation is that you
> will deprecate it in place and allocate a new MARTIAN LETTER
> LEFT WIGGLE at some other code point.  That new code point would
> then go into either "pending" or "ok", depending on other
> decisions.

Your expectations are completely unreasonable, in part because you
have attempted to wiggle out of the U+00B7 issue.

U+00B7 can*NOT* be deprecated. It is ISO 8859-1 Latin-1.
It is a common use character.

It also can*NOT* be disambiguated. Sorry, but that horse is
20 years out of the barn, because people have been using it
both ways for years.

There are cases where a newly added character may be
disunified later, if it turns out that the attempted
unification of disparate functions was unsuitable.

You can see examples of that in the Unicode Standard.
For example, the mathematical brackets:

U+27E6 MATHEMATICAL LEFT WHITE SQUARE BRACKET

was added to the standard, to distinguish it from the
earlier encoded:

U+301A LEFT WHITE SQUARE BRACKET

And this, despite the fact that almost all the Unicode
character properties for these are identical. The problem was that
U+301A is generally supported by East Asian fonts that
have inappropriate glyph metrics for use as mathematical
brackets. Accordingly, the two are given distinct East Asian
width properties:

U+27E6 eaw=Narrow
U+301A eaw=Wide

But that doesn't result in the deprecation of U+301A. Far
from it.
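
These values can be checked against the Unicode Character Database.
A minimal sketch using Python's standard unicodedata module (the
helper name describe is an assumption for illustration, and the
results reflect whatever UCD version the interpreter ships with):

```python
import unicodedata


def describe(cp):
    """Return (name, General_Category, East_Asian_Width) for a code point."""
    ch = chr(cp)
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.east_asian_width(ch),
    )


if __name__ == "__main__":
    for cp in (0x27E6, 0x301A):
        name, gc, eaw = describe(cp)
        print(f"U+{cp:04X} {name}: gc={gc} eaw={eaw}")
```

The module reports the short property values, so Narrow appears as
"Na" and Wide as "W".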

The kind of MARTIAN LEFT WIGGLE you are searching for in
the real world can be found in:

U+17D7 KHMER SIGN LEK TOO

In Unicode 3.0 (and 3.1), that was given the General_Category=Po,
in other words, it was assumed to be a punctuation "sign", based
on insufficient information about use of repetition signs in
SE Asian scripts.

In Unicode 3.2 (and since), that has the General_Category=Lm,
in other words, it is treated as a modifier letter, and is
*included* in the definition of identifiers (and words).

That change resulted from feedback from SE Asian script implementers
about the handling of the repetition signs -- not only in Khmer,
but also in Thai and Lao. Cf. U+0E46 THAI CHARACTER MAIYAMOK,
which is the graphological cognate.
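
The current General_Category of both repetition signs can be looked
up the same way. This is only a sketch using Python's standard
unicodedata module; the helper name general_category is an
assumption, and the lookup shows the value in the interpreter's
bundled UCD version, not the historical Unicode 3.0 value.

```python
import unicodedata


def general_category(cp):
    """Return the General_Category short value for a code point."""
    return unicodedata.category(chr(cp))


if __name__ == "__main__":
    # U+17D7 KHMER SIGN LEK TOO and U+0E46 THAI CHARACTER MAIYAMOK
    for cp in (0x17D7, 0x0E46):
        print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: gc={general_category(cp)}")
```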

Now this particular case was under the wire for your concerns,
since these were fixed as of Unicode 3.2. But this kind of
clarification is always possible for newly encoded characters,
despite our best efforts -- particularly as details for more
obscure scripts become harder to nail down, and experts get
fewer and harder to come by.

But expecting that the UTC would approach this particular
problem by *deprecating* the existing U+17D7 KHMER SIGN LEK TOO
and cloning a new one with the corrected properties is
completely unreasonable.
Ain't gonna happen. That would be the tail wagging the dog
again, because some protocol wanting stability and freedom
from worry would be *imposing* a requirement of invalidation
of existing data, respellings, outdating of existing
implementations and dictionaries and collations and indices
and yadda yadda yadda. It would cause massive disruption to
treat a character standard that way.

Fundamentally, I think you are misrepresenting the issue
here.

Instead of:

   The UTC promises never to include gc=Po characters
   in IDN, so if they ever encode a character and give
   it that property, they can never change it to gc=Lo
   (or whatever), which would force it into IDNs and
   screw us up, because we'd end up with punctuation
   in IDNs, and there is an external mandate that
   prohibits that.
   
This should be formulated as:

   The UTC promises never to include the unreasonable
   symbolic and punctuation crap that nobody wants
   in IDNs in the IDN_Permitted property. If a newly
   encoded character gets the gc=Po property, by
   general rule it won't be included in IDN_Permitted.
   If, however, it turns out that a newly encoded
   character actually is a letter of some sort, and
   was mistakenly given the gc=Po property, and *if*
   it is important enough for IDNs that some community
   wants it added, then the UTC will correct its
   General_Category to gc=Lo (or whatever) and include
   it in IDN_Permitted in a future revision of the
   standard, just as it would have if the property
   were correctly identified from the start. Furthermore,
   the UTC will *never* remove a character from
   IDN_Permitted, no matter *what* other property
   corrections might turn out to be warranted for it.
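
The last promise in that formulation amounts to a simple invariant:
each revision of IDN_Permitted is a superset of the one before it.
A hedged sketch of such a check follows. IDN_Permitted is only a
proposed property name in this discussion, and the function names
and sample sets are invented for illustration.

```python
def removed_from_permitted(old, new):
    """Code points present in the old revision but missing from the new one."""
    return set(old) - set(new)


def revision_is_stable(old, new):
    """True if the new revision only adds code points and removes none."""
    return not removed_from_permitted(old, new)


if __name__ == "__main__":
    # Hypothetical revisions: the second one adds U+17D7 after a
    # General_Category correction; nothing is taken away.
    rev_a = {0x0E46}
    rev_b = rev_a | {0x17D7}
    print(revision_is_stable(rev_a, rev_b))  # additions only
    print(revision_is_stable(rev_b, rev_a))  # a removal, which the promise forbids
```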
   

> 
> So I don't see it as a problem if the UTC can accept the
> position that, as long as applications of various sorts are
> dependent on the property list associated with a given
> character, you cannot, in general, change the properties: a
> serious enough mistake means that you need to allocate a new
> code point with a new set of properties.  If that isn't a
> reasonable model, then I think we are in considerable trouble.

See above, and please think it through again.

The UTC has a *lot* of experience in this, and has very,
very detailed policies in place about what can and cannot
change regarding character properties. See, again:

http://www.unicode.org/standard/stability_policy.html#Property_Value

And *every* proposal to disunify an existing character based
on property distinctions gets argued to death in committee,
because any such change is potentially very disruptive.

All property changes (other than provisional properties
in the Unihan database, which aren't vouched for at all)
must undergo UTC scrutiny and approval before they are
rolled out. And fewer and fewer changes are made for any
existing properties, both because there is more implementation
experience with the existing set of properties and because
changes are so potentially costly.

But the UTC cannot willy-nilly commit to making more
properties immutable and guaranteeing deprecation and cloning
of characters to make that so.

--Ken

> 
>      john


