Table-building

Fri Feb 2 22:31:28 CET 2007

> Ken,
> 
> do I understand your text below as saying

Well, I can't tell what you understand it as saying, other
than to conclude from your question that you think
it means:

> 
> "given sufficient thrust, characters can migrate from IDN_never to 
> IDN_Permitted"?

But that is not what it says, nor is that what I intended
for it to say.

Among other things, *I* have not been using the "IDN_never"
formulation at all. That is something that Mark introduced
here in an effort to reconcile the perceived need by some
to have a class of characters that could be guaranteed to
*never* be in IDN, no matter what.

Furthermore, even if another property, IDN_Never, were
introduced and maintained, characters don't "migrate" from
one property to another. Characters may have changes in
their property values, but that is a different thing.

So for argument's sake, let's assume that the UTC decides
to maintain both an IDN_Permitted and an IDN_Never property.
Those would be *properties*, not the *values* of the
properties. They would be binary properties, which means
their values would be "True" or "False".

And schematically, you would then partition the Unicode
repertoire as follows:

Characters        xxxxxxxxxxxxxxxx xxxxxxx xxxxxxxxxxxx
IDN_Permitted     TTTTTTTTTTTTTTTT FFFFFFF FFFFFFFFFFFF
IDN_Never         FFFFFFFFFFFFFFFF FFFFFFF TTTTTTTTTTTT

By construction, no character with IDN_Permitted=True
could also be IDN_Never=True.

By stability guarantees that would be provided for
the algorithm, no character with the property value
IDN_Permitted=True would ever turn to IDN_Permitted=False.

By stability guarantees that would be provided for
the algorithm, no character with the property value
IDN_Never=True would ever turn to IDN_Never=False.

The only wiggle room is in the middle collection, where
people "haven't made up their minds", and where, based
on subsequent input, presumably, a determination could be
made to either set IDN_Permitted=True OR to set
IDN_Never=True (but of course not both).
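
To make those three constraints concrete, here is a minimal
sketch in Python. The function names and the (IDN_Permitted,
IDN_Never) tuple representation are assumptions made purely
for illustration; nothing like this exists in any
specification.

```python
def is_valid(permitted: bool, never: bool) -> bool:
    # By construction, the two properties are never both True.
    return not (permitted and never)


def is_allowed_change(old, new) -> bool:
    # old and new are hypothetical (IDN_Permitted, IDN_Never) pairs
    # for the same character in two successive versions.
    old_permitted, old_never = old
    new_permitted, new_never = new
    if not is_valid(new_permitted, new_never):
        return False
    # Stability: IDN_Permitted=True never turns back to False.
    if old_permitted and not new_permitted:
        return False
    # Stability: IDN_Never=True never turns back to False.
    if old_never and not new_never:
        return False
    return True
```

Under these rules the only characters whose values can still
change are those with both properties False.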

O.k., is that clear so far?

Now, conceptually, this scheme of two binary properties could
then be mapped to what you and Patrik are advocating as
a tri-state table, namely:

Characters        xxxxxxxxxxxxx xxxxxxx   xxxxxxxxxxxx
IDN_Permitted     TTTTTTTTTTTTT FFFFFFF   FFFFFFFFFFFF
IDN_Never         FFFFFFFFFFFFF FFFFFFF   TTTTTTTTTTTT
Tri-state         [    yes    ] [pending] [   no     ]

where "pending" in this instance means characters for which
we still have not made an irrevocable choice to toss them
into either the "yes" or the "no" piles.

And if you want to take it further, to include John's
specification of "unassigned" as constituting the fourth
state, then you get the following scheme, where "o" refers
to unassigned code points and "x" refers to assigned
code points:

Code points    xxxxxxxxxxxxx xxxxxxx   xxxxxxxxxxxx oooooooooooo
IDN_Permitted  TTTTTTTTTTTTT FFFFFFF   FFFFFFFFFFFF FFFFFFFFFFFF
IDN_Never      FFFFFFFFFFFFF FFFFFFF   TTTTTTTTTTTT FFFFFFFFFFFF
Quad-state     [    yes    ] [pending] [   no     ] [unassigned]

A future version of Unicode turns some o's into x's,
moving them from "unassigned" to "pending". Then a determination
needs to be made whether to set IDN_Permitted=True
or IDN_Never=True, or neither. If neither, then the
character stays in the "pending" category of the quad-state,
unless and until somebody decides later on that either
IDN_Permitted or IDN_Never needs to be changed to True.
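
The mapping from the two binary properties (plus assignment
status) onto the quad-state labels can be sketched as below.
This is only an illustration in Python; the function name
quad_state and its boolean arguments are hypothetical.

```python
def quad_state(assigned: bool, permitted: bool, never: bool) -> str:
    # The two properties are mutually exclusive by construction.
    if permitted and never:
        raise ValueError("IDN_Permitted and IDN_Never cannot both be True")
    # Unassigned code points have both properties False in the table.
    if not assigned:
        return "unassigned"
    if permitted:
        return "yes"
    if never:
        return "no"
    # Assigned, but no irrevocable choice has been made yet.
    return "pending"
```

A code point that becomes assigned in a later version simply
moves from quad_state(False, False, False) to
quad_state(True, False, False), that is, from "unassigned"
to "pending", until one of the two properties is set.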

O.k., I trust that is a reasonably accurate rendition of
what you and Patrik are trying to get at with the
tri-state table discussion.

If we are still on the same page to this point, then I
will move on to my claims about what *should* be done.

I consider the whole discussion of a tri-state table to
be inadvisable as a way to either document this issue
in the specification or to roll out implementations.
If it takes *this* small group weeks to sort this through
to get past misunderstandings about intents, there is *zero*
chance that a specification of this sort will not lead
to problems and confusion in implementation.

If you *must* have a formal guarantee about some set of
characters that will never be in IDNs, then defining a
second binary property, IDN_Never, is the way to go.
That gives you another unambiguous partition of Unicode
space into clear values. (And by "unambiguous" here --
since in this group, even the use of "unambiguous" seems
to be ambiguous -- I mean each code point will
be associated either with a True value or with
a False value, instead of having some property for
which the values could be: True, False, or Maybe.)

Such a scheme is doable, although as Mark says, it implies
more work, because then you have to spend time defining
the second set (IDN_Never=True) as well as the immediately
important set (IDN_Permitted=True). But if the need is
there simply to provide the required comfort level for
the specification, then Mark and I could draft a set
of rules for IDN_Never=True, as well as for
IDN_Permitted=True.

But my contention is that that is really *not* necessary,
and that all this specification needs is:

Code points    xxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxx ooooooo
IDN_Permitted  TTTTTTTTTTTTT FFFFFFFFFFFFFFFFFFF FFFFFFF

In future versions of Unicode, some o's will become x's.
And when they do, some of those x's will get
IDN_Permitted=True.

It *may* be the case, given sufficient information, that
at the same time some x's that were already x's will
also get IDN_Permitted=True.

If you want to conceive of that as being some "pending"
going to "yes", then that would be fine.

It really comes down, I guess, to a matter of trust.

If the IETF doesn't trust the UTC not to suddenly go
totally crazy and insist that, for Unicode 6.0, 500 existing
math symbols need to become IDN_Permitted=True, despite
the fact that they are disallowed for Unicode identifiers
in general, well then maybe we need to invent and maintain
an IDN_Never property with written guarantees about
permissible and non-permissible property value changes 
for it.

If the IETF *does* trust the UTC to be judicious in
property changes, and to limit the potentially affected
characters to the moral equivalent of newly added
KHMER SIGN LEK TOO cases in less-well-documented
scripts, then we can keep the specification
much simpler to understand and to implement.

So which is it? Trust and a simpler specification and
simpler implementations more likely to succeed?

Or no trust, and a more complex specification and
more complex implementations that pile up more strikes
against the specification?

--Ken

>                     Harald
> 
> --On 1. februar 2007 17:28 -0800 Kenneth Whistler <kenw at sybase.com> wrote:
> 
> > This should be formulated as:
> >
> >    The UTC promises never to include the unreasonable
> >    symbolic and punctuation crap that nobody wants
> >    in IDNs in the IDN_Permitted property. If a newly
> >    encoded character gets the gc=Po property, by
> >    general rule it won't be included in IDN_Permitted.
> >    If, however, it turns out that a newly encoded
> >    character actually is a letter of some sort, and
> >    was mistakenly given the gc=Po property, and *if*
> >    it is important enough for IDN's that some community
> >    wants it added, then the UTC will correct its
> >    General_Category to gc=Lo (or whatever) and include
> >    it in IDN_Permitted in a future revision of the
> >    standard, just as it would have if the property
> >    were correctly identified from the start. Furthermore,
> >    the UTC will *never* remove a character from
> >    IDN_Permitted, no matter *what* other property
> >    corrections might turn out to be warranted for it.