Table-building

Kenneth Whistler kenw at sybase.com
Thu Feb 1 20:32:18 CET 2007


Harald responded to Mark:

> --On 31. januar 2007 18:01 -0800 Mark Davis <mark.davis at icu-project.org> 
> wrote:
> 
> > I think this does expose an issue that needs discussion. There are two
> > types of stability that could be guaranteed.
> >
> > 1. Once a character is encoded, the property value (true or false) MUST
> > never change.
> > 2. Once a character is given the property value of true, its value MUST
> > never change to false. An encoded character SHOULD NOT change from false
> > to true, unless a strong case can be made for it.
> >
> > For both of them we have the key requirement for stability, that once a
> > string qualifies as being valid, it stays valid forever.
> >
> > However, #1 might be a bit too restrictive. If we currently say that
> > character X has the value false, there is an issue if for some reason
> > we find out that that character is needed for some orthography of a
> > language in, say, the Congo. People who think that #2 is not sufficient
> > might present some scenarios where it could cause a problem (I can't
> > think of any myself).
> 
> Well put.
> 
> The discussion has exposed literally dozens of cases where the answer to 
> "should the property be true or false" is "We don't know yet".

No, not "literally dozens of cases". The discussion has called
into question, to date:

Catalan:

000B7 gc=Po sc=Zyyy MIDDLE DOT

Hebrew:

005F3 gc=Po sc=Hebr HEBREW PUNCTUATION GERESH
005F4 gc=Po sc=Hebr HEBREW PUNCTUATION GERSHAYIM

Yiddish:

005F0 gc=Lo sc=Hebr HEBREW LIGATURE YIDDISH DOUBLE VAV
005F1 gc=Lo sc=Hebr HEBREW LIGATURE YIDDISH VAV YOD
005F2 gc=Lo sc=Hebr HEBREW LIGATURE YIDDISH DOUBLE YOD

And generically:

0200C gc=Cf sc=Qaai ZERO WIDTH NON-JOINER
0200D gc=Cf sc=Qaai ZERO WIDTH JOINER

although I think there is little doubt that we have to include
the latter two, with appropriate contextual restrictions, unless
we want to cause uproars for Persian and for most Indic script
communities.

And then Soobok has raised questions about compatibility jamos,
but that is an issue for the input end, and I don't think
it makes a difference for the inclusion table content.

All the rest has been general handwringing without any focus
on particular characters.

> 
> It is clearly stupid of any registry to allow the registration of such 
> characters, given that the property MAY end up false.

That is a misunderstanding of Mark's statement.

Before any implementation of IDNA nameprep could go out, a
determination has to be made for all Unicode characters whether
they are in an inclusion table or not. If the property *at that
point of first implementation* is True, it will NEVER NEVER NEVER
end up False. That is the case, whether you choose option #1 or
option #2 above.

The distinction Mark is making is that under option #1 no
assigned character for which the property is False could ever
be added to the table, whereas under option #2 it could later
be determined that an already encoded character not in the
table should be added. And I agree with Mark that option #1
is perhaps too inflexible.
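
To make the difference between the two options concrete, here is a
minimal sketch, for illustration only. It assumes an inclusion table
is modeled as a mapping from code point to a boolean property value;
the function name `violations` and the sample values are hypothetical
and are not taken from any draft.

```python
def violations(old, new, option):
    """Return the code points whose change from `old` to `new` breaks
    the stability guarantee of option 1 or option 2.

    `old` and `new` map already encoded code points to True/False.
    A code point missing from `new` is treated as False (assumption).
    """
    if option not in (1, 2):
        raise ValueError("option must be 1 or 2")
    bad = []
    for cp, was in old.items():
        now = new.get(cp, False)
        if now == was:
            continue
        # True -> False is forbidden under both options.
        # False -> True is forbidden only under option 1.
        if was or option == 1:
            bad.append(cp)
    return sorted(bad)


# Hypothetical values: U+0061 starts out True, U+00B7 starts out False.
old_table = {0x0061: True, 0x00B7: False}
added_later = {0x0061: True, 0x00B7: True}     # False -> True
removed_later = {0x0061: False, 0x00B7: False}  # True -> False
```

With these sample tables, `violations(old_table, added_later, 1)`
reports U+00B7 while `violations(old_table, added_later, 2)` reports
nothing, and `removed_later` is rejected under either option. Under
option #2 the False-to-True change is still a SHOULD NOT; the sketch
only checks the hard MUST constraints.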

> It is equally stupid of any application developer to deny the attempt to 
> lookup such characters, given that the property MAY end up true.

That's debatable. I don't consider it stupid, in any case.
But I understand your point.

The right thing to do, in my opinion, is to drive the set of
such characters to zero from the start, so you don't have this
issue in the specification.

Basically, you are trying to bombproof the specification from
people coming in later insisting that such-and-such a character
be made available in IDN when it wasn't before. However, that
was also what the IDNA2003 spec was attempting to do with
what turned out to be its overly permissive approach.

But rather than guarantee yourself a perpetual problem of a
pot full of "pending" status characters, which have to be
denied registration until somebody (who? how?) decides their
status, but which also have to be passed by a resolver
regardless of their status, you are far better off just giving
such cases IDN_Permitted=True status and being done with it.
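
The asymmetry that a "pending" state creates can be sketched as
follows. This is an illustration under stated assumptions, not a
proposed implementation: the `Status` enum, the function names, and
the sample table are all hypothetical, and code points absent from
the table are assumed to be disallowed.

```python
from enum import Enum


class Status(Enum):
    PERMITTED = "permitted"
    DISALLOWED = "disallowed"
    PENDING = "pending"  # the extra state of a tri-state table


def status_of(ch, table):
    # Assumption: anything not listed in the table is disallowed.
    return table.get(ord(ch), Status.DISALLOWED)


def can_register(label, table):
    # A registry must refuse any character whose status is undecided.
    return all(status_of(ch, table) is Status.PERMITTED for ch in label)


def can_resolve(label, table):
    # A resolver must pass pending characters, since they may
    # later become permitted.
    return all(status_of(ch, table) is not Status.DISALLOWED for ch in label)


# Hypothetical table: "a" is permitted, U+00B7 is still undecided.
tri_state = {ord("a"): Status.PERMITTED, 0x00B7: Status.PENDING}
# The same table with the pending case simply marked as permitted.
two_state = {ord("a"): Status.PERMITTED, 0x00B7: Status.PERMITTED}
```

With `tri_state`, the label `"a\u00b7a"` cannot be registered but
must still be resolved, so registries and resolvers disagree about
the same string. With `two_state` both checks agree, which is the
simplification argued for above.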
 
> I think that's what the "tri-state" tables are trying to express.

At this point, rather than building the argumentation about
the eventual content of the table *into* the table and the
specification, I believe it is far better to simply do the
best possible job of constructing the table, based on the
justifiable kinds of principles already articulated here,
and then let folks give feedback (and complain as necessary)
during the review, until we come up with an appropriate
list that can be used as the basis for the specification.

And the requirement for letting ZWJ and ZWNJ in, under constrained
contexts, is precisely an attempt to protect this process against
the kind of after-the-fact criticism that you seem to be
worried about.

--Ken




