Table issues (Part 2)

Mark Davis mark.davis at
Wed Dec 5 04:17:10 CET 2007

I'll add to what Ken pointed out.

Over time, we've dealt with this issue many times, and have evolved a
technique for dealing with it, one that we have used successfully for many
versions of Unicode.

When we define a new property value as *derived* from some combination
of other properties, we also add a "grandfathering" compatibility set. For
example, we do this for the Unicode definition of identifiers: see

Applying that same technique here, for example, we'd have an
"Other_IDN_Always" property, that would contain all and only those
characters that were {IDN=Always} in a previous version but that wouldn't
otherwise be according to the derivation for the current version of the
other properties. (These grandfathered sets are, of course, mechanically
generated and verified, to prevent errors.) By doing this, we'd guarantee
that all characters that were {IDN=Always} in a previous version are
{IDN=Always} in the current version; we'd do the same for {IDN=Never}, as
well. Because we're dealing with a partition, the derivations would look like:

{IDN=Always} = {derivation from general 5.2 properties} + Other_IDN_Always -
{IDN=Never} = {derivation from general 5.2 properties} - Other_IDN_Never -
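
To make the grandfathering step concrete, here is a minimal sketch in Python.
The function names and the toy code point sets are assumptions for
illustration only; they are not the actual derivation rules or data from any
Unicode version. The sketch covers only the guarantee stated in prose above
(nothing that was Always before stops being Always); it does not model how
the Always and Never sets are kept disjoint.

```python
def grandfathered(previous: set[int], derived_now: set[int]) -> set[int]:
    """Code points that had the value in the previous version but that the
    current derivation alone would no longer produce."""
    return previous - derived_now


def stabilized(derived_now: set[int], other: set[int]) -> set[int]:
    """Current derivation plus the grandfathered compatibility set."""
    return derived_now | other


# Hypothetical toy data: results of the plain derivation in two versions.
always_old = {0x0061, 0x0062, 0x0486}
# Assume U+0486 drops out of the plain derivation after a property change.
always_derived_new = {0x0061, 0x0062, 0x0063}

# Mechanically generate the compatibility set, then apply it.
other_idn_always = grandfathered(always_old, always_derived_new)
always_new = stabilized(always_derived_new, other_idn_always)

# Mechanical verification: no code point ever leaves the set.
assert always_old <= always_new
```

With this toy data, Other_IDN_Always ends up holding just U+0486, the one
code point the new derivation would otherwise have dropped.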


On Dec 4, 2007 4:42 PM, Kenneth Whistler <kenw at> wrote:

> Patrik,
> Some more feedback on draft-faltstrom-idnabis-tables-03.txt.
> Section 4. Codepoints states:
> "The Categories and Rules defined in Section 2 and Section 3 apply to
> all assigned Unicode characters."
> In fact, they apply to *unassigned* code points as well.
> I think the correct formulation would be:
> "The Categories and Rules defined in Section 2 and Section 3 apply to
> all Unicode codepoints, assigned or unassigned."
> [Note: the Unicode Standard systematically uses a space
> in the term "code point", as well as for "code unit",
> "code position", "code value", etc. But given that this
> document uses "codepoint" everywhere, I'm not suggesting
> that be changed. Nobody is going to be confused as to
> what the word means.]
> Section 1. Introduction
> Description of ALWAYS states:
> "Once assigned to this category, a character is never removed
> from it unless it is removed from Unicode."
> The qualification "unless it is removed from Unicode" is
> vacuous. Since Unicode 1.1, no character ever has been
> removed from Unicode, nor will any be -- in part because
> no character will ever be removed from ISO/IEC 10646.
> So this is a little like qualifying the
> definition of ASCII LDH as "{0061..007A, 0030..0039, 002D}
> and no characters will be removed from this definition
> unless they are removed from ASCII."
> So I suggest just removing the vacuous qualification.
> The large paragraph in the middle of page 3 states:
> "It should be suitable for newer revisions of Unicode, as
> long as the Unicode properties on which it is based remain stable."
> This points out a fundamental problem with the categories
> used for this IDN property, in that it depends crucially
> on stability of the Script property. However, unlike
> a number of the Unicode (immutable) properties and
> algorithms (such as normalization) for which there are
> stability guarantees that the Unicode Consortium takes
> very seriously, there is *NO* stability guarantee for
> the Script property. To assume that there is one is
> setting this specification directly on a collision course
> with the UTC at some point in the future.
> Let me illustrate this with a *real* (not hypothetical)
> issue which just came up on the unicode list discussions
> during the last week.
> Egyptologists are considering specifying the use of U+0486
> for the Egyptological yod in transliteration. The
> character has the correct semantics and shape, and
> using it would avoid a potential shaping and positioning
> issue with U+0313 COMBINING COMMA ABOVE. But if they
> start using it, they would be using it in combination
> with *Latin* characters, not Old Church Slavonic (Cyrillic)
> characters. In part to avoid potential font selection
> heuristic issues, they are now requesting that
> U+0486 have its Script property be changed from
> Script=Cyrillic (its current value in Unicode 5.0)
> to Script=Inherited (in Unicode 5.1). This is a perfectly
> reasonable type of change, from the UTC's point of view,
> and parallels other changes made to script properties
> when a character once seen as belonging only to a single
> script comes to have an application in another script,
> in which case it might change to Script=Common or
> Script=Inherited, depending on the case.
> Now consider the implications for the Calculation of
> the Derived Property in Section 3 of
> draft-faltstrom-idnabis-tables-03.txt. By the Unicode 5.0
> property values, U+0486 would be determined to be ALWAYS.
> And in fact, Appendix A has the resulting entry:
> *But*, if the UTC deals with the reasonable request from
> the Egyptologists by changing U+0486 to Script=Inherited
> or Script=Common, then by the Calculation in Section 3,
> U+0486 would end up instead as MAYBE YES.
> ALWAYS ==> MAYBE YES is a prohibited state transition
> for this IDN property.
> That is a very bad situation to be in for this specification.
> Some property change that is perfectly normal and reasonable
> for the UTC would end up resulting in what an IETF
> protocol would be claiming is an absolutely prohibited
> change. We'd either end up with finger-pointing and blame
> going back and forth regarding who is responsible for
> the destabilization, or else end up hacking up emergency
> fixes to the IDNA RFC, so that the calculation of the
> derived property itself could be stabilized against
> future changes to properties it depends on.
> Meanwhile, the constituencies of character users that
> come to the UTC -- like these Egyptologists -- would
> be flabbergasted, flustered, and highly annoyed, if
> they were simply met by intransigence on the UTC's
> part, saying, "Sorry, we can't change that Script
> property for you, because we are absolutely bound by
> a stability guarantee for IDNs. Changing that property
> would destabilize IDNA, and we can't do that."
> They would neither understand nor stand for what would
> seem to them to be a completely arbitrary and
> unmotivated stonewalling on the issue. It's bad enough
> the crap we have to put up with in maintaining stability
> for normalization, for which the UTC is highly motivated
> to maintain stability for many reasons besides its
> use in IETF protocols.
> I really think the whole MAYBE YES category based on
> Script determinations is a trap for this specification.
> It is attempting to be conservative in making commitments
> regarding the use of certain characters in IDN. But
> in using the Script property distinctions the way
> Section 3 currently is doing, all it really is going
> to accomplish is to *guarantee*, in the not too distant
> future, a stability meltdown for the ALWAYS and NEVER
> categories that the specification is trying to define
> and claim stability for.
> Regards,
> --Ken
