New version, draft-faltstrom-idnabis-tables-02.txt, available

Kenneth Whistler kenw at sybase.com
Wed Jun 20 23:30:52 CEST 2007


Patrik responded to Mark:

> On 19 jun 2007, at 21.35, Mark Davis wrote:
> 
> > If *that* notion of stability is all that is being talked about,  
> > then it is
> > very easy, and we have done it with a number of Unicode properties.  
> > Define
> > the following:
> :
> > Each script other than the archaic ones is no more problematic  
> > overall than
> > the Latin, Greek, and Cyrillic you have already included. Thus if you
> > include Latin, Greek, and Cyrillic, you should include them:
> 
> You can find this out easier than I...
> 
> If we use the rules we have in the document today, what codepoints  
> would have moved from ALWAYS to NEVER and vice versa between 3.2 and  
> 5.0?

The short answer to that is ZERO code points. That's right, none.

The details of the impact of property changes between
Unicode 3.2 and Unicode 5.0 differ, depending on whether the
rules in the document are construed to include the Rule H
"stable script" rule which is in contention, as I explain
below.

> I do not have all the (historical) tables available...

I do, of course. And so what follows is the long answer
to your question -- the detailed history of what
has changed between Unicode 3.2 and Unicode 5.0 which could
have changed the derivation of the ALWAYS or NEVER status for
any existing characters.

In each case, I first state the relevant change in the character
properties. Next I state what the impact would have been based
on "the document today" (i.e., "In the existing draft...") -- and
for that, the terms ALWAYS, MAYBE YES, MAYBE NOT, and NEVER
are used as in the document. 

Finally I state what the impact would have been using the derivation 
that Mark and I have been offering ("In my derivation..."). For
that the term ALWAYS is equivalent to IDN_Permitted=True in
the data files, the term NEVER is equivalent to IDN_Never=True
in the data files, and the term MAYBE means neither is True.
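
As a rough illustration of that equivalence, the three terms can be
read off the two properties as sketched below. The function name
status() is hypothetical, and treating the two properties as mutually
exclusive is an assumption of this sketch, not something the data
files are stated to enforce.

```python
def status(idn_permitted: bool, idn_never: bool) -> str:
    """Map the two boolean properties onto ALWAYS / NEVER / MAYBE."""
    # Assumption of this sketch: a code point is never both permitted and never.
    if idn_permitted and idn_never:
        raise ValueError("IDN_Permitted and IDN_Never are both True")
    if idn_permitted:
        return "ALWAYS"
    if idn_never:
        return "NEVER"
    # Neither property is True.
    return "MAYBE"


print(status(True, False))   # ALWAYS
print(status(False, True))   # NEVER
print(status(False, False))  # MAYBE
```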

Unicode 3.2.0 --> 4.0.0

1. 12 modifier letters (02B9..02BA, 02C6..02CF) were changed 
   from gc=Sk to gc=Lm.

   In the existing draft, that would move them from "MAYBE NOT"
   to "MAYBE YES".
   
   In my derivation that would move them from MAYBE to ALWAYS
   (i.e. IDN_Permitted=True).
   
2. The two Hangul filler characters 115F..1160 were made
   Default_Ignorable_Code_Point.
   
   In the existing draft, that would move them from "MAYBE YES"
   to "MAYBE NOT".
   
   In my derivation, the Hangul fillers are an allowed part
   of the Hangul script, and that change has no effect.
   They start as ALWAYS and stay ALWAYS (i.e. IDN_Permitted=True).
   
Unicode 4.0.0 --> 4.0.1

1. There were a small number of script reassignments, with Script=Common
   characters going to specific scripts, and with
   Script=Katakana_Or_Hiragana added.
   
   In the existing draft, that would have moved various characters
   between "MAYBE YES" and "MAYBE NOT" status.
   
   In my derivation, no impact.
   
Unicode 4.0.1 --> 4.1.0

1. 9 Ethiopic digits changed gc=Nd --> gc=No.

   In the existing draft, that would move them from "MAYBE YES"
   to "MAYBE NOT".
   
   In my derivation, that would by rule move them from ALWAYS
   to MAYBE (i.e., IDN_Permitted=False and IDN_Never=False).
   
2. Script value changes. Script=Katakana_Or_Hiragana was removed.
   Several Coptic letters in the Greek or Coptic block went
   from Script=Greek to Script=Coptic (because the Coptic script
   as a whole was added to the standard).
   
   In the existing draft, that would have moved various characters
   between "MAYBE YES" and "MAYBE NOT", and for the Coptic
   letters from "ALWAYS" to "MAYBE YES" or from "NEVER" to
   "MAYBE NOT".
   
   In my derivation, no impact.
   
Unicode 4.1.0 --> 5.0.0

1. U+10341 GOTHIC LETTER NINETY, gc=Lo --> gc=Nl

   In the existing draft, "MAYBE YES" --> "MAYBE NOT"
   
   In my derivation, no impact. (archaic script exclusion)
   
2. U+2132 TURNED CAPITAL F, gc=So --> gc=Lu (and case-paired with new
   lowercase form) and changed from Script=Common to Script=Latin.
   
   In the existing draft, "MAYBE NOT" --> "NEVER"
   
   In my derivation, no impact.
   
That's it. Everything. All other property changes between Unicode 3.2.0
and Unicode 5.0.0 would have had no impact on the table derivation,
whether following the current draft rules or following the rules
that Mark and I have been advocating for the derivation of
the IDN_Permitted and IDN_Never properties.

Note first that while various characters would have changed
status (although this is a mere handful of characters even
at that) there are no instances at all of a character that
would have been ALWAYS before and changed to NEVER or
that would have been NEVER before and changed to ALWAYS by
the rules, even taking into account the funny determination
of "stable scripts" for Rule H.

Note secondly that even for the handful of characters, the changes
of status according to the current draft rules are considerably
more extensive (mostly because of the Rule H impact) than if
the propert(ies) are derived the way Mark and I have been
advocating. 

If using the IDN_Permitted and IDN_Never model
and the derivation rules for those, the *total* impact,
if starting from Unicode 3.2.0 and moving forward to
Unicode 5.0.0, would be:

12 modifier letters (02B9..02BA, 02C6..02CF) become ALWAYS 
   (i.e. IDN_Permitted=True) and stay there. (They started out
   as MAYBE.)
   
9 Ethiopic digits would have started out ALWAYS (i.e. IDN_Permitted=True),
   and become MAYBE (i.e. IDN_Permitted=False, but also IDN_Never=False).
   
That's it. Nothing moves from ALWAYS to NEVER, and nothing moves
from NEVER to ALWAYS, and only two small, specific sets of characters
would have seen any change at all. That's why Mark can say:

  "If *that* notion of stability is all that is being talked about,  
   then it is very easy, ..."
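
The totals above can be double-checked with a short tally. This is
only a sketch: the CHANGES table and the tally() helper are
hypothetical names, the modifier letter ranges are the ones given
above, and the Ethiopic digit range 1369..1371 (ETHIOPIC DIGIT ONE
through NINE) is filled in from the Unicode code charts rather than
from the text of this note.

```python
from collections import Counter

# (first, last, old_status, new_status); ranges are inclusive.
CHANGES = [
    (0x02B9, 0x02BA, "MAYBE", "ALWAYS"),   # modifier letters
    (0x02C6, 0x02CF, "MAYBE", "ALWAYS"),   # modifier letters
    (0x1369, 0x1371, "ALWAYS", "MAYBE"),   # Ethiopic digits (range assumed from the code charts)
]


def tally(changes):
    """Count code points per (old_status, new_status) transition."""
    counts = Counter()
    for first, last, old, new in changes:
        counts[(old, new)] += last - first + 1
    return counts


counts = tally(CHANGES)
print(counts[("MAYBE", "ALWAYS")])   # 12
print(counts[("ALWAYS", "MAYBE")])   # 9
print(counts[("ALWAYS", "NEVER")])   # 0
print(counts[("NEVER", "ALWAYS")])   # 0
```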
   
Incidentally, the UTC would go even further in guaranteeing
stability than that. I think there is consensus in the UTC that
while a MAYBE --> ALWAYS change (i.e. adding a new character
to the permitted inclusion set) is an o.k. thing, the
opposite transition, from ALWAYS --> MAYBE is *NOT* o.k.
Disallowing that is a stronger stability guarantee than
merely determining that you would disallow an ALWAYS --> NEVER
change.

The way the UTC would do that is by treating the ALWAYS and
NEVER categories like one-way traps. Characters could go into
them, but once there, they can't crawl back out. And the
mechanism for that is called "grandfathering" -- which I'll
explain in a separate note.

Looking at the above two sets of characters which are the
only ones that would have been impacted at all by
Unicode 3.2.0 --> 5.0.0 character property changes, insofar
as we are talking about this IDNA inclusion table generation,
the change for the 12 modifier letters would be fine.
But the change for the 9 Ethiopic digits would have been
*NOT* o.k. And it would have been handled by the grandfathering
mechanism, to ensure that a starting status of ALWAYS stayed
that way permanently.
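
The one-way trap behavior might be sketched as follows. This is a
guess at the shape of the rule only, since the grandfathering
mechanism itself is described in a separate note; the function name
grandfathered() and the idea of comparing a previously published
status against a newly derived one are assumptions of this sketch.

```python
TRAPS = {"ALWAYS", "NEVER"}


def grandfathered(previous: str, derived: str) -> str:
    """Return the status to publish for a code point.

    previous: status in the last published table
    derived:  status the derivation rules give for the new Unicode version
    """
    # Once a code point is in a trap category, it stays there.
    if previous in TRAPS:
        return previous
    # MAYBE is free to move into either trap, or to stay MAYBE.
    return derived


print(grandfathered("MAYBE", "ALWAYS"))   # ALWAYS (modifier letters: fine)
print(grandfathered("ALWAYS", "MAYBE"))   # ALWAYS (Ethiopic digits: held in place)
```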

> I think this might be interesting to know in this discussion, but  
> also I guess I should explain why we do not expect any further such  
> change in the same scripts when moving forward.

What I have presented above is, I believe, an accurate representation
of the impact of the history of Unicode character property
changes between Unicode 3.2.0 (2002) and Unicode 5.0.0 (2006).

I think it should be taken as a reasonable demonstration that
there is little likelihood of character property changes
coming from the UTC moving forward from Unicode 5.0.0 that
would destabilize an IDNA inclusion table definition in
the ways that folks here are worrying about.

--Ken




