What has to be stable? (was Fwd: Comments on IDNAbis tables-03)

Wed Jan 16 22:38:11 CET 2008

Catching up on IDNA.... Comments below on stability.

On Dec 25, 2007 2:39 AM, Patrik Fältström <patrik at frobbit.se> wrote:

> On 17 dec 2007, at 22.39, Mark Davis wrote:
>
>
> What we have said is that the properties we use in the algorithms have to
> be stable.
>

I think this is the crux of the issue.  The goal is for the *IDNA property*
to be stable. Bear with me for a minute -- the following may seem
pedantic, but I am just trying to take this slowly so that I don't
miscommunicate.

It appears that we have a case of affirming the consequent, with the
following argument:

   1. If the base properties are stable, then the IDNA property will be
   stable. [A => B]
   2. The IDNA property must be stable. [B]
   3. Therefore the base properties must be stable. [A]

However, that argument is invalid, and the conclusion isn't true. Take a
parallel case. Suppose we have a room with a fixed number of people in it,
including me and Bill Gates. Someone has a condition that the average salary
be stable, and argues:

   1. If each individual's salary is stable, then the average salary will
   be stable.
   2. The average salary must be stable.
   3. Therefore the individuals' salaries must be stable.

But the argument is invalid. Suppose Bill pays me a billion dollars. Our
individual salaries changed radically, yet the average remains the same.
(Just to be clear on this, Michael, in the interests of progressing IDNA, I
am willing to try this experiment out ...)
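
To make the counterexample concrete, here is a minimal sketch. The names and
amounts are made up purely for illustration; only the shape of the argument
matters.

```python
# Hypothetical figures, chosen only to illustrate the counterexample.
before = {"Mark": 100_000, "Bill": 5_000_000_000, "Other": 100_000}

# Bill pays Mark a billion dollars: two individual values change.
after = dict(before)
after["Bill"] -= 1_000_000_000
after["Mark"] += 1_000_000_000


def average(salaries):
    """Arithmetic mean of the values in a name -> salary mapping."""
    return sum(salaries.values()) / len(salaries)


# A (every individual value unchanged) is false, yet B (average unchanged) holds.
individuals_stable = all(before[name] == after[name] for name in before)
average_stable = average(before) == average(after)
print(individuals_stable, average_stable)  # prints: False True
```

So B can hold while A fails, which is exactly why A cannot be inferred from
"A => B" and "B".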

Let's get back to the overall goal, which is stability of the IDNA property.
Many changes in base properties have no effect. For example, if a character
changes the script property, but from a script that's included to another
script that's included, it makes no difference in the outcome of the rules.
There is the possibility of a base property changing, in a new version of
Unicode, so as to affect the results according to your current formulation.
But as pointed out, there is an easy and reliable way to modify your
formulation so as to be perfectly stable.

Define the following contributing properties, initially empty. Make the
rules for ALWAYS and NEVER always include them (and exclude the converse
ones).

   - ALWAYS_INCLUSIONS
   - NEVER_INCLUSIONS

With each version X of Unicode, update them so as to include any characters
that were in ALWAYS or NEVER according to version X-1, but wouldn't be
according to X without these properties.
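
A minimal sketch of that update step, treating every property as a plain set
of code points. The function name update_inclusions and the toy data are
hypothetical; rules_always and rules_never stand for whatever the table rules
yield from the base properties of version X alone.

```python
def update_inclusions(prev_always, prev_never, rules_always, rules_never,
                      always_inclusions, never_inclusions):
    """Return (ALWAYS, NEVER, ALWAYS_INCLUSIONS, NEVER_INCLUSIONS) for version X.

    prev_always / prev_never: final results for version X-1.
    rules_always / rules_never: raw rule results for version X.
    """
    # Grandfather anything that would otherwise drop out of its bucket.
    always_inclusions = always_inclusions | (prev_always - rules_always)
    never_inclusions = never_inclusions | (prev_never - rules_never)
    # Each rule includes its own inclusions and excludes the converse ones.
    always = (rules_always | always_inclusions) - never_inclusions
    never = (rules_never | never_inclusions) - always_inclusions
    return always, never, always_inclusions, never_inclusions


# Toy data: "b" loses its ALWAYS status under the new base properties,
# "x" would move from NEVER to ALWAYS, and "c" is newly covered by the rules.
prev_always, prev_never = {"a", "b"}, {"x"}
rules_always, rules_never = {"a", "c", "x"}, set()

always, never, always_incl, never_incl = update_inclusions(
    prev_always, prev_never, rules_always, rules_never, set(), set())

print(sorted(always), sorted(never))  # prints: ['a', 'b', 'c'] ['x']
```

Under these assumptions nothing ever leaves ALWAYS or NEVER, whatever happens
to the base properties; the two inclusion sets only grow by the handful of
characters whose base properties changed.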

The updates could best be done by the Unicode Consortium as a separate
so-called contributing property, but it is perfectly possible for any other
organization to do it independently if you're nervous about that. As
discussed in earlier email, we've been doing this kind of thing for years
with the current Unicode identifiers, and there are a very small number of
characters affected. Note some work needs to be done with each version of
Unicode anyway, since as new scripts come in, they have to be put in one
bucket or another.

...

Given that, let me take a shot at answering your questions.

> The questions now I think are:
> a. What differences exist between the algorithms in the table document and
> Unicode "stable" properties?
>
Some of the properties are stable, like isNFKC, and some are not: general
category, script,...

> b. What differences exists between IDNA2003 and the tables document?
>

Well, the easiest way to play with that is to use the utilities I mentioned.

Go to http://unicode.org/cldr/utility/unicodeset.jsp and in Input A, put:

[:idna=output:]

and in Input B, put

[[:L:][:Nd:][:Mn:][:Mc:]
&[:isCaseFolded:]
-[:NFKC_QuickCheck=NO:]
-[:Default_Ignorable_Code_Point:]
 [\u00B7\u05F3\u05F4\u3007\u30FB]
 [a-zA-Z0-9\-]
&[:age=3.2:]]

Then hit Compare.

This doesn't include the script, blocks and exceptions, but will give you an
idea of the difference. Note that I put in the condition that the Age (that
is, the Unicode version) be 3.2, so that we'd be comparing apples to apples.

If you click on one of the links, you see the differences in more detail. In
this case, for example, there are 3,032 code points in IDNA2003 (output) that
are not in the formulation.
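
For anyone who would rather poke at this offline, here is a rough sketch of
the first Input B expression using only the Python standard library. It is an
approximation, not a reimplementation of the utility: the function name
in_input_b is hypothetical, Default_Ignorable_Code_Point is not exposed by the
standard library and is left out, str.casefold() uses the interpreter's own
Unicode version rather than 3.2, and an NFKC round trip stands in for
NFKC_QuickCheck=NO. It will therefore not reproduce the counts quoted here.

```python
import unicodedata

# Unicode 3.2 view of the character database shipped with Python.
ucd32 = unicodedata.ucd_3_2_0

EXCEPTIONS = set("\u00B7\u05F3\u05F4\u3007\u30FB")
ASCII_LDH = set("abcdefghijklmnopqrstuvwxyz"
                "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-")


def in_input_b(ch):
    """Approximate membership test, applying the set operations left to right."""
    cat = ucd32.category(ch)
    # [[:L:][:Nd:][:Mn:][:Mc:]
    member = cat.startswith("L") or cat in ("Nd", "Mn", "Mc")
    # &[:isCaseFolded:]  (approximated with the current Unicode version)
    member = member and ch.casefold() == ch
    # -[:NFKC_QuickCheck=NO:]  (approximated by an NFKC round trip)
    member = member and ucd32.normalize("NFKC", ch) == ch
    # -[:Default_Ignorable_Code_Point:] is omitted: no stdlib access to it.
    # Then add the exception characters and [a-zA-Z0-9\-].
    member = member or ch in EXCEPTIONS or ch in ASCII_LDH
    # &[:age=3.2:]]  (unassigned in 3.2 shows up as category Cn)
    return member and cat != "Cn"


for ch in ["a", "\u00C9", "\uFF41", "\u00B7", "\u0221"]:
    print("U+%04X" % ord(ch), in_input_b(ch))
```

Here U+00C9 is rejected because it is not case-folded, U+FF41 because NFKC
changes it, and U+0221 because it was not yet assigned in Unicode 3.2, while
U+00B7 gets in through the exception list.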

If you are feeling more adventurous, you can exclude scripts/blocks
according to the draft Unicode 5.1 recommendations (see Ken's emails, plus
http://www.unicode.org/reports/tr31/tr31-8.html#Specific_Character_Adjustments
).

That expression for Input B would be:

[[:L:][:Nd:][:Mn:][:Mc:]
 &[:isCaseFolded:]
 -[:NFKC_QuickCheck=NO:]
 -[:Default_Ignorable_Code_Point:]
 -[:script=Bugi:]
 -[:script=Buhd:]
 -[:script=Cari:]
 -[:script=Copt:]
 -[:script=Cprt:]
 -[:script=Dsrt:]
 -[:script=Glag:]
 -[:script=Goth:]
 -[:script=Hano:]
 -[:script=Ital:]
 -[:script=Khar:]
 -[:script=Linb:]
 -[:script=Lyci:]
 -[:script=Lydi:]
 -[:script=Ogam:]
 -[:script=Osma:]
 -[:script=Phag:]
 -[:script=Phnx:]
 -[:script=Rjng:]
 -[:script=Runr:]
 -[:script=Shaw:]
 -[:script=Sund:]
 -[:script=Sylo:]
 -[:script=Syrc:]
 -[:script=Tagb:]
 -[:script=Tglg:]
 -[:script=Ugar:]
 -[:script=Xpeo:]
 -[:script=Xsux:]
 -[:block=Combining_Diacritical_Marks_for_Symbols:]
 -[:block=Musical_Symbols:]
 -[:block=Ancient_Greek_Musical_Notation:]
  [\u00B7\u05F3\u05F4\u3007\u30FB]
  [a-zA-Z0-9\-]
 &[:age=3.2:]]

It shows that there would be 3,419 Code Points in IDNA2003 output that are
not in that formulation. Click on the link for details. I did remove the
Phaistos Disk block, since that is not yet in Unicode and thus not
recognized by the utilities.

Hope that helps...

> c. What differences will exist if tables document algorithms are only
> based on stable properties (where some of them are only stable from Unicode
> 5.0)
>

The general category is crucial. However, as stated above, we don't need
this for stability anyway.

>    Patrik
>
>

-- 
Mark