What has to be stable? (was Fwd: Comments on IDNAbis tables-03)

Tue Dec 25 11:39:14 CET 2007

On 17 dec 2007, at 22.39, Mark Davis wrote:

> Patrik, I'm afraid that somehow we have a miscommunication.

Possibly. Or even "most certainly" :-)

> We haven't ever
> been saying that all the properties that you've considered in
> tables-xx.txt are stable.

What we have said is that the properties we use in the algorithms have  
to be stable.

The rubber has not hit the road yet, I am aware of that, and that is  
why I want this meta discussion now. Before going into the details of  
the tables document.

> What we
> *have* been saying is that once the IETF comes up with a set of  
> rules that
> define an IDN property, we can and will commit to providing the  
> mechanism to
> stabilize that IDN property. Nobody has been saying that all of the
> underlying properties will themselves be stable.

Correct, but the question is also (for the IETF, as this is an IETF  
document) whether the property value (ALWAYS, MAYBE YES, MAYBE NO,  
NEVER etc) is to be provided as a stable property by the Unicode  
Consortium, or whether some underlying data is provided by Unicode  
Consortium (and possibly others) and then the property value is  
defined by an IETF defined algorithm/process/table.

At the moment, we have the latter process. That is, the Unicode  
Consortium provides something stable enough that the IETF-defined  
algorithm in the tables document gives a property value that is  
stable according to the definition: basically, "Never change the  
property value of a codepoint that has value ALWAYS or NEVER".
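
A minimal sketch of that stability rule as a check between two versions
of the derived table. The function name, the dictionary representation
and the sample values are assumptions made for illustration; they are
not taken from the tables document.

```python
from typing import Dict, List, Optional, Tuple

# Values that, once assigned, must never change (per the rule quoted above).
STABLE_VALUES = {"ALWAYS", "NEVER"}


def stability_violations(
    old: Dict[int, str], new: Dict[int, str]
) -> List[Tuple[int, str, Optional[str]]]:
    """Return (codepoint, old_value, new_value) for every forbidden change.

    A change is forbidden when the old value was ALWAYS or NEVER and the
    new table assigns anything else (including no value at all).
    """
    violations = []
    for cp, old_value in sorted(old.items()):
        new_value = new.get(cp)
        if old_value in STABLE_VALUES and new_value != old_value:
            violations.append((cp, old_value, new_value))
    return violations


# Hypothetical property values, chosen only to exercise the check.
old_table = {0x0061: "ALWAYS", 0x0041: "NEVER", 0x02B0: "MAYBE YES"}
ok_table = {0x0061: "ALWAYS", 0x0041: "NEVER", 0x02B0: "ALWAYS"}
bad_table = {0x0061: "MAYBE NO", 0x0041: "NEVER", 0x02B0: "MAYBE YES"}

print(stability_violations(old_table, ok_table))
print(stability_violations(old_table, bad_table))
```

Moving a MAYBE value to ALWAYS is allowed by the rule as stated; only
changes away from ALWAYS or NEVER are reported.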

I.e. there is a meta issue here that might end up in much more formal  
liaison statements, agreements and whatnot between various  
organisations. And IF such formality is needed, we need to know what  
the needs are, and who the involved parties are.

> *This is a bit tricky, so please bear with me.* Because I have not
> communicated this effectively so far, please read this over  
> carefully and
> let me know where you have questions or I am unclear. I will letter  
> these
> items for reference.
>
>
> A. There is always a tension between having properties be stable and  
> having
> them be as accurate as possible.

Agreed.

And this is where, I claim, the IETF has concentrated on stability,  
while the Unicode Consortium has concentrated on correctness. And that  
is where the shoe is not fitting very well (as we say in Sweden).

I am not saying here that one of the models is better than the other.  
Different goals.

>   1. Application X wants to get the most accurate information about
>   characters with property X, and doesn't care about compatibility.
>   2. Application Y wants to use property X in some special way (such as
>   identifiers) that requires absolute backwards compatibility. If  
> the Unicode
>   consortium finds out that a character has different properties,  
> that doesn't
>   matter. Compatibility swamps accuracy.
>
> Because of that, all and only the properties on
> http://www.unicode.org/policies/stability_policy.html are guaranteed  
> to be
> stable. That page sets out exactly how the properties are stable.

Thanks for this list. If I compare it with the list I sent to the  
mailing list (the properties the algorithm in the tables document is  
based on), I see that the following are not on your list: script and  
block.

Am I correct?

> The Unicode consortium, however, does have a tool that it has used
> successfully for many years to guarantee absolute stability, *while  
> allowing
> for fixes to underlying properties.*
>
> B. Let's suppose IETF wants to define IDN_ALWAYS as being characters
> according to a given formulation, such as the following (for  
> brevity, X
> means the set of characters having the property X)
>
>   - IDN_ALWAYS = ((X + Y + Z) - W) + V.
>
> The properties X, Y, Z, W, and V are the *underlying properties* for  
> IDN_ALWAYS.
> The Unicode consortium can supply absolute stability for IDN_ALWAYS  
> in the
> following way. In each version of Unicode, the consortium would  
> commit to
> defining the property:
>
> Other_IDN_ALWAYS
>
> to be all those characters that were in IDN_ALWAYS in the previous  
> version,
> but would not be in the current version according to the formulation.
>
> Then the IETF's formulation becomes:
>
>   - IDN_ALWAYS = (((X + Y + Z) - W) + V) + Other_IDN_ALWAYS
>
> That provides absolute stability across versions. Given a version of  
> Unicode
> V and the IETF's formulation, anyone can calculate IDN_ALWAYS for  
> version V,
> and it will always include all characters that were in IDN_ALWAYS for
> version V-1.
>
> Nobody needs to access multiple versions of Unicode to make this
> calculation. As far as I understand them, I believe this satisfies  
> all of
> your requirements for stability.

Understood.
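
The construction in B can be sketched with plain set operations. This is
only an illustration under stated assumptions: the function names are
made up, and the tiny integer sets stand in for real code point sets.

```python
from typing import Set


def formula(x: Set[int], y: Set[int], z: Set[int],
            w: Set[int], v: Set[int]) -> Set[int]:
    """((X + Y + Z) - W) + V, with + as set union and - as set difference."""
    return ((x | y | z) - w) | v


def other_idn_always(previous_idn_always: Set[int],
                     current_formula: Set[int]) -> Set[int]:
    """Characters that were in IDN_ALWAYS in the previous version but
    would fall out under the current version's underlying properties."""
    return previous_idn_always - current_formula


def idn_always(x: Set[int], y: Set[int], z: Set[int],
               w: Set[int], v: Set[int], other: Set[int]) -> Set[int]:
    """(((X + Y + Z) - W) + V) + Other_IDN_ALWAYS."""
    return formula(x, y, z, w, v) | other


# Hypothetical previous version: nothing is grandfathered yet.
prev = idn_always({1, 2}, {3}, set(), {2}, {4}, other=set())

# Hypothetical current version: 3 was "fixed" (moved from Y into W), 5 is new.
cur_x, cur_y, cur_z, cur_w, cur_v = {1, 2, 5}, set(), set(), {2, 3}, {4}
other = other_idn_always(prev, formula(cur_x, cur_y, cur_z, cur_w, cur_v))
cur = idn_always(cur_x, cur_y, cur_z, cur_w, cur_v, other)

print(sorted(prev), sorted(other), sorted(cur))
```

Because the previous IDN_ALWAYS already contains the previous
Other_IDN_ALWAYS, the grandfathered set accumulates from release to
release, and a consumer only needs the current version's data.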

> C. Now, what we would also do in Unicode would be to provide, in each
> version of Unicode, what is called a "derived property", where we go  
> ahead
> and compute the tables for IDN_ALWAYS. This is simply a convenience to
> users, since the vast majority don't want to do the computation  
> themselves;
> they just want to get the values. But someone always could.

Also understood. So far, though, the signals I have heard in the IETF  
are that the algorithm itself (((X + Y + Z) - W) + V) is to be  
explained in an IETF document.

> D. If you want to see an example of this, look at the Unicode  
> identifiers
> over the years. Those use two properties (like C or Java, some  
> characters
> cannot be at the start of identifiers).
>
> ID_Start = [[:L:][:Nl:][:Other_ID_Start:]]
> ID_Continue = [[:L:][:Nl:][:Mn:][:Mc:][:Nd:][:Other_ID_Continue:]]
>
> // this uses the POSIX syntax, whereby [:L:] is the set of all code  
> points C
> such that general_category(C) = Letter.
> // one can of course use other syntax, like Perl's \p{L}, and so on.
>
> We formalized and stabilized them in Unicode 3.0, in 1999.
>
> If you search for Other_ID_Start and Other_ID_Continue in the  
> following
> files, you'll find that over the course of the eight years since  
> then, we've
> added a handful of characters to the grandfathering categories  
> Other_... so
> as to maintain backwards compatibility with each and every release.
>
> http://www.unicode.org/Public/4.0-Update/PropList-4.0.0.txt
> http://www.unicode.org/Public/4.1.0/ucd/PropList.txt
> http://www.unicode.org/Public/5.0.0/ucd/PropList.txt
> http://www.unicode.org/Public/5.1.0/ucd/PropList-5.1.0d20.txt (the  
> current
> beta)
>
> We have a set of tools that we run over each release to verify
> compatibility, and we have a beta period of several months for each  
> release
> where others can run their own external tools as well.

Ok.
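
A rough sketch of how ID_Start could be computed from the definition in
D. Several things here are assumptions: the sample text only imitates
the "range ; property # comment" line format of PropList.txt and is not
a copy of any release, the helper names are invented, and the general
categories come from whatever Unicode version the local Python was
built with.

```python
import unicodedata
from typing import Set

# Simplified sample in the PropList.txt line format (illustrative only).
SAMPLE_PROPLIST = """\
# comment lines and blank lines are skipped

2118          ; Other_ID_Start # SCRIPT CAPITAL P
212E          ; Other_ID_Start # ESTIMATED SYMBOL
309B..309C    ; Other_ID_Start # KATAKANA-HIRAGANA VOICED SOUND MARK..
"""


def parse_property(text: str, name: str) -> Set[int]:
    """Collect the code points listed for one property name."""
    result: Set[int] = set()
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop trailing comment
        if not data:
            continue
        range_part, prop = (field.strip() for field in data.split(";")[:2])
        if prop != name:
            continue
        first, _, last = range_part.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        result.update(range(start, end + 1))
    return result


def is_id_start(ch: str, other_id_start: Set[int]) -> bool:
    """ID_Start = [[:L:][:Nl:][:Other_ID_Start:]]."""
    category = unicodedata.category(ch)
    return (category.startswith("L") or category == "Nl"
            or ord(ch) in other_id_start)


other = parse_property(SAMPLE_PROPLIST, "Other_ID_Start")
for ch in ("a", "\u2118", "1"):
    print(f"U+{ord(ch):04X}", is_id_start(ch, other))
```

U+2118 is neither a letter nor a letter number by general category, so
it only qualifies through the grandfathering set.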

> E. While I do believe that the Unicode consortium would be best  
> placed to
> update the categorization of characters over time, based on the broad
> internationalization expertise of its members, that is an orthogonal  
> issue.
> The IETF could decide that it wants some other group to do that --  
> that does
> not affect the consortium's commitment to provide for stability of  
> IDN. It
> would, of course, require synchronization between the groups, much  
> like the
> extremely successful working relationship between the consortium and  
> the ISO
> subcommittee responsible for ISO 10646.
>
> I'm hoping this makes sense to you now.

Yes.

The questions now I think are:

a. What differences exist between the algorithms in the tables document  
and Unicode "stable" properties?
b. What differences exist between IDNA2003 and the tables document?
c. What differences will exist if the tables document algorithms are  
based only on stable properties (where some of them are only stable  
from Unicode 5.0)?

    Patrik
