Comments on IDNAbis tables-03

Sun Dec 16 11:26:59 CET 2007

Before we go into the details of the comments on this document (thanks
for those), I have to raise an overall issue that has been boiling for
a while, and that has to do with the stability of the properties defined
by the Unicode Consortium. The reason why this discussion is needed before
I start working on the overall issues you raise here will hopefully become
clearer later.

As data is stored in databases (like DNS) for a very, very long time,
anything that compares codepoints based on some property value MUST
only use the derived property values that have the backward
compatibility features you describe. Backwards compatible as in "if
codepoint a has property x in version N of Unicode, it should also
have that property in version N+1".
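
The backward-compatibility requirement above can be expressed as a simple
check. This is only a minimal sketch in Python: the function names are made
up for illustration, and the two "versions" are hypothetical sets of
codepoints having property x, not data from any real Unicode release.

```python
def removed_codepoints(has_x_in_n, has_x_in_n_plus_1):
    """Return codepoints that had property x in version N but not in N+1."""
    return sorted(has_x_in_n - has_x_in_n_plus_1)


def is_backward_compatible(has_x_in_n, has_x_in_n_plus_1):
    """Property x is stable if no codepoint loses it between the versions."""
    return not removed_codepoints(has_x_in_n, has_x_in_n_plus_1)


# Hypothetical data: codepoints with property x in two consecutive versions.
version_n = {0x0061, 0x0062, 0x0430}
stable_next = {0x0061, 0x0062, 0x0430, 0x0431}  # only additions
unstable_next = {0x0061, 0x0062, 0x0431}        # U+0430 lost property x

print(is_backward_compatible(version_n, stable_next))
print(is_backward_compatible(version_n, unstable_next))
print([hex(cp) for cp in removed_codepoints(version_n, unstable_next)])
```

Additions to the set are harmless for stored data; only removals break the
"version N implies version N+1" rule.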

In early discussions on stability, you from the UTC said that things WILL
BE STABLE, and you personally have said so to the IETF several times.
We have shown the algorithms to you several times, and we have
also said the calculations in the IDNAbis document will be based on
base properties and not derived properties -- so that EVERYONE can
easily calculate the derived value if they have the need for it. We
have also agreed that things will NEVER move from ALWAYS or NEVER, and
you have also been part of the discussion regarding Cyrillic and Latin
(as those were said to be "known").

You now come and revoke so many things that you and other UTC people
have stated in the IDNAbis discussion that I do not know how to
continue the work.

The rules will never be based on derived property values. People MUST  
be able to calculate the ALWAYS etc properties given the CURRENT  
Unicode distribution.

The overall goal with IDNAbis is to be independent of the Unicode version.
This implies it MUST be possible for anyone who has "the
distribution of Unicode" to compute the value of the derived property
that tells the status in IDNAbis. An alternative would be to have the
derived property "just" appear as a table that no one but a closed
group can compute (or where codepoints end up through an ad hoc
mechanism).

This does, though, imply that the base properties the algorithm is based
upon are stable. At least stable in the cases where IDNAbis is to
ensure stability. And by this I mean, for example, "codepoints that
are in ALWAYS will never be removed from ALWAYS" etc.

This in turn implies that the calculations and the properties that lead to a
codepoint ending up in ALWAYS or NEVER will never change in such a way
that the calculations lead to a different result in the future.

This is why Cyrillic, Latin etc were selected as pointers to
codepoints that are believed to be stable, so that we dare(!) to put
codepoints in those scripts in the ALWAYS and NEVER categories. For
other scripts we still see changes between the versions of Unicode.
Changes that are large enough that they _CAN_ have implications for the
domain names that are already stored in the DNS database in the form
of registered domain names.

The reason the IETF requires stability, as we have explained before, is
that if a is registered as a domain name, a lookup for a should always
give a match in the future. One must be able to use the domain name
one has registered for all time. This is what IDNAbis
is concentrating on. Ensuring that if a is in ALWAYS and registered in
DNS, it should stay there.

If we then also require that b should be stable because f(b)=a (case
folding etc), then we have a much larger problem. How can we ensure
that b will continue to have the properties needed, and how can we
ensure that the function f(x) is stable by itself?
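
To make the f(b)=a concern concrete, here is a small sketch in Python. The
mapping tables, the function names and the labels are all hypothetical;
they stand in for two versions of some folding function f and are not
taken from Unicode case folding data.

```python
def make_fold(table):
    """Build a folding function f from a per-character mapping table."""
    def fold(label):
        return "".join(table.get(ch, ch) for ch in label)
    return fold


def lookup(label, fold, registry):
    """Apply the mapping f to the label, then look the result up."""
    return fold(label) in registry


# Hypothetical: in version 1, f maps "B" to "a"; in version 2 it does not.
f_v1 = make_fold({"B": "a"})
f_v2 = make_fold({})

registry = {"a"}  # "a" was registered once and stays registered

print(lookup("a", f_v1, registry), lookup("a", f_v2, registry))
print(lookup("B", f_v1, registry), lookup("B", f_v2, registry))
```

The registered name a keeps matching under both versions, while a lookup
for b only works as long as f itself does not change.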

I have heard you say many times when we get this far in the discussion  
"but that is no problem". You even say below that MAYBE YES should be  
removed, as things very easily can be added to the ALWAYS category.

But that is not a statement I agree with, and let me explain why. I  
have two points here to make:

(1) There is currently a suggestion on the Unicore mailing list to
move a codepoint from script Cyrillic to Inherited. This (if we had
taken Inherited into account in the tables document) would move
the codepoint from ALWAYS to CONTEXT according to my preliminary
thinking. But that is not the point. The point is that suddenly I am,
and many people should be, very, very afraid of including the Cyrillic
script in the list of codepoints that are stable enough to have things
in the ALWAYS category. Removing Cyrillic from there has implications
for the ability to register codepoints using Cyrillic as IDN domain
names, and I am pretty sure that change will be discussed at the next
meeting of the Internet Governance Forum. Russia has, as I hope you
know, very strong feelings regarding use of Cyrillic "on the Internet".

That a discussion even exists about changing any properties of a
codepoint that is part of the Cyrillic script surprises me, given the
statements you have made regarding stability.

(2) You have said in mail to me that properties are not stable at all.
This is for me something that is completely orthogonal to statements
similar to "it is easy for people knowing scripts to add more things
to ALWAYS". You have further explained that stability is ensured by
defining a new derived property in the following way:

Say codepoint a has property x. As x is not a stable property (as no
properties are stable), one has a derived property is_or_has_been_x
that all codepoints that either have or have had property x have.
This implies the codepoint a might no longer have property x, but will
have property is_or_has_been_x. If we now base the IDNAbis tables on
this derived property, three things happen:

(a) It is impossible for people outside the Unicode Consortium to
calculate the tables, as one cannot know what codepoints have (since
version N of Unicode) had the property value x, and because of that it
is impossible to know what codepoints have property value
is_or_has_been_x. I.e. only people with inside information on Unicode
Consortium issues can make the calculations resulting in (various
degrees of) stability.

(b) If algorithms like IDNAbis have to have stability, people have to
base algorithms, sorting etc on is_or_has_been_x and not x, and then
the change of codepoint a to remove x from it has no value in reality.
There must be a reason why x was removed from a. But if
is_or_has_been_x is what is used, that change is just void. So why
change it? What will interoperability be between applications using x
and ones using is_or_has_been_x?

This implies people will use the first property value ever assigned to
the codepoint, and that changes are not interesting at all. The real
property values will diverge from the derived ones, but the derived
ones are still the most important ones for historical data.

This to me implies that changing property values is completely useless,
apart from making this a real mess.

(c) All of these claims that something is stable but not stable lead
me to the conclusion that the IDNAbis property cannot be calculated from
the properties the Unicode Consortium has. Instead it has to be based on
derived properties like is_or_has_been_x, or rather, codepoints have to
be hand picked to ensure stability.
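
The problem with a derived property like is_or_has_been_x can be shown with
a few lines of Python. This is a sketch under stated assumptions: the
`history` mapping and its contents are hypothetical and only illustrate
that the derived value is a union over every version, which the current
distribution alone does not contain.

```python
def is_or_has_been_x(history):
    """Union of the codepoints that had property x in any version.

    `history` maps a version label to the set of codepoints that had
    property x in that version (hypothetical data, for illustration).
    """
    result = set()
    for codepoints in history.values():
        result |= codepoints
    return result


# Hypothetical history: U+0430 loses property x in version "N+1".
history = {
    "N": {0x0061, 0x0430},
    "N+1": {0x0061, 0x0431},
}

current = history["N+1"]             # all that the current distribution shows
derived = is_or_has_been_x(history)  # needs every earlier version as well

# Codepoints whose derived value cannot be recomputed from `current` alone.
print([hex(cp) for cp in sorted(derived - current)])
```

Anyone holding only the current distribution computes `current`; the
difference from `derived` is exactly the part that requires the history.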

And this opens up the question whether Unicode codepoints should be
used at all. The IETF could as well use codepoints from ISO 10646, as the
properties Unicode defines do not give any extra value, and then this
discussion can concentrate on what to do with the codepoints. IANA
would then hold a table of the properties based on ISO 10646.

So, before moving forward with IDNAbis, it might be that the IETF will
need a statement from the UTC on which properties will be stable in the
future, and for which codepoints. Only that data is something the
algorithms in the table document can be based upon.

I guess because of this the ball is again on your (as in the Unicode
Consortium's) side of the ballpark.

In the meantime, I will work on the table document and the good comments,
including the ones from you, Mark.

    Patrik

On 14 dec 2007, at 04.45, Mark Davis wrote:

> http://www.ietf.org/internet-drafts/draft-faltstrom-idnabis-tables-03.txt
> Overall
> Comments:
>
>
> Tables-1.
>
> There is no operational difference between MAYBE YES and MAYBE NO, and no
> characters that are in the latter. This distinction is really only
> meaningful as internal tracking information inside whatever group controls
> the future allocation of characters and should not appear here. (See also
> Ken's email and trail under "Table issues (was: Re: IDNAbis documents)".)
>
> Even further, MAYBE YES should not exist at all: a day or two of  
> work by
> script experts would be enough to move the vast majority of the  
> current
> 'MAYBE YES' to the ALWAYS category.
>
> Tables-2.
>
> There is a preference for Latin, Greek, Cyrillic, and Han which has no
> principled basis. In particular, Latin, Cyrillic, and Han are some of the
> most complicated scripts: Latin and Cyrillic, since they are used to write a
> huge number of languages with a large number of variant characters, and Han
> because of the history of character variations. Many, many scripts are less
> problematic than Latin or Cyrillic, and there is no reason to favor Cyrillic
> over say Armenian; it also gives the appearance of Eurocentrism where none
> is intended.
>
>
> From an old email:
>
> "No reason is given for the focus on only European scripts; and that  
> focus
> will surely raise suspicions in many circles. While I'm sure that the
> restriction to European languages is just because those are the ones  
> the
> small group of authors is familiar with, it will not be received  
> well. If
> "we the community" have "experienced that a number of scripts have  
> issues
> that are not resolved", then those problems should be enumerated
> *explicitly*, not hidden away.
>
> The situation might be different if we were starting from zero; but  
> we are
> not. We already have an IDNA system that works for a great many  
> people. And
> while there are security problems with it, those are well known and  
> vendors
> are dealing with them. Moreover, of the problems that IDNAbis  
> solves, they
> are just the easy ones -- the harder ones are ones like the  
> "paypal.com"
> case, which the current suggestion for IDNAbis doesn't touch. So it  
> feels
> like we are looking at a proposal that:
>
> 1. doesn't actually help much with the practical problems that  
> people face
> 2. solves the easy problems, but not the hard ones; so people have to
> essentially do the work anyway
> 3. and removes much of the functionality, except for some favored  
> groups:
> Europe and the Americas"
>
> Tables-3.
>
> The CONTEXT class should be heavily restricted, as per Ken's email,  
> to only
> 2 characters (see "Table issues (Part 3: CONTEXT)" for details).  
> Moreover,
> the term Context is problematic: **many** characters are disallowed or
> allowed, depending on context. Even a-z are disallowed in a field  
> that also
> contains RTL characters.
>
> Tables-4.
>
> The list of historic scripts is very outdated. See
> http://www.unicode.org/reports/tr31/tr31-8.html#Specific_Character_Adjustments
> for more details. The characters in Table 3 should also be reviewed as
> possible exceptions.
>
> Tables-5.
>
> Key to the success of this is the group that determines the future
> allocation of characters. It must be very clear precisely what the  
> grounds
> are for removing characters (moving from MAYBE to NEVER); otherwise  
> there
> will be never-ending battles over individual characters. (Frankly, I  
> believe
> that the correct course of action would be to disallow the historic  
> scripts
> for now, but allow the characters in all other scripts, with very few
> exceptions.)
>
> Tables-6.
>
> Like draft-alvestrand-idna-bidi-01.txt
> <http://www.ietf.org/internet-drafts/draft-alvestrand-idna-bidi-01.txt>,
> there should be at least one example motivating every case where a class of
> characters is removed (this might be in one of the other documents instead
> of here).
>
> Tables-7.
>
> The entire description of the process is far too complicated for what is, at
> core, a relatively simple process. It is further obfuscated by referring to
> classes of characters by a letter category instead of a mnemonic.
>
> Take the following from draft-faltstrom-idnabis-tables-03.txt
> <http://www.ietf.org/internet-drafts/draft-faltstrom-idnabis-tables-03.txt>:
>
>      *  If the codepoint does not appear in any of the categories B
>         (Section 2.1.2), C (Section 2.1.3), D (Section 2.1.4), E
>         (Section 2.1.5) or F (Section 2.1.6), the value is ALWAYS.
>
> That formulation is completely opaque. I'd strongly recommend that, for
> transparency, you reformulate this considerably. You could maintain part of
> the structure that you have, if you wanted, by consistently using mnemonics
> instead of Sections.
>
> That is, give meaningful names to each Category in Section 2, such as:
>
> A => Language-Characters
> B => Unnormalized
> C => Ignorable
> D => Historical-Scripts
> E => Disallowed-Blocks
> ...
>
> The formulation can then be something like the following. (This is not
> precisely equivalent to your formulation, which I found difficult to  
> follow
> -- it is the style of presentation that I'm focusing on).
>
> Use the following procedure to determine the IDNA-Property of any  
> code point
> cp. Proceed through the rules, and return a value at the first that  
> applies.
>
> Exceptions
> 1a. If cp is in Exceptional-Always, return Always
> 1b. If cp is in Exceptional-Never, return Never
> 1c. If cp is in Exceptional-Maybe, return Maybe
>
> Functional Exclusions
> 2. Else if cp is in Unnormalized, return Never
> 3. Else if cp is in Not-Case-Folded, return Never
> 4. Else if cp is in Ignorable, return Never
>
> Usage Exclusions
> 5. Else if cp is in Historical-Scripts, return Never
> 6. Else if cp is in Disallowed-Blocks, return Never
>
> LMN Inclusion
> 7. Else if cp is in Language-Characters, return Maybe
>
> Exclude everything else
> 8. Else return Never
>
> Note: Exceptional-Always would contain your Category H Always characters,
> plus grandfathered Always characters, plus a-z, 0-9, -; Exceptional-Maybe
> would add the Category H Maybe characters, and so on. The mechanism already
> described in email for providing perfect stability would be to add
> characters, where necessary, to these classes.
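
The quoted procedure is a first-match rule list, which can be sketched in
Python as below. The category names come from the quoted text; the function
name, the `sets` argument and the tiny example memberships are assumptions
made for illustration and are not the contents of the real tables.

```python
# Rules in the order given in the quoted procedure (1a..7); rule 8 is the
# fallback. Each entry is (category name, value returned on a match).
RULES = [
    ("Exceptional-Always", "Always"),   # 1a
    ("Exceptional-Never", "Never"),     # 1b
    ("Exceptional-Maybe", "Maybe"),     # 1c
    ("Unnormalized", "Never"),          # 2
    ("Not-Case-Folded", "Never"),       # 3
    ("Ignorable", "Never"),             # 4
    ("Historical-Scripts", "Never"),    # 5
    ("Disallowed-Blocks", "Never"),     # 6
    ("Language-Characters", "Maybe"),   # 7
]


def idna_property(cp, sets):
    """Return the value of the first rule whose category contains cp."""
    for category, value in RULES:
        if cp in sets.get(category, set()):
            return value
    return "Never"                      # 8: exclude everything else


# Illustrative memberships only: a-z, 0-9 and "-" as Exceptional-Always
# (as in the quoted Note), and two Cyrillic letters as sample inputs.
example_sets = {
    "Exceptional-Always": set(range(0x61, 0x7B)) | set(range(0x30, 0x3A)) | {0x2D},
    "Not-Case-Folded": {0x0410},
    "Language-Characters": {0x0410, 0x0430},
}

for cp in (0x0061, 0x0410, 0x0430, 0x0020):
    print(hex(cp), idna_property(cp, example_sets))
```

Because the first matching rule wins, adding a codepoint to one of the
Exceptional sets overrides whatever the later rules would have returned.
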
>
> Details:
> Tables-8.
>
>      a character is never removed from
>      it unless it is removed from Unicode.
>
> This is not necessary. If you really have to have it, then add  
> "(however,
> the Unicode stability policies expressly forbid this)"
>
>
> Tables-9.
>
> Re. Appendix A. There seem to be some errors in the generation of this
> table. The code point range should be "0x0000 - 0x10FFFF".
>
>
> Tables-10
>
>
> The derivation of the table did not correctly distinguish  
> *unassigned* code
> points from *noncharacter* code points. Unassigned code points are
> "<reserved>" and are available for future encoding of characters,  
> whereas
> noncharacter code points are *not* "<reserved (for future  
> assignment)>" --
> they are designated functions, constitute a kind of internal private  
> use,
> and are disallowed for interchange. (See Table 2-3, TUS 5.0, p. 27.)  
> If PUA
> code points (e.g. U+E000..U+F8FF) are to be NEVER in this table,  
> then the
> noncharacters must be NEVER, rather than UNASSIGNED.
>
> Tables-10a
>
>
> In general, having this Appendix A listing include UNASSIGNED code  
> points is
> both distracting (from the other, more meaningful values) and an  
> error-prone
> reduplication of effort. The listing of gc=Cn values is already  
> available
> directly from:
>
> http://www.unicode.org/Public/UNIDATA/extracted/DerivedGeneralCategory.txt
>
> And that file *does* make the distinction between true unassigned code
> points and noncharacter code points (both of which are gc=Cn, but  
> which
> differ in the Noncharacter_Code_Point property [see PropList.txt].)  
> The
> derivation for the IDN inclusion table needs to pay attention to  
> *both*
> gc=Cn and Noncharacter_Code_Point=True. What *would* make sense is  
> for the
> Appendix listing to correctly identify the noncharacters as NEVER.  
> The fact
> that it doesn't suggests that there is an error in the way the  
> calculation
> is handling Category D.
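
The distinction between unassigned and noncharacter code points can be
sketched in Python. The function names are made up for illustration. The
noncharacter set itself is fixed by the Unicode Standard (U+FDD0..U+FDEF
plus the last two code points of every plane); the General_Category lookup
uses the standard `unicodedata` module, which reflects whatever Unicode
version the Python interpreter ships with.

```python
import unicodedata


def is_noncharacter(cp):
    """True for the 66 noncharacter code points.

    These are U+FDD0..U+FDEF and the code points ending in FFFE or FFFF
    in each of the 17 planes.
    """
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def classify_cn(cp):
    """Split gc=Cn code points into noncharacters and truly unassigned."""
    if unicodedata.category(chr(cp)) != "Cn":
        return "not Cn"
    return "noncharacter" if is_noncharacter(cp) else "unassigned"


# Count over the full code point range 0x0000 - 0x10FFFF.
print(sum(is_noncharacter(cp) for cp in range(0x110000)))
print(classify_cn(0xFDD0), classify_cn(0xFFFF), classify_cn(0x0061))
```

A derivation that looks only at gc=Cn would put both groups in the same
bucket; checking the noncharacter set first keeps them apart.
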
>
>
> Tables-11
>
>
> Another general issue with the document, table, and Section 3, Calculation
> of the Derived Property: The possible values of the IDN property still
> include a value MAYBE NOT, but in fact the calculation has no branch now
> that assigns a MAYBE NOT value, and the table contains no MAYBE NOT
> characters. Either the thinking about "MAYBE NOT" has changed, and the
> document hasn't caught up to that yet, or there is an error in how the
> calculation has been set up. As it is now, nearly all of the "MAYBE NOT"
> values from the 01 version of this ID are now listed in the Appendix as
> "NEVER". As "NEVER", they would be prohibited from any future consideration
> for IDN, which seems at odds with the tenor of the text describing "MAYBE
> NOT".
>
> Tables-12
>
>
> Section 4. Codepoints states:
>
> "The Categories and Rules defined in Section 2 and Section 3 apply  
> to all
> assigned Unicode characters." In fact they also apply to  
> *unassigned* code
> points as well.
>
> The correct formulation would be:
>
> "The Categories and Rules defined in Section 2 and Section 3 apply  
> to all
> Unicode codepoints, assigned or unassigned."
>
> [Note: the Unicode Standard systematically uses a space in the term  
> "code
> point", as well as for "code unit", "code position", "code value",  
> etc. But
> given that this document uses "codepoint" everywhere, I'm not  
> suggesting
> that be changed. Nobody is going to be confused as to what the word  
> means.]
>
>
> Tables-13
>
> "Once assigned to this category, a character is never removed from  
> it unless
> it is removed from Unicode."
>
> The qualification "unless it is removed from Unicode" is vacuous.  
> Since
> Unicode 1.1, no character ever has been removed from Unicode, nor  
> will any
> be -- in part because no character will ever be removed from ISO/IEC  
> 10646.
>
> So this qualification is a little like qualifying the definition of ASCII LDH
> as "{0061..007A, 0030..0039, 002D} and no characters will be removed from
> this definition unless they are removed from ASCII."
>
> So I suggest just removing the vacuous qualification.
>
>
> Tables-14
>
>
> The grandfathering technique needs to be used so as to preserve  
> stability,
> since characters may change script. (See the email trail under  
> "Table issues
> (Part 2)" for details).