Comments on IDNAbis tables-03
Patrik Fältström
patrik at frobbit.se
Sun Dec 16 11:26:59 CET 2007
Before we go into the details of the comments on this document (thanks
for those), I have to raise an overall issue that has been boiling for
a while, and that has to do with the stability of the properties
defined by the Unicode Consortium. The reason why this discussion is
needed before I start working on the issues you raise here will
hopefully be clearer later.
As data is stored in databases (like DNS) for a very, very long time,
anything that compares codepoints based on some property value MUST
only use the derived property values that have the backward
compatibility features you describe. Backwards compatible as in "if
codepoint a has property x in version N of Unicode, it should also
have that property in version N+1".
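A minimal sketch of the backward-compatibility requirement just stated.
It compares two hypothetical property snapshots (plain dicts mapping a
codepoint to the set of property values it carries in versions N and
N+1) and reports every codepoint that lost a property. The function
name and the sample data are assumptions for illustration only; nothing
here is read from an actual Unicode data file.

```python
def lost_properties(old, new):
    """Return {codepoint: properties present in `old` but missing in `new`}."""
    violations = {}
    for cp, props in old.items():
        missing = props - new.get(cp, set())
        if missing:
            violations[cp] = missing
    return violations


# Hypothetical snapshots: codepoint -> set of property values.
version_n = {0x0061: {"x"}, 0x0062: {"x", "y"}}
version_n_plus_1 = {0x0061: {"x"}, 0x0062: {"y"}, 0x0063: {"x"}}

violations = lost_properties(version_n, version_n_plus_1)
for cp, props in sorted(violations.items()):
    print(f"U+{cp:04X} lost {sorted(props)}")
```

An empty result means the newer snapshot is backward compatible with
the older one in the sense above; newly added codepoints or properties
are not violations.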
In early discussions on stability, you from UTC said that things WILL
BE STABLE, and you personally have said so to the IETF several times.
We have shown the algorithms to you several times, and we have
also said the calculations in the IDNAbis document will be based on
base properties and not derived properties -- so that EVERYONE can
easily calculate the derived value if they have the need for it. We
have also agreed that things will NEVER move from ALWAYS or NEVER, and
you have also been part of the discussion regarding Cyrillic and Latin
(as those were said to be "known").
You now come and revoke so many things that you and other UTC people
have stated in the IDNAbis discussion that I do not know how to
continue the work.
The rules will never be based on derived property values. People MUST
be able to calculate the ALWAYS etc properties given the CURRENT
Unicode distribution.
The overall goal with IDNAbis is to be independent of Unicode version.
This implies it MUST be possible for anyone that has "the
distribution of Unicode" to compute the value of the derived property
that tells the status in IDNAbis. An alternative would be to have the
derived property "just" appearing as a table that no one but a closed
group can compute (or where codepoints end up through an ad hoc
mechanism).
This does, though, imply that the base properties the algorithm is
based upon are stable. At least stable in the cases where IDNAbis is to
ensure stability. And by this I mean for example "codepoints that
are in ALWAYS will never be removed from ALWAYS" etc.
This in turn implies that the calculations and the properties that
lead to a codepoint ending up in ALWAYS or NEVER will never change in
such a way that the calculations lead to a different result in the
future.
This is why Cyrillic, Latin etc. were selected as pointers to
codepoints that are believed to be stable, so that we dare(!) to put
codepoints in those scripts in the ALWAYS and NEVER categories. For
other scripts we still see changes between versions of Unicode.
Changes that are large enough that they _CAN_ have implications for
the domain names that are already stored in the DNS database in the
form of registered domain names.
The reason the IETF requires stability, as we have explained before, is
that if a is registered as a domain name, a lookup for a should always
give a match in the future. One must be able to use the domain name
one has registered for all time. This is what IDNAbis is concentrating
on. Ensuring that if a is in ALWAYS and registered in DNS, it should
stay there.
If we then require that b should also be stable because f(b)=a (case
folding etc.), then we have a much larger problem. How can we ensure
that b will continue to have the properties needed, and how can we
ensure that the function f(x) is stable by itself?
I have heard you say many times when we get this far in the discussion
"but that is no problem". You even say below that MAYBE YES should be
removed, as things can very easily be added to the ALWAYS category.
But that is not a statement I agree with, and let me explain why. I
have two points to make here:
(1) There is currently a suggestion on the Unicore mailing list to
move a codepoint from script Cyrillic to Inherited. This (if we had
taken Inherited into account in the tables document) would move
the codepoint from ALWAYS to CONTEXT according to my preliminary
thinking. But that is not the point. The point is that suddenly I am,
and many people should be, very, very afraid of including the Cyrillic
script in the list of scripts whose codepoints are stable enough to
have things in the ALWAYS category. Removing Cyrillic from there has
implications for the ability to register codepoints using Cyrillic as
IDN domain names, and I am pretty sure that change will be discussed
at the next meeting of the Internet Governance Forum. Russia has, as I
hope you know, very strong feelings regarding use of Cyrillic "on the
Internet".
That a discussion even exists about changing any properties of a
codepoint that is part of the Cyrillic script surprises me given the
statements you have made regarding stability.
(2) You have said in mail to me that properties are not at all stable.
This is for me something that is completely orthogonal to statements
similar to "it is easy for people knowing scripts to add more things
to ALWAYS". You have further explained that stability is ensured by
defining a new derived property in the following way:
Say codepoint a has property x. As x is not a stable property (as no
properties are stable), one has a derived property is_or_has_been_x
that all codepoints that either have or have had that property have.
This implies that codepoint a might no longer have property x, but will
have property is_or_has_been_x. If we now base the IDNAbis tables on
this derived property, three things happen:
(a) It is impossible for people outside the Unicode Consortium to
calculate the tables, as one cannot know what codepoints have (since
version N of Unicode) had the property value x, and because of that it
is impossible to know what codepoints have property value
is_or_has_been_x. I.e. only people with inside information on Unicode
Consortium issues can make the calculations resulting in (various
degrees of) stability.
(b) If algorithms like IDNAbis have to have stability, people have to
base algorithms, sorting etc. on is_or_has_been_x and not x, and then
the change of codepoint a to remove x from it has no value in reality.
There must be a reason why x was removed from a. But if
is_or_has_been_x is what is used, that change is just void. So why
change? What will interoperability be between applications using x
and ones using is_or_has_been_x?
This implies people will use the first property value ever assigned to
the codepoint, and that changes are not interesting at all. The real
property values will diverge from the derived ones, but the derived
ones are still the most important ones for historical data.
This to me implies that changing property values is completely useless,
apart from making this a real mess.
(c) All of these claims that something is stable but not stable lead
me to the conclusion that the IDNAbis property cannot be calculated
from the properties the Unicode Consortium has. Instead it has to be
based on derived properties like is_or_has_been_x, or rather,
codepoints have to be hand-picked to ensure stability.
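To make point (a) concrete, here is a minimal sketch of how a derived
property like is_or_has_been_x would have to be computed: as a union
over every version's snapshot, not just the current one. The snapshot
format (dicts of codepoint to property set) and the sample history are
assumptions for illustration only.

```python
def is_or_has_been(history, prop):
    """Codepoints that carry `prop` in at least one snapshot in `history`.

    `history` is a list of {codepoint: set of properties}, oldest first.
    """
    result = set()
    for snapshot in history:
        result |= {cp for cp, props in snapshot.items() if prop in props}
    return result


# Hypothetical history: codepoint 0x0062 loses property x in version N+1.
history = [
    {0x0061: {"x"}, 0x0062: {"x"}},  # version N
    {0x0061: {"x"}, 0x0062: set()},  # version N+1
]

derived = is_or_has_been(history, "x")            # needs the full history
current_only = is_or_has_been(history[-1:], "x")  # current distribution only

print(sorted(derived - current_only))  # codepoints the current data misses
```

Someone holding only the current distribution gets `current_only`, and
has no way to recover the codepoints that are in `derived` but not in
it.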
And this opens up the question whether Unicode codepoints should be
used at all. The IETF could as well use codepoints from ISO 10646, as
the properties Unicode defines do not give any extra value, and then
this discussion can concentrate on what to do with the codepoints. IANA
would then hold a table of the properties based on ISO 10646.
So, before moving forward with IDNAbis, it might be that the IETF will
need a statement from UTC on what properties will be stable in the
future, and for what codepoints. Only that data is something the
algorithms in the tables document can be based upon.
I guess because of this the ball is again in your (as in the Unicode
Consortium's) court.
In the meantime, I will work on the tables document and the good
comments, including the ones from you, Mark.
Patrik
On 14 dec 2007, at 04.45, Mark Davis wrote:
> http://www.ietf.org/internet-drafts/draft-faltstrom-idnabis-tables-03.txt
> Overall Comments:
>
>
> Tables-1.
>
> There is no operational difference between MAYBE YES and MAYBE NO, and
> no characters that are in the latter. This distinction is really only
> meaningful as internal tracking information inside whatever group
> controls the future allocation of characters and should not appear
> here. (See also Ken's email and trail under "Table issues (was: Re:
> IDNAbis documents)".)
>
> Even further, MAYBE YES should not exist at all: a day or two of work
> by script experts would be enough to move the vast majority of the
> current 'MAYBE YES' to the ALWAYS category.
>
> Tables-2.
>
> There is a preference for Latin, Greek, Cyrillic, and Han which has no
> principled basis. In particular, Latin, Cyrillic, and Han are some of
> the most complicated scripts: Latin and Cyrillic, since they are used
> to write a huge number of languages with a large number of variant
> characters, and Han because of the history of character variations.
> Many, many scripts are less problematic than Latin or Cyrillic, and
> there is no reason to favor Cyrillic over say Armenian; it also gives
> the appearance of Eurocentrism where none is intended.
>
>
> From an old email:
>
> "No reason is given for the focus on only European scripts; and that
> focus will surely raise suspicions in many circles. While I'm sure that
> the restriction to European languages is just because those are the
> ones the small group of authors is familiar with, it will not be
> received well. If "we the community" have "experienced that a number of
> scripts have issues that are not resolved", then those problems should
> be enumerated *explicitly*, not hidden away.
>
> The situation might be different if we were starting from zero; but we
> are not. We already have an IDNA system that works for a great many
> people. And while there are security problems with it, those are well
> known and vendors are dealing with them. Moreover, of the problems that
> IDNAbis solves, they are just the easy ones -- the harder ones are ones
> like the "paypal.com" case, which the current suggestion for IDNAbis
> doesn't touch. So it feels like we are looking at a proposal that:
>
> 1. doesn't actually help much with the practical problems that people
>    face
> 2. solves the easy problems, but not the hard ones; so people have to
>    essentially do the work anyway
> 3. and removes much of the functionality, except for some favored
>    groups: Europe and the Americas"
>
> Tables-3.
>
> The CONTEXT class should be heavily restricted, as per Ken's email, to
> only 2 characters (see "Table issues (Part 3: CONTEXT)" for details).
> Moreover, the term Context is problematic: **many** characters are
> disallowed or allowed, depending on context. Even a-z are disallowed in
> a field that also contains RTL characters.
>
> Tables-4.
>
> The list of historic scripts is very outdated. See
> http://www.unicode.org/reports/tr31/tr31-8.html#Specific_Character_Adjustments
> for more details. The characters in Table 3 should also be reviewed as
> possible exceptions.
>
> Tables-5.
>
> Key to the success of this is the group that determines the future
> allocation of characters. It must be very clear precisely what the
> grounds are for removing characters (moving from MAYBE to NEVER);
> otherwise there will be never-ending battles over individual
> characters. (Frankly, I believe that the correct course of action would
> be to disallow the historic scripts for now, but allow the characters
> in all other scripts, with very few exceptions.)
>
> Tables-6.
>
> Like draft-alvestrand-idna-bidi-01.txt
> <http://www.ietf.org/internet-drafts/draft-alvestrand-idna-bidi-01.txt>,
> there should be at least one example motivating every case where a
> class of characters is removed (this might be in one of the other
> documents instead of here).
>
> Tables-7.
>
> The entire description of the process is far too complicated for what
> is, at core, a relatively simple process. It is further obfuscated by
> referring to classes of characters by a letter category instead of a
> mnemonic.
>
> Take the following from draft-faltstrom-idnabis-tables-03.txt
> <http://www.ietf.org/internet-drafts/draft-faltstrom-idnabis-tables-03.txt>
>
> * If the codepoint does not appear in any of the categories B
> (Section 2.1.2), C (Section 2.1.3), D (Section 2.1.4), E
> (Section 2.1.5) or F (Section 2.1.6), the value is ALWAYS.
>
> That formulation is completely opaque. I'd strongly recommend for
> transparency you reformulate this considerably. You could maintain part
> of the structure that you have, if you wanted, by consistently using
> mnemonics instead of Sections.
>
> That is, give meaningful names to each Category in Section 2, such as:
>
> A => Language-Characters
> B => Unnormalized
> C => Ignorable
> D => Historical-Scripts
> E => Disallowed-Blocks
> ...
>
> The formulation can then be something like the following. (This is not
> precisely equivalent to your formulation, which I found difficult to
> follow -- it is the style of presentation that I'm focusing on).
>
> Use the following procedure to determine the IDNA-Property of any code
> point cp. Proceed through the rules, and return a value at the first
> that applies.
>
> Exceptions
> 1a. If cp is in Exceptional-Always, return Always
> 1b. If cp is in Exceptional-Never, return Never
> 1c. If cp is in Exceptional-Maybe, return Maybe
>
> Functional Exclusions
> 2. Else if cp is in Unnormalized, return Never
> 3. Else if cp is in Not-Case-Folded, return Never
> 4. Else if cp is in Ignorable, return Never
>
> Usage Exclusions
> 5. Else if cp is in Historical-Scripts, return Never
> 6. Else if cp is in Disallowed-Blocks, return Never
>
> LMN Inclusion
> 7. Else if cp is in Language-Characters, return Maybe
>
> Exclude everything else
> 8. Else return Never
>
> Note: Exceptional-Always would contain your Category H Always
> characters, plus grandfathered Always characters, plus a-z, 0-9, -;
> Exceptional-Maybe would add the Category H Maybe characters, and so on.
> The mechanism already described in email for providing perfect
> stability would be to add characters, where necessary, to these
> classes.
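The first-match procedure quoted above can be sketched as an ordered
rule table. This is only an illustration of the presentation style: the
class names follow the quoted mnemonics, but the class contents below
are hypothetical sample data, not the real membership of any category.

```python
# Ordered (class name, returned value) pairs, mirroring rules 1a..7.
RULES = [
    ("Exceptional-Always", "Always"),   # 1a
    ("Exceptional-Never", "Never"),     # 1b
    ("Exceptional-Maybe", "Maybe"),     # 1c
    ("Unnormalized", "Never"),          # 2
    ("Not-Case-Folded", "Never"),       # 3
    ("Ignorable", "Never"),             # 4
    ("Historical-Scripts", "Never"),    # 5
    ("Disallowed-Blocks", "Never"),     # 6
    ("Language-Characters", "Maybe"),   # 7
]


def idna_property(cp, classes):
    """Return the value of the first rule whose class contains `cp`."""
    for name, value in RULES:
        if cp in classes.get(name, set()):
            return value
    return "Never"  # rule 8: exclude everything else


# Hypothetical class contents, for illustration only.
classes = {
    # a-z, 0-9 and hyphen, as in the quoted note on Exceptional-Always
    "Exceptional-Always": set(range(0x61, 0x7B)) | set(range(0x30, 0x3A)) | {0x2D},
    "Not-Case-Folded": {0x0041},
    "Language-Characters": {0x0041, 0x00E5},
}

for cp in (0x0061, 0x0041, 0x00E5, 0x0020):
    print(f"U+{cp:04X} -> {idna_property(cp, classes)}")
```

Because the rules are evaluated in order, U+0041 in this sample data is
caught by rule 3 before rule 7 is ever consulted.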
>
> Details:
> Tables-8.
>
> a character is never removed from
> it unless it is removed from Unicode.
>
> This is not necessary. If you really have to have it, then add
> "(however, the Unicode stability policies expressly forbid this)"
>
>
> Tables-9.
>
> Re. Appendix A. There seem to be some errors in the generation of this
> table. The code point range should be "0x0000 - 0x10FFFF".
>
>
> Tables-10
>
>
> The derivation of the table did not correctly distinguish *unassigned*
> code points from *noncharacter* code points. Unassigned code points are
> "<reserved>" and are available for future encoding of characters,
> whereas noncharacter code points are *not* "<reserved (for future
> assignment)>" -- they are designated functions, constitute a kind of
> internal private use, and are disallowed for interchange. (See Table
> 2-3, TUS 5.0, p. 27.) If PUA code points (e.g. U+E000..U+F8FF) are to
> be NEVER in this table, then the noncharacters must be NEVER, rather
> than UNASSIGNED.
>
> Tables-10a
>
>
> In general, having this Appendix A listing include UNASSIGNED code
> points is both distracting (from the other, more meaningful values) and
> an error-prone reduplication of effort. The listing of gc=Cn values is
> already available directly from:
>
> http://www.unicode.org/Public/UNIDATA/extracted/DerivedGeneralCategory.txt
>
> And that file *does* make the distinction between true unassigned code
> points and noncharacter code points (both of which are gc=Cn, but which
> differ in the Noncharacter_Code_Point property [see PropList.txt].) The
> derivation for the IDN inclusion table needs to pay attention to *both*
> gc=Cn and Noncharacter_Code_Point=True. What *would* make sense is for
> the Appendix listing to correctly identify the noncharacters as NEVER.
> The fact that it doesn't suggests that there is an error in the way the
> calculation is handling Category D.
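A minimal sketch of the distinction drawn in Tables-10 and Tables-10a:
gc=Cn alone does not separate noncharacters from truly unassigned code
points, so a second test is needed. The noncharacter set used here
(U+FDD0..U+FDEF plus the last two code points of each of the 17 planes)
is fixed by the Unicode Standard; the function names and return labels
are assumptions for illustration, and Python's unicodedata module
reflects whatever Unicode version ships with the interpreter.

```python
import unicodedata


def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and for U+nFFFE / U+nFFFF in every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def classify_cn(cp):
    """Split gc=Cn into 'noncharacter' and 'unassigned'; anything else is 'other'."""
    if unicodedata.category(chr(cp)) != "Cn":
        return "other"
    return "noncharacter" if is_noncharacter(cp) else "unassigned"


for cp in (0x0061, 0xFDD0, 0xFFFF, 0x10FFFE):
    print(f"U+{cp:04X}: gc={unicodedata.category(chr(cp))}, {classify_cn(cp)}")
```

A table derivation that looks only at gc=Cn would put all of these Cn
code points in the same bucket, which is the error described above.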
>
>
> Tables-11
>
>
> Another general issue with the document, table, and Section 3,
> Calculation of the Derived Property: The possible values of the IDN
> property still include a value MAYBE NOT, but in fact the calculation
> has no branch now that assigns a MAYBE NOT value, and the table
> contains no MAYBE NOT characters. Either the thinking about "MAYBE NOT"
> has changed, and the document hasn't caught up to that yet, or there is
> an error in how the calculation has been set up. As it is now, nearly
> all of the "MAYBE NOT" values from the 01 version of this ID are now
> listed in the Appendix as "NEVER". As "NEVER", they would be prohibited
> from any future consideration for IDN, which seems at odds with the
> tenor of the text describing "MAYBE NOT".
>
> Tables-12
>
>
> Section 4. Codepoints states:
>
> "The Categories and Rules defined in Section 2 and Section 3 apply to
> all assigned Unicode characters." In fact they also apply to
> *unassigned* code points as well.
>
> The correct formulation would be:
>
> "The Categories and Rules defined in Section 2 and Section 3 apply to
> all Unicode codepoints, assigned or unassigned."
>
> [Note: the Unicode Standard systematically uses a space in the term
> "code point", as well as for "code unit", "code position", "code
> value", etc. But given that this document uses "codepoint" everywhere,
> I'm not suggesting that be changed. Nobody is going to be confused as
> to what the word means.]
>
>
> Tables-13
>
> "Once assigned to this category, a character is never removed from it
> unless it is removed from Unicode."
>
> The qualification "unless it is removed from Unicode" is vacuous.
> Since Unicode 1.1, no character ever has been removed from Unicode, nor
> will any be -- in part because no character will ever be removed from
> ISO/IEC 10646.
>
> So this quibble is a little like qualifying the definition of ASCII LDH
> as "{0061..007A, 0030..0039, 002D} and no characters will be removed
> from this definition unless they are removed from ASCII."
>
> So I suggest just removing the vacuous qualification.
>
>
> Tables-14
>
>
> The grandfathering technique needs to be used so as to preserve
> stability, since characters may change script. (See the email trail
> under "Table issues (Part 2)" for details).