Comments on IDNAbis tables-03

JFC Morfin jefsey at jefsey.com
Sun Dec 16 18:25:09 CET 2007


Patrik and Mark,
the point is well made.

(1) ISO 10646 would certainly be a better approach in terms of code 
clarity and international policy, but would suffer from the same 
problem of being used for what it has not been designed for.

(2) To help Mark and the Unicode members understand the problem: it 
is the same one I have with his RFC 4646, the other way around. For 
RFC 4646 langtags, Mark needed language regions that would be stable 
forever, and did not want the ISO 639-6 solution adopted by ISO. In 
the same way, IDNA calls for eternal code points, and he wants IDNA 
to use Unicode.

What he did is what Patrik proposes. He took the ISO 3166 country 
codes, named them "regions", and decided that langtags would use them 
as eternal geographical codes, not following ISO 3166/MA code changes. 
We objected to the resulting change in the nature and control of the 
ISO 3166 codes. Patrik proposes the same thing: to use the current 
Unicode (or probably ISO 10646, for better political/pragmatic 
acceptance) and not to follow its changes, adapting the IANA code 
point table in its own way to protect its eternal stability. This 
would mean an IETF review system independent of the Unicode or ISO 
review system, leading to a loss of interoperability.

The RFC 4646 consensus we reached (which the IESG unfortunately does 
not respect, so we lack experience) is that the review mechanism is 
an open, official, IANA-hosted process, able to welcome and be joined 
by all the other concerned parties, so that updates/changes can be 
approved simultaneously by everyone in the same way (i.e. respecting 
the DNS demands as well). Experience has shown us that the LSR review 
process can also be improved, for example with a troika as reviewer 
(proposed by Mark; I suggest structural members for that troika, so 
that it is a way to officialise the cooperation with ISO, etc.). I 
think this is the best we can do as long as we use Unicode, short of 
transferring IDNA to Unicode and having them bear the registries' and 
registrants' unhappiness. The IETF should then join the work I call 
for in order to build a universal secure semiotic code and prepare an 
alternative IDN support solution, after i-DNs and Unicode.

jfc




At 11:26 16/12/2007, Patrik Fältström wrote:

>Before we go into the details of the comments of this document (thanks
>for those), I have to raise an overall issue that has been boiling for
>a while, and that has to do with stability of the properties defined
>by the Unicode Consortium. The reason why this discussion is needed
>before I start working on the overall issues you raise here will
>hopefully be clearer later.
>
>As data is stored in databases (like DNS) for a very very long time,
>anything that compares codepoints based on some property value MUST
>only use the derived property values that have the backward
>compatibility features you describe. Backwards compatible as in "if
>codepoint a has property x in version N of Unicode, it should also
>have that property in version N+1".
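The backward-compatibility rule quoted just above can be sketched as a comparison between two versions of a property table. This is only an illustration: the function name and the toy tables are hypothetical, and real input would have to be extracted from the Unicode Character Database files of each version.

```python
def stability_violations(old, new):
    """Return the code points whose property value in `old` differs in `new`.

    `old` and `new` map code points to a property value for versions N and
    N+1. Code points that are only present in `new` (newly assigned ones)
    are not reported, since nothing stored earlier can depend on them.
    """
    return sorted(cp for cp, value in old.items() if new.get(cp) != value)


# Toy tables with abstract property values, not real Unicode data.
version_n = {0x0001: "x", 0x0002: "y"}
version_n_plus_1 = {0x0001: "x", 0x0002: "z", 0x0003: "x"}

print([hex(cp) for cp in stability_violations(version_n, version_n_plus_1)])
```

An empty result would mean that every code point kept the value it had in version N.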
>
>In early discussions on stability, you from UTC said that things WILL
>BE STABLE, and you personally have said so to the IETF several times.
>We have displayed the algorithms to you several times, and we have
>also said the calculations in the IDNAbis document will be based on
>base properties and not derived properties -- so that EVERYONE can
>easily calculate the derived value if they have the need for it. We
>have also agreed that things will NEVER move from ALWAYS or NEVER, and
>you have also been part of the discussion regarding Cyrillic and Latin
>(as those were said to be "known").
>
>You now come and revoke so many things you and other UTC people
>have stated in the IDNAbis discussion that I do not know how to
>continue the work.
>
>The rules will never be based on derived property values. People MUST
>be able to calculate the ALWAYS etc properties given the CURRENT
>Unicode distribution.
>
>The overall goal with IDNAbis is to be independent of Unicode version.
>This implies it MUST be possible for anyone that has "the
>distribution of Unicode" to compute the value of the derived property
>that tells the status in IDNAbis. An alternative would be to have the
>derived property "just" appearing as a table that no one but a closed
>group can compute (or where codepoints end up through an ad hoc
>mechanism).
>
>This does, though, imply that the base properties the algorithm is
>based upon are stable. At least stable in the cases where IDNAbis is to
>ensure stability. And by this I mean for example "codepoints that
>are in ALWAYS will never be removed from ALWAYS" etc.
>
>This in turn implies the calculations and the properties that lead to a
>codepoint ending up in ALWAYS or NEVER will never change in such a way
>that the calculations lead to a different result in the future.
>
>This is why Cyrillic, Latin etc were selected as pointers to
>codepoints that are believed to be stable so that we dare(!) to put
>codepoints in those scripts in the ALWAYS and NEVER categories. For
>other scripts we see changes between the versions of Unicode still.
>Changes that are large enough so they _CAN_ have implications on the
>domain names that are already stored in the DNS database in the form
>of registered domain names.
>
>The reason the IETF requires stability, as we have explained before, is
>that if a is registered as a domain name, a lookup for a should always
>give a match in the future. One must be able to use the domain name
>one has registered for all time in the future. This is what IDNAbis
>is concentrating on. Ensuring that if a is in ALWAYS and registered in
>DNS, it should stay there.
>
>If we then include that also b should be stable because f(b)=a (case
>folding etc) then we have a much larger problem. How can we ensure
>that b will continue to have the properties needed, and how can we
>ensure that the function f(x) is stable by itself?
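To make the f(b)=a concern concrete, here is a small sketch in which Python's `str.casefold` stands in for f. That is an assumption made for illustration: the mapping IDNA actually needs involves more than case folding, and `lookup_key` is a hypothetical helper name. The folding data comes from whatever Unicode version the interpreter ships with, which is the dependency being questioned.

```python
import unicodedata


def lookup_key(label):
    """Stand-in for f(x): map a label to the form that is actually stored."""
    return label.casefold()


# "a" is what was registered; "b" is what a user types later.
registered = {lookup_key("example")}

# The lookup for b succeeds only as long as f(b) keeps yielding a.
print(lookup_key("EXAMPLE") in registered)

# The result of f depends on the Unicode data bundled with the runtime.
print("Unicode data version:", unicodedata.unidata_version)
```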
>
>I have heard you say many times when we get this far in the discussion
>"but that is no problem". You even say below that MAYBE YES should be
>removed, as things very easily can be added to the ALWAYS category.
>
>But that is not a statement I agree with, and let me explain why. I
>have two points here to make:
>
>(1) There is currently a suggestion on the Unicore mailing list to
>move a codepoint from script Cyrillic to Inherited. This (if we had
>taken Inherited into account in the tables document) would move
>the codepoint from ALWAYS to CONTEXT according to my preliminary
>thinking. But that is not the point. The point is that suddenly I am,
>and many people should be, very very very afraid of including Cyrillic
>script in the list of codepoints that are stable enough to have things
>in the ALWAYS category. Removing Cyrillic from there has implications
>for the ability to register codepoints using Cyrillic as IDN domain
>names, and I am pretty sure that change will be discussed at the next
>meeting of the Internet Governance Forum. Russia has, as I hope you
>know, very strong feelings regarding use of Cyrillic "on the Internet".
>
>That a discussion even exists to change any properties regarding a
>codepoint that is part of the Cyrillic script surprises me given the
>statements you have made regarding stability.
>
>(2) You have in mail to me said that properties are not at all stable.
>This is for me something that is completely orthogonal to statements
>similar to "it is easy for people knowing scripts to add more things
>to ALWAYS". You have further explained that stability is ensured by
>defining a new derived property in the following way:
>
>Say codepoint a has property x. As x is not a stable property (as no
>properties are stable), one has a derived property is_or_has_been_x
>that all codepoints that either have or have had property x have.
>This implies the codepoint a might no longer have property x, but will
>have property is_or_has_been_x. If we now base the IDNAbis tables on
>this derived property, three things happen:
>
>(a) It is impossible for people outside the Unicode Consortium to
>calculate the tables, as one cannot know what codepoints have (since
>version N of Unicode) had the property value x, and because of that it
>is impossible to know what codepoints have property value
>is_or_has_been_x. I.e. only people with inside information on Unicode
>Consortium issues can make the calculations resulting in (various
>degrees of) stability.
>
>(b) If algorithms like IDNAbis have to have stability, people have to
>base algorithms, sorting etc on is_or_has_been_x and not x, and then
>the change of codepoint a to remove x from it has no value in reality.
>There must be a reason why x was removed from a. But if
>is_or_has_been_x is what is used, that change is just void. So why
>change? What will interoperability be between applications using x
>and ones using is_or_has_been_x?
>
>This implies people will use the first property value ever assigned to
>the codepoint, and that changes are not interesting at all. The real
>property values will diverge from the derived ones, but the derived
>ones are still the most important ones for historical data.
>
>This to me implies that changing property values is completely useless,
>apart from making this a real mess.
>
>(c) All of these claims that something is stable but not stable lead
>me to the conclusion that the IDNAbis property cannot be calculated
>from the properties the Unicode Consortium has. Instead it has to be
>based on derived properties like is_or_has_been_x, or rather,
>codepoints have to be hand picked to ensure stability.
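Points (a) to (c) turn on the fact that is_or_has_been_x needs the whole version history as input. A minimal sketch, assuming each Unicode version is available as a simple mapping from code point to property value (the tables below are invented, and `is_or_has_been` is a hypothetical helper):

```python
def is_or_has_been(history, cp, value):
    """True if `cp` has, or has had, property `value` in any recorded version."""
    return any(table.get(cp) == value for table in history)


# Hypothetical property tables for versions N and N+1.
history = [
    {0x0001: "x", 0x0002: "y"},  # version N
    {0x0001: "y", 0x0002: "y"},  # version N+1: code point 1 no longer has x
]
current_only = history[-1:]

print(is_or_has_been(history, 0x0001, "x"))       # computed from the full history
print(is_or_has_been(current_only, 0x0001, "x"))  # computed from the current version alone
```

The two calls disagree, which is the difficulty described in (a): someone holding only the current distribution cannot reproduce the derived value.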
>
>And this opens up the question whether Unicode codepoints should be
>used at all. The IETF could as well use codepoints from ISO 10646, as
>the properties Unicode defines do not give any extra value, and then
>this discussion can concentrate on what to do with the codepoints. IANA
>would then hold a table of the properties based on ISO 10646.
>
>So, before moving forward with IDNAbis, it might be that IETF will
>need a statement from UTC what properties will be stable in the
>future, and for what codepoints. Only that data is something the
>algorithms in the table document can be based upon.
>
>I guess because of this the ball is again on your (as in unicode
>consortium) side of the ballpark.
>
>In the meantime, I work on the table document and the good comments,
>including the ones from you Mark.
>
>    Patrik
>
>On 14 dec 2007, at 04.45, Mark Davis wrote:
>
>>http://www.ietf.org/internet-drafts/draft-faltstrom-idnabis-tables-03.txt
>>Overall
>>Comments:
>>
>>
>>Tables-1.
>>
>>There is no operational difference between MAYBE YES and MAYBE NO,
>>and no characters that are in the latter. This distinction is really
>>only meaningful as internal tracking information inside whatever group
>>controls the future allocation of characters and should not appear
>>here. (See also Ken's email and trail under "Table issues (was: Re:
>>IDNAbis documents)".)
>>
>>Even further, MAYBE YES should not exist at all: a day or two of work
>>by script experts would be enough to move the vast majority of the
>>current 'MAYBE YES' to the ALWAYS category.
>>
>>Tables-2.
>>
>>There is a preference for Latin, Greek, Cyrillic, and Han which has no
>>principled basis. In particular, Latin, Cyrillic, and Han are some of
>>the most complicated scripts: Latin and Cyrillic, since they are used
>>to write a huge number of languages with a large number of variant
>>characters, and Han because of the history of character variations.
>>Many, many scripts are less problematic than Latin or Cyrillic, and
>>there is no reason to favor Cyrillic over say Armenian; it also gives
>>the appearance of Eurocentrism where none is intended.
>>
>>
>>From an old email:
>>
>>"No reason is given for the focus on only European scripts; and that
>>focus will surely raise suspicions in many circles. While I'm sure
>>that the restriction to European languages is just because those are
>>the ones the small group of authors is familiar with, it will not be
>>received well. If "we the community" have "experienced that a number
>>of scripts have issues that are not resolved", then those problems
>>should be enumerated *explicitly*, not hidden away.
>>
>>The situation might be different if we were starting from zero; but we
>>are not. We already have an IDNA system that works for a great many
>>people. And while there are security problems with it, those are well
>>known and vendors are dealing with them. Moreover, of the problems
>>that IDNAbis solves, they are just the easy ones -- the harder ones
>>are ones like the "paypal.com" case, which the current suggestion for
>>IDNAbis doesn't touch. So it feels like we are looking at a proposal
>>that:
>>
>>1. doesn't actually help much with the practical problems that people
>>face
>>2. solves the easy problems, but not the hard ones; so people have to
>>essentially do the work anyway
>>3. and removes much of the functionality, except for some favored
>>groups: Europe and the Americas"
>>
>>Tables-3.
>>
>>The CONTEXT class should be heavily restricted, as per Ken's email, to
>>only 2 characters (see "Table issues (Part 3: CONTEXT)" for details).
>>Moreover, the term Context is problematic: **many** characters are
>>disallowed or allowed, depending on context. Even a-z are disallowed
>>in a field that also contains RTL characters.
>>
>>Tables-4.
>>
>>The list of historic scripts is very outdated. See
>>http://www.unicode.org/reports/tr31/tr31-8.html#Specific_Character_Adjustments
>>for more details. The characters in Table 3 should also be reviewed as
>>possible exceptions.
>>
>>Tables-5.
>>
>>Key to the success of this is the group that determines the future
>>allocation of characters. It must be very clear precisely what the
>>grounds are for removing characters (moving from MAYBE to NEVER);
>>otherwise there will be never-ending battles over individual
>>characters. (Frankly, I believe that the correct course of action
>>would be to disallow the historic scripts for now, but allow the
>>characters in all other scripts, with very few exceptions.)
>>
>>Tables-6.
>>
>>Like draft-alvestrand-idna-bidi-01.txt
>><http://www.ietf.org/internet-drafts/draft-alvestrand-idna-bidi-01.txt>,
>>there should be at least one example motivating every case where a
>>class of characters is removed (this might be in one of the other
>>documents instead of here).
>>
>>Tables-7.
>>
>>The entire description of the process is far too complicated for what
>>is, at core, a relatively simple process. It is further obfuscated by
>>referring to classes of characters by a letter category instead of
>>mnemonics.
>>
>>Take the following from draft-faltstrom-idnabis-tables-03.txt
>><http://www.ietf.org/internet-drafts/draft-faltstrom-idnabis-tables-03.txt>:
>>
>>      *  If the codepoint does not appear in any of the categories B
>>         (Section 2.1.2), C (Section 2.1.3), D (Section 2.1.4), E
>>         (Section 2.1.5) or F (Section 2.1.6), the value is ALWAYS.
>>
>>That formulation is completely opaque. I'd strongly recommend that,
>>for transparency, you reformulate this considerably. You could
>>maintain part of the structure that you have, if you wanted, by
>>consistently using mnemonics instead of Sections.
>>
>>That is, give meaningful names to each Category in Section 2, such as:
>>
>>A => Language-Characters
>>B => Unnormalized
>>C => Ignorable
>>D => Historical-Scripts
>>E => Disallowed-Blocks
>>...
>>
>>The formulation can then be something like the following. (This is not
>>precisely equivalent to your formulation, which I found difficult to
>>follow -- it is the style of presentation that I'm focusing on.)
>>
>>Use the following procedure to determine the IDNA-Property of any code
>>point cp. Proceed through the rules, and return a value at the first
>>that applies.
>>
>>Exceptions
>>1a. If cp is in Exceptional-Always, return Always
>>1b. If cp is in Exceptional-Never, return Never
>>1c. If cp is in Exceptional-Maybe, return Maybe
>>
>>Functional Exclusions
>>2. Else if cp is in Unnormalized, return Never
>>3. Else if cp is in Not-Case-Folded, return Never
>>4. Else if cp is in Ignorable, return Never
>>
>>Usage Exclusions
>>5. Else if cp is in Historical-Scripts, return Never
>>6. Else if cp is in Disallowed-Blocks, return Never
>>
>>LMN Inclusion
>>7. Else if cp is in Language-Characters, return Maybe
>>
>>Exclude everything else
>>8. Else return Never
>>
>>Note: Exceptional-Always would contain your Category H Always
>>characters, plus grandfathered Always characters, plus a-z, 0-9, -;
>>Exceptional-Maybe would add the Category H Maybe characters, and so
>>on. The mechanism already described in email for providing perfect
>>stability would be to add characters, where necessary, to these
>>classes.
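The first-match procedure in rules 1a to 8 above can be sketched as an ordered table of (class, result) pairs. The rule order and the class names are taken from the quoted text; the memberships in `CLASSES` are placeholders chosen only for illustration (apart from a-z, 0-9 and "-", which the note assigns to Exceptional-Always) and are not derived from the draft.

```python
# Rules in the order given in the quoted procedure: (class name, result).
RULES = [
    ("Exceptional-Always", "Always"),   # 1a
    ("Exceptional-Never", "Never"),     # 1b
    ("Exceptional-Maybe", "Maybe"),     # 1c
    ("Unnormalized", "Never"),          # 2
    ("Not-Case-Folded", "Never"),       # 3
    ("Ignorable", "Never"),             # 4
    ("Historical-Scripts", "Never"),    # 5
    ("Disallowed-Blocks", "Never"),     # 6
    ("Language-Characters", "Maybe"),   # 7
]


def idna_property(cp, classes):
    """Return the result of the first rule whose class contains `cp`."""
    for name, result in RULES:
        if cp in classes.get(name, ()):
            return result
    return "Never"                      # 8: exclude everything else


# Placeholder memberships, for illustration only.
LOWER = set(range(ord("a"), ord("z") + 1))
DIGITS = set(range(ord("0"), ord("9") + 1))
CLASSES = {
    "Exceptional-Always": LOWER | DIGITS | {ord("-")},
    "Not-Case-Folded": set(range(ord("A"), ord("Z") + 1)),
    "Language-Characters": LOWER | {0x00E9, 0x0430},
}

for ch in "a", "A", "\u00e9", "\u2603":
    print(f"U+{ord(ch):04X}", idna_property(ord(ch), CLASSES))
```

Keeping the rules as data makes the ordering visible in one place, and grandfathering a character amounts to adding it to one of the Exceptional classes.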
>>
>>Details:
>>Tables-8.
>>
>>      a character is never removed from
>>      it unless it is removed from Unicode.
>>
>>This is not necessary. If you really have to have it, then add
>>"(however, the Unicode stability policies expressly forbid this)"
>>
>>
>>Tables-9.
>>
>>Re. Appendix A. There seem to be some errors in the generation of this
>>table. The code point range should be "0x0000 - 0x10FFFF".
>>
>>
>>Tables-10
>>
>>
>>The derivation of the table did not correctly distinguish *unassigned*
>>code points from *noncharacter* code points. Unassigned code points
>>are "<reserved>" and are available for future encoding of characters,
>>whereas noncharacter code points are *not* "<reserved (for future
>>assignment)>" -- they are designated functions, constitute a kind of
>>internal private use, and are disallowed for interchange. (See Table
>>2-3, TUS 5.0, p. 27.) If PUA code points (e.g. U+E000..U+F8FF) are to
>>be NEVER in this table, then the noncharacters must be NEVER, rather
>>than UNASSIGNED.
>>
>>Tables-10a
>>
>>
>>In general, having this Appendix A listing include UNASSIGNED code
>>points is both distracting (from the other, more meaningful values)
>>and an error-prone reduplication of effort. The listing of gc=Cn
>>values is already available directly from:
>>
>>http://www.unicode.org/Public/UNIDATA/extracted/DerivedGeneralCategory.txt
>>
>>And that file *does* make the distinction between true unassigned code
>>points and noncharacter code points (both of which are gc=Cn, but
>>which differ in the Noncharacter_Code_Point property [see
>>PropList.txt].) The derivation for the IDN inclusion table needs to
>>pay attention to *both* gc=Cn and Noncharacter_Code_Point=True. What
>>*would* make sense is for the Appendix listing to correctly identify
>>the noncharacters as NEVER. The fact that it doesn't suggests that
>>there is an error in the way the calculation is handling Category D.
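The distinction drawn above can be sketched with Python's standard `unicodedata` module, which reports gc=Cn for unassigned code points and noncharacters alike. The noncharacter test uses the fixed set of 66 noncharacters (U+FDD0..U+FDEF plus the last two code points of each plane). The `classify` helper and its labels are hypothetical and are not the draft's algorithm; which code points count as unassigned depends on the Unicode version bundled with the interpreter.

```python
import unicodedata


def is_noncharacter(cp):
    """True for the 66 noncharacters: U+FDD0..U+FDEF and U+nFFFE/U+nFFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def classify(cp):
    """Separate noncharacters from truly unassigned code points."""
    if is_noncharacter(cp):
        return "NEVER"        # designated, never available for assignment
    if unicodedata.category(chr(cp)) == "Cn":
        return "UNASSIGNED"   # reserved for future encoding
    return "ASSIGNED"


# Both kinds share gc=Cn, so the general category alone cannot tell them apart.
print(unicodedata.category(chr(0xFFFE)))
print(classify(0xFFFE), classify(0xFDD0), classify(0x0061))
```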
>>
>>
>>Tables-11
>>
>>
>>Another general issue with the document, table, and Section 3,
>>Calculation of the Derived Property: The possible values of the IDN
>>property still include a value MAYBE NOT, but in fact the calculation
>>has no branch now that assigns a MAYBE NOT value, and the table
>>contains no MAYBE NOT characters. Either the thinking about "MAYBE
>>NOT" has changed, and the document hasn't caught up to that yet, or
>>there is an error in how the calculation has been set up. As it is
>>now, nearly all of the "MAYBE NOT" values from the 01 version of this
>>ID are now listed in the Appendix as "NEVER". As "NEVER", they would
>>be prohibited from any future consideration for IDN, which seems at
>>odds with the tenor of the text describing "MAYBE NOT".
>>
>>Tables-12
>>
>>
>>Section 4. Codepoints states:
>>
>>"The Categories and Rules defined in Section 2 and Section 3 apply to
>>all assigned Unicode characters." In fact they apply to *unassigned*
>>code points as well.
>>
>>The correct formulation would be:
>>
>>"The Categories and Rules defined in Section 2 and Section 3 apply to
>>all Unicode codepoints, assigned or unassigned."
>>
>>[Note: the Unicode Standard systematically uses a space in the term
>>"code point", as well as for "code unit", "code position", "code
>>value", etc. But given that this document uses "codepoint" everywhere,
>>I'm not suggesting that be changed. Nobody is going to be confused as
>>to what the word means.]
>>
>>
>>Tables-13
>>
>>"Once assigned to this category, a character is never removed from it
>>unless it is removed from Unicode."
>>
>>The qualification "unless it is removed from Unicode" is vacuous.
>>Since Unicode 1.1, no character has ever been removed from Unicode,
>>nor will any be -- in part because no character will ever be removed
>>from ISO/IEC 10646.
>>
>>So this is a little like qualifying the definition of ASCII LDH as
>>"{0061..007A, 0030..0039, 002D} and no characters will be removed from
>>this definition unless they are removed from ASCII."
>>
>>So I suggest just removing the vacuous qualification.
>>
>>
>>Tables-14
>>
>>
>>The grandfathering technique needs to be used so as to preserve
>>stability, since characters may change script. (See the email trail
>>under "Table issues (Part 2)" for details.)
>>_______________________________________________
>>Idna-update mailing list
>>Idna-update at alvestrand.no
>>http://www.alvestrand.no/mailman/listinfo/idna-update
>


