Table Derivation

Sat Dec 22 01:26:56 CET 2007

Dear Kenneth,
All this expertise, deployed to answer the minimum set of requirements
established by Patrik Fältström and to support something which should
be the most straightforward thing (writing a few characters to form a
domain name), is obviously impressive.

However, at the same time it is frightening. How could anyone decide
to entrust (this is what the name system is about) the world economy,
people's lives, nations' defense, cultural empowerment, etc. to such
a complex, computer-foreign thing? I doubt the IETF has the expertise
available to check this document, so any RFC proposal including
these propositions will not survive the IETF LC razor (cf. RFC 3935).
And if it did, how would ICANN, ccTLDs, Governments, etc. be
convinced, and more motivated than they were for IPv6?

Again, I see only two solutions:
- either Unicode Members are able to consensually understand and
commit that you are 100% right and 100% comprehensive, meaning they
would entrust the lives of their kids to this document (this is what
you ask of the non-English-speaking people who would use the
deliverable). Then IDNA must be a proposition published, promoted,
supported, and maintained by the Unicode Consortium.
- or another approach must be found. Most probably starting with a 
grapheme grid and a network presentation layer.

I am sorry.
jfc

At 22:51 21/12/2007, Kenneth Whistler wrote:
>Patrik,
>
>In an effort to try to make the detailed technical feedback
>on the table derivation in draft-faltstrom-idnabis-tables-03.txt
>as straightforward as possible, while still reflecting what
>we think are necessary corrections to the various categories
>and derivation rules, we have written out a category-by-category
>listing, with updates, following the lettering and order in
>the draft, followed by a suggested simplification of the
>required derivation rules.
>
>First, to guide the discussion, we will try to capture what we
>think is the *intent* of the main property values proposed
>in the draft: ALWAYS, NEVER, MAYBE, CONTEXT, and UNASSIGNED.
>
>NEVER consists of those characters that we want to categorically
>rule out for IDNs, and should include:
>    * characters that are neither letters, marks, nor digits
>    * characters unstable under NFKC normalization
>    * characters unstable under full case folding
>    * default-ignorable characters (including control
>        characters, noncharacters, variation selectors, etc.)
>    * private-use characters
>    * a short list of additional blocks not appropriate
>        for IDNs
>
>ALWAYS consists of those characters that we want to categorically
>guarantee are available for IDNs (at the protocol level, although
>there could always be additional restrictions at other levels),
>and should include:
>     * letters, digits, and combining marks
>     * in particular, ASCII LDH
>     * a small number of exceptional punctuation characters,
>         for various reasons, such as MIDDLE DOT
>but should exclude:
>     * anything in the NEVER category
>     * a number of historic scripts for which there is no good
>         argument currently to require them for IDNs
>
>CONTEXT consists of those characters that are required for IDNs
>*only* because of certain contextual rules, and which are not
>otherwise already specified to be ALWAYS. These include only:
>     * join control characters (U+200C ZWNJ, U+200D ZWJ)
>
>UNASSIGNED consists of all Unicode code points not assigned
>(in any particular version).
>
>MAYBE consists of all other Unicode code points not determined
>to be ALWAYS or NEVER or CONTEXT or UNASSIGNED, and in particular,
>includes:
>     * other assigned characters, including the historic scripts
>
>If we can get general consensus that this is what we are trying
>to accomplish with these values defined by the table for IDNA,
>then it is possible to reexamine the proposed specific categories
>used in the derivation of the table.
>
>So, moving on to those, in order:
>
>*************************************************************
>
>2.1.1 Category A - Classes of Codepoints
>
>   A: generalCategory(cp) is in {Ll, Lu, Lo, Nd, Lm, Mn, Mc}
>
>   Let's give this category an actually meaningful and
>   mnemonic label:
>
>   Letters_Digits_Marks
>
>   And in Unicode regex notation, using formal property
>   values, this can be expressed as:
>
>   [[:L:][:Nd:][:Mn:][:Mc:]]
>
>   Note: "[:L:]" actually includes Ll, Lu, Lo, Lm, *and* Lt.
>   As it turns out, all of the Lt and almost all of the Lu
>   end up in NEVER because of the effect of full case folding,
>   but it is simplest and most comprehensible to simply
>   define Letters_Digits_Marks per se as including *all* the
>   letters to start with.
>
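As an illustration of the Letters_Digits_Marks definition quoted above, here is a minimal sketch using Python's standard `unicodedata` module. The function name `is_letters_digits_marks` is an assumption made for this example, and the results reflect the Unicode version bundled with the interpreter, not necessarily the version under discussion.

```python
import unicodedata


def is_letters_digits_marks(ch: str) -> bool:
    """True if ch falls in [[:L:][:Nd:][:Mn:][:Mc:]] (hypothetical helper)."""
    gc = unicodedata.category(ch)
    # Any general category starting with "L" covers Ll, Lu, Lo, Lm and Lt.
    return gc.startswith("L") or gc in {"Nd", "Mn", "Mc"}


for sample in ["a", "\u01C5", "5", "\u0301", "-"]:
    print(f"U+{ord(sample):04X}", unicodedata.category(sample),
          is_letters_digits_marks(sample))
```

U+01C5 is a titlecase letter (Lt), so it is included by `[:L:]` even though the draft's original list of categories omits Lt.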
>*************************************************************
>
>2.1.2 Category B - Normalization
>
>   B: NFKC(cp) != cp
>
>   As before, a meaningful and mnemonic label:
>
>   Unstable_Under_NFKC
>
>   Now, as stated, this is a condition that requires a
>   conformant Unicode normalization API to check each
>   code point and compute the result of the expression.
>   However, it turns out that for this particular relation,
>   there is a precomputed value stored in the Unicode
>   Character Database files for each version of the
>   standard. That is the derived NFKC_QuickCheck property,
>   listed in DerivedNormalizationProps.txt in the UCD.
>   And in Unicode regex notation, using formal property
>   values, the condition we need can be expressed as:
>
>   [:NFKC_QuickCheck=NO:]
>
>   or
>
>   [:NFKC_QC=N:]
>
>   What this means is that the IDNA table derivation, for any
>   particular version of the Unicode Standard, doesn't actually
>   have to be implemented in code that depends on a conformant
>   normalization API. It can simply get the results as
>   a list from the data file, DerivedNormalizationProps.txt.
>
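For comparison, the draft's condition B, NFKC(cp) != cp, can be evaluated directly with a normalization API. The sketch below uses Python's standard `unicodedata.normalize`; the helper name `unstable_under_nfkc` is an assumption for this example, and the answers depend on the Unicode version bundled with the interpreter (`unicodedata.unidata_version`).

```python
import unicodedata


def unstable_under_nfkc(ch: str) -> bool:
    """True if NFKC normalization changes the single character ch."""
    return unicodedata.normalize("NFKC", ch) != ch


print("Unicode data version:", unicodedata.unidata_version)
for sample in ["a", "\u00E9", "\u00AA", "\uFB01"]:
    print(f"U+{ord(sample):04X}", unstable_under_nfkc(sample))
```

A table generator that reads the precomputed list from DerivedNormalizationProps.txt avoids this dependency on the interpreter's Unicode version.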
>*************************************************************
>
>2.1.3. Category C - Casefolding
>
>   C: casefold(cp) != cp
>
>   A meaningful and mnemonic label:
>
>   Unstable_Under_Full_Case_Folding
>
>   Computing this category also implies a Unicode API to check
>   the case folding status of each character to calculate
>   the result of the expression. However, there is a
>   simple way to reliably get the correct result. If the
>   character is listed in the UCD data file CaseFolding.txt,
>   then it is unstable under full case folding -- precisely
>   because it has an entry in that file specifying its
>   folding (to lowercase) to something other than itself.
>   So computing the exact value of this category,
>   Unstable_Under_Full_Case_Folding is simply a matter of
>   extracting the list of code points in CaseFolding.txt.
>
>   Unicode regex defines POSIX terminology for this as well,
>   and for an implementation that supports Unicode regex,
>   this is equivalent to the expression:
>
>   [:^isCaseFolded:]
>
>   ^isCaseFolded means the class of characters that
>   are not already case folded, because if you case fold
>   them, they turn into some other character (or sequence),
>   i.e., it is equivalent to the class of characters that
>   are Unstable_Under_Full_Case_Folding.
>
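The condition C, casefold(cp) != cp, can be sketched with Python's built-in `str.casefold`, which performs full case folding (for example, it maps "ß" to "ss"). The helper name is an assumption for this example, and the data comes from the interpreter's bundled Unicode version rather than from a specific CaseFolding.txt.

```python
def unstable_under_full_case_folding(ch: str) -> bool:
    """True if full case folding changes the single character ch."""
    return ch.casefold() != ch


for sample in ["a", "A", "\u00DF", "\u0345"]:
    print(f"U+{ord(sample):04X}", unstable_under_full_case_folding(sample))
```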
>*************************************************************
>
>2.1.4. Category D - Ignorables
>
>   D: property(cp) is in {Other_Default_Ignorable_Code_Point,
>                          Noncharacter_Code_Point}
>
>   A meaningful and mnemonic label:
>
>   Ignorables
>
>   The definition of Ignorables in 2.1.4 currently is
>   problematical in a number of ways, because of the
>   ignorable characters it leaves out, but which really
>   need to end up in the NEVER class. We believe the
>   correct specification, expressed in terms of Unicode
>   regex, should be, simply:
>
>   [:Default_Ignorable_Code_Point:]
>
>   The reason to use Default_Ignorable_Code_Point, rather
>   than Other_Default_Ignorable_Code_Point, is that the
>   latter is merely a contributory property in the UCD --
>   its only function is to contribute to the stability
>   of the derivation of Default_Ignorable_Code_Point,
>   which itself is the relevant property. Furthermore,
>   Default_Ignorable_Code_Point automatically picks up
>   control codes, noncharacters, variation selectors,
>   and surrogate code points, which also need to be
>   treated as Ignorables for the purposes of the IDN
>   table derivation, since they need to end up as NEVER.
>   It is simpler to simply get them all covered via
>   Default_Ignorable_Code_Point, rather than to have to
>   define more categories separately for them.
>
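Python's standard library does not expose Default_Ignorable_Code_Point, so one way to obtain the set is to read it from a UCD property file. The sketch below parses text in the usual UCD layout ("code point or range ; property # comment"). The function name `parse_property_file` is hypothetical, and the sample lines are illustrative, written in the file's format rather than copied from a particular release.

```python
def parse_property_file(text: str, wanted: str) -> set:
    """Collect the code points listed with property `wanted` in UCD-style text."""
    result = set()
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue  # blank or comment-only line
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != wanted:
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        result.update(range(start, end + 1))
    return result


# Illustrative input only; a real run would read the UCD data file instead.
SAMPLE = """\
# sample lines in UCD property-file format
00AD          ; Default_Ignorable_Code_Point # SOFT HYPHEN
200C..200D    ; Default_Ignorable_Code_Point # ZWNJ, ZWJ
0041..005A    ; Alphabetic # A..Z
"""

ignorables = parse_property_file(SAMPLE, "Default_Ignorable_Code_Point")
print(sorted(f"U+{cp:04X}" for cp in ignorables))
```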
>*************************************************************
>
>2.1.5. Category E - Historical Scripts
>
>   E: script(cp) in {Cuneiform, Ugaritic, Old_Persian, Gothic,
>                     Old_Italic, Cypriot, Linear_B, Phoenician,
>                     Kharoshthi, Phags_Pa, Glagolitic, Shavian,
>                     Deseret, Osmanya, Ogham}
>
>   A meaningful and mnemonic label:
>
>   Historic_Scripts
>
>   We think this category definition is fine in general, but
>   needs to be extended to cover the additional scripts
>   from Unicode 5.1, as well as a few more of the scripts
>   already encoded as of Unicode 5.0 that are essentially
>   historical (although not ancient) and which have very,
>   very limited current use.
>
>   The following shows the best list we've been able to get
>   consensus on to date, expressed in Unicode Regex. We've
>   interspersed comments and subcategorized, to help
>   clarify the reasons for each in the list.
>
>   /* Really, really long-dead scripts */
>
>   [:script=Cari:] Carian
>   [:script=Cprt:] Cypriot
>   [:script=Glag:] Glagolitic
>   [:script=Goth:] Gothic
>   [:script=Khar:] Kharoshthi
>   [:script=Ital:] Old Italic
>   [:script=Linb:] Linear-B
>   [:script=Lyci:] Lycian
>   [:script=Lydi:] Lydian
>   [:script=Phag:] Phags-pa
>   [:script=Phnx:] Phoenician
>   [:script=Ugar:] Ugaritic
>   [:script=Xpeo:] Old Persian
>   [:script=Xsux:] Sumero-Akkadian Cuneiform
>
>   /* Dead, with minor current use, but not appropriate for IDNs */
>
>   [:script=Ogam:] Ogham
>   [:script=Runr:] Runic
>
>   /* Recently created, but no significant use */
>
>   [:script=Dsrt:] Deseret
>   [:script=Osma:] Osmanya
>   [:script=Shaw:] Shavian
>
>   /* Historic, with current use primarily as liturgical scripts */
>
>   [:script=Copt:] Coptic
>   [:script=Syrc:] Syriac
>
>   /* Historic minority scripts with small communities, little use */
>
>   [:script=Bugi:] Buginese
>   [:script=Buhd:] Buhid
>   [:script=Hano:] Hanunoo
>   [:script=Rjng:] Rejang
>   [:script=Sund:] Sundanese
>   [:script=Sylo:] Syloti Nagri
>   [:script=Tagb:] Tagbanwa
>   [:script=Tglg:] Tagalog
>
>   Keep in mind that if Historic_Scripts as a category is used
>   in the table derivation to result in MAYBE, rather than NEVER
>   status, then it is appropriate to expand the list to include
>   rather more of the lesser used scripts, because this decision
>   doesn't amount to an irrevocable prevention of all possible
>   future use in IDNs for the script.
>
>*************************************************************
>
>2.1.6. Category F - Blocks of Characters
>
>   F: block(cp) in {Combining_Diacritical_Marks_for_Symbols,
>                    Musical_Symbols, Ancient_Greek_Musical_Notation}
>
>   A meaningful and mnemonic label:
>
>   Inappropriate_Blocks
>
>   This definition seems fine, except that further investigation
>   has turned up a reason for including one more recently
>   encoded block that contains a combining mark, the Phaistos_Disc
>   block. So expressed in Unicode regex, this would be:
>
>   [[:block=Combining_Diacritical_Marks_for_Symbols:]
>    [:block=Musical_Symbols:]
>    [:block=Ancient_Greek_Musical_Notation:]
>    [:block=Phaistos_Disc:]]
>
>   Note that the only reason for needing this explicit list
>   of blocks is because they contain some combining marks
>   that otherwise would end up in the ALWAYS class. An
>   equivalent way of handling those would be to simply list
>   those ranges of combining marks explicitly for
>   the Exceptional_NEVER_List (see below).
>
>*************************************************************
>
>2.2.1. Category G - ASCII LDH
>
>   G: cp is in {0061..007A, 0030..0039, 002D}
>
>   A meaningful and mnemonic label:
>
>   ASCII_LDH
>
>   Expressed in Unicode regex:
>
>   [\u0061-\u007A\u0030-\u0039\u002D]
>
>   Note that the only technical reason to define this as a
>   category for the table derivation is to get
>   U+002D HYPHEN-MINUS in the ALWAYS class, since the
>   letters and digits are already covered by the
>   handling of Letters_Digits_Marks. U+002D could just
>   as simply be included in the Exceptional_ALWAYS_List
>   below instead, with the same effect for the table
>   derivation.
>
>*************************************************************
>
>2.2.2. Category H - Exceptions
>
>   H: cp in {00B7, 05F3, 05F4, [3005,] 3007, 303B, 30FB}
>
>   The draft currently provides a subtable specifying
>   which of these end up with the value ALWAYS and
>   which MAYBE (YES). We think it is much more straightforward
>   to simply define this as an exceptional inclusion list
>   for the ALWAYS class.
>
>   If consensus among the idna-update group is that some of
>   these don't actually belong in ALWAYS, then we can simply
>   remove them from this list before the specification is
>   final, and they will end up in NEVER by their general
>   category as punctuation.
>
>   A meaningful and mnemonic label:
>
>   Exceptional_ALWAYS_List
>
>   Expressed in Unicode regex:
>
>   [\u00B7\u05F3\u05F4\u3007\u30FB]
>
>   Note that U+3005 and U+303B don't need to be included
>   specifically in this list, as their general category Lm
>   already results in the correct derivation for them.
>
>   Inclusion of U+3007 IDEOGRAPHIC NUMBER ZERO in ALWAYS
>   seems uncontroversial, so that really only leaves
>   the two middle dots and the Hebrew geresh/gershayim
>   to come to consensus about.
>
>*************************************************************
>
>2.2.3 Category I - CJK Subsetting
>
>   I: script(cp) is in {Han}
>
>   This category's only use in the draft is in the 3.1.2 step
>   in the derivation, but we believe it is not actually
>   required for the derivation, and should be omitted.
>
>   Instead, what would be useful is an exception category
>   comparable to Category H, but focussed on any required
>   exceptions for the NEVER class, so:
>
>2.2.3 Category I - Exceptions (2)
>
>   A meaningful and mnemonic label:
>
>   Exceptional_NEVER_List
>
>   Currently, this list has no elements. Its purpose would
>   be to stand as a placeholder in the table derivation,
>   for the potential (but unlikely) situation in the
>   future when an explicit exception could be needed
>   in order to keep the NEVER class backwards compatible.
>
>   Together, the Exceptional_ALWAYS_List and the
>   Exceptional_NEVER_List provide a mechanism for
>   keeping both the statement of the table derivation
>   and the class values ALWAYS and NEVER completely
>   backwards compatible.
>
>*************************************************************
>
>2.2.4 Category J - Character Groups Requiring Special Treatment
>
>   J: generalCategory(cp) is in {Cf}
>
>   The rationale for this category is to derive the CONTEXT
>   value in the table, but it is actually incorrect as stated.
>   Almost all of the Cf (format control) characters should
>   actually be Ignorables, resulting in a NEVER value for
>   them in the table. The actual membership of the class
>   of characters that should get the CONTEXT value consists
>   of just the two joiner characters.
>
>   So a meaningful and mnemonic label:
>
>   Join_Controls
>
>   And expressed as Unicode regex:
>
>   [:Join_Control:]
>
>*************************************************************
>
>2.2.5 Category K - Unassigned codepoints
>
>   K: cp is unassigned
>
>   A meaningful and mnemonic label:
>
>   Unassigned
>
>   And expressed as Unicode regex:
>
>   [:Cn:]-[:Noncharacter_Code_Point:]
>
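The Unassigned set, general category Cn minus the noncharacters, can be approximated with the standard `unicodedata` module. This is a sketch under two assumptions: the helper names are invented for this example, and "unassigned" is relative to the Unicode version bundled with the interpreter. The noncharacters are U+FDD0..U+FDEF plus the last two code points of each plane.

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_unassigned(cp: int) -> bool:
    """Cn code points that are not noncharacters."""
    return unicodedata.category(chr(cp)) == "Cn" and not is_noncharacter(cp)


for cp in [0x0041, 0xFFFE, 0xFDD0, 0x50000]:
    print(f"U+{cp:04X}", unicodedata.category(chr(cp)), is_unassigned(cp))
```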
>*************************************************************
>
>[Proposed new section]
>
>2.2.6 Category L - Modifier Symbols
>
>   These are characters used in some transliteration methods
>   and linguistic notations. They are similar to modifier letters,
>   but are not currently known to be needed by any customary modern
>   orthographies. However, in some cases these characters come to be
>   used as part of the normal orthography for a language, and thus
>   may change to modifier letters. To allow for this possibility,
>   they need to be called out separately in the table derivation,
>   so they can end up in the MAYBE class, instead of NEVER.
>
>   A meaningful and mnemonic label:
>
>   Modifier_Symbols
>
>   And expressed as Unicode regex:
>
>   [:Sk:]
>
>*************************************************************
>
>Now turning to the derivation of the values for the table
>itself, rather than having a stepwise algorithm, it is much
>simpler (and much easier to evaluate) if all of the above
>mnemonic labels are taken as Unicode set names. Then all
>of the required values are simply the result of simple
>set addition and subtraction operations.
>
>In particular, the table can be derived as follows:
>
>UNASSIGNED = Unassigned - Ignorables
>
>CONTEXT = Join_Controls
>
>NEVER   = ALL
>         - Unassigned
>         - Letters_Digits_Marks
>         - Modifier_Symbols
>         + Unstable_Under_NFKC
>         + Unstable_Under_Full_Case_Folding
>         + Ignorables
>         + Inappropriate_Blocks
>         + Exceptional_NEVER_List
>         - Exceptional_ALWAYS_List
>         - ASCII_LDH
>         - CONTEXT
>
>ALWAYS  = Letters_Digits_Marks
>         - NEVER
>         - Historic_Scripts
>         + Exceptional_ALWAYS_List
>         + ASCII_LDH
>         + CONTEXT
>
>MAYBE   = ALL - UNASSIGNED - NEVER - ALWAYS
>
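The set formulas above translate almost line for line into code. The following is a minimal sketch using Python sets, evaluating each formula strictly left to right. The function name `derive_table` and the toy input are assumptions for illustration: the memberships in `toy` are assigned by hand for a ten-code-point universe and are not computed from the Unicode Character Database.

```python
def derive_table(s: dict) -> dict:
    """Apply the set formulas, each evaluated strictly left to right."""
    unassigned = s["Unassigned"] - s["Ignorables"]
    context = set(s["Join_Controls"])

    never = (s["ALL"] - s["Unassigned"] - s["Letters_Digits_Marks"]
             - s["Modifier_Symbols"])
    never = (never | s["Unstable_Under_NFKC"]
             | s["Unstable_Under_Full_Case_Folding"] | s["Ignorables"]
             | s["Inappropriate_Blocks"] | s["Exceptional_NEVER_List"])
    never = never - s["Exceptional_ALWAYS_List"] - s["ASCII_LDH"] - context

    always = s["Letters_Digits_Marks"] - never - s["Historic_Scripts"]
    always = always | s["Exceptional_ALWAYS_List"] | s["ASCII_LDH"] | context

    maybe = s["ALL"] - unassigned - never - always
    return {"UNASSIGNED": unassigned, "CONTEXT": context,
            "NEVER": never, "ALWAYS": always, "MAYBE": maybe}


# Hand-assigned toy memberships (hypothetical, for illustration only).
toy = {
    "ALL": {0x21, 0x2D, 0x41, 0x5E, 0x61, 0xB7, 0x378, 0x200D, 0xFFFF, 0x10330},
    "Unassigned": {0x378},
    "Letters_Digits_Marks": {0x41, 0x61, 0x10330},
    "Modifier_Symbols": {0x5E},
    "Unstable_Under_NFKC": set(),
    "Unstable_Under_Full_Case_Folding": {0x41},
    "Ignorables": {0x200D, 0xFFFF},
    "Inappropriate_Blocks": set(),
    "Exceptional_NEVER_List": set(),
    "Exceptional_ALWAYS_List": {0xB7},
    "ASCII_LDH": {0x2D, 0x61},
    "Join_Controls": {0x200D},
    "Historic_Scripts": {0x10330},
}

table = derive_table(toy)
for name, members in table.items():
    print(name, sorted(f"U+{cp:04X}" for cp in members))
```

In this toy input the uppercase letter lands in NEVER through case folding, the Gothic letter and the modifier symbol land in MAYBE, and the join control ends up in both CONTEXT and ALWAYS, as the formulas specify.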
>Notice how closely this formal statement parallels the
>informal statement of the intent of the various classes
>at the start of our message. We believe this is a
>Good Thing (tm), because it makes the formal statement
>of the table derivation easy to understand and to
>verify.
>
>If you derive this way, all the historic scripts end up
>in MAYBE (less whatever NEVER characters they contain,
>such as uppercase letters, characters unstable under NFKC,
>etc.).
>
>Also, the result of the table derivation is that
>Latin, Greek, Cyrillic, Han, plus all the other major-use
>current official scripts of the world end up in ALWAYS
>(again less whatever NEVER characters they contain).
>
>As for the NEVER class itself, we note that defined this
>way, it matches very closely the list currently derived
>in draft-faltstrom-idnabis-tables-03.txt. However, in
>examining the differences in detail, there are several
>errors apparent in the tables-03.txt NEVER assignment.
>In particular:
>
>   1. 01D6, 01D8, 01DA, 01DC, 01DF, 01E1, 01FB, 022B,
>      022D, and 0231 are all listed as NEVER, but should
>      not be. These are all precomposed letters with
>      two accents, and there appears to be an error
>      in the way NFKC(cp) was calculated for tables-03.txt.
>
>   2. 0345 and 037A are not listed as NEVER, but should be.
>      0345 is unstable under full case folding, and
>      NFKC(037A) != 037A -- so the latter also suggests
>      an error in the calculations of NFKC(cp).
>
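The two points above can be spot-checked with Python's standard library. This is a sketch, not the procedure used by either document: the helper names are assumptions, and the interpreter's bundled Unicode data is newer than the version discussed here, although normalization and case folding of already-assigned characters are expected to be stable across versions.

```python
import unicodedata


def nfkc_stable(cp: int) -> bool:
    """True if NFKC leaves the code point unchanged."""
    return unicodedata.normalize("NFKC", chr(cp)) == chr(cp)


def casefold_stable(cp: int) -> bool:
    """True if full case folding leaves the code point unchanged."""
    return chr(cp).casefold() == chr(cp)


# Precomposed letters with two accents, from item 1 above.
DOUBLE_ACCENT = [0x01D6, 0x01D8, 0x01DA, 0x01DC, 0x01DF,
                 0x01E1, 0x01FB, 0x022B, 0x022D, 0x0231]

for cp in DOUBLE_ACCENT + [0x0345, 0x037A]:
    print(f"U+{cp:04X} nfkc_stable={nfkc_stable(cp)} "
          f"casefold_stable={casefold_stable(cp)}")
```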
>As long as the exception lists are carefully managed
>(and ideally, never get changed at all), the four
>classes, UNASSIGNED, NEVER, ALWAYS, and MAYBE
>create a partition of all Unicode code points. Note
>that the order of the set additions and set subtractions
>is significant in the above definitions, since set
>subtraction is not commutative.
>
>--Ken Whistler & Mark Davis

>At 16:10 17/12/2007, Paul Hoffman wrote:
>That's quite a long list of requirements on the Unicode Consortium. 
>Is there some agreement from them, even if informal, to meet each of
>those requirements?

At 09:19 17/12/2007, Patrik Fältström wrote:

>In the current tables document, the algorithm defined takes for granted
>that the following things are stable for all future versions of Unicode.
>I.e. if for a specific codepoint one of the things below is valid in
>Unicode version N, the same has to be valid in Unicode version N+1.
>What has to be stable is what leads to a codepoint being in NEVER or
>ALWAYS.
>
>1. Codepoint that is in one of the general categories {Ll, Lu, Lo, Nd,
>Lm, Mn, Mc} is not to be moved away from that category.
>
>2. If NFKC(cp) != cp, then it has to stay like that.
>
>(If NFKC(cp) == cp and script is {latin, greek, cyrillic}, then that
>is not to be changed either.)
>
>3. If casefold(cp) != cp, then it has to stay like that.
>
>(If casefold(cp) == cp and script is {latin, greek, cyrillic}, then
>that is not to be changed either.)
>
>4. If the codepoint has one of the properties
>{Other_Default_Ignorable_Code_Point, Noncharacter_Code_Point}, then
>that property is not to be removed from the codepoint.
>
>5. If the codepoint is in one of the scripts {Cuneiform, Ugaritic,
>Old_Persian, Gothic, Old_Italic, Cypriot, Linear_B, Phoenician,
>Kharoshthi, Phags_Pa, Glagolitic, Shavian, Deseret, Osmanya, Ogham},
>then the codepoint is not to be moved away from those scripts.
>
>6. If the codepoint is in one of the blocks
>{Combining_Diacritical_Marks_for_Symbols, Musical_Symbols,
>Ancient_Greek_Musical_Notation}, then the codepoint is not to be moved
>away from those blocks.
>
>7. If the codepoint is in one of the scripts {latin, greek, cyrillic,
>han}, then it is not to be removed from one of those scripts.
>
>8. If the codepoint is not in one of the scripts {latin, greek,
>cyrillic, han}, then it is not to be moved to one of those scripts.
>
>9. If the codepoint is one of {0061..007A, 0030..0039, 002D, 3005,
>3007}, then nothing of that codepoint can change.
>
>    Patrik