Table Derivation

Kenneth Whistler kenw at sybase.com
Fri Dec 21 22:51:34 CET 2007


Patrik,

In an effort to try to make the detailed technical feedback
on the table derivation in draft-faltstrom-idnabis-tables-03.txt
as straightforward as possible, while still reflecting what
we think are necessary corrections to the various categories
and derivation rules, we have written out a category-by-category
listing, with updates, following the lettering and order in
the draft, followed by a suggested simplification of the
required derivation rules.

First, to guide the discussion, we will try to capture what we
think is the *intent* of the main property values proposed
in the draft: ALWAYS, NEVER, MAYBE, CONTEXT, and UNASSIGNED.

NEVER consists of those characters that we want to categorically
rule out for IDNs, and should include:
   * characters that are neither letters, marks, nor digits
   * characters unstable under NFKC normalization
   * characters unstable under full case folding
   * default-ignorable characters (including control
       characters, noncharacters, variation selectors, etc.)
   * private-use characters
   * a short list of additional blocks not appropriate
       for IDNs

ALWAYS consists of those characters that we want to categorically
guarantee are available for IDNs (at the protocol level, although
there could always be additional restrictions at other levels),
and should include:
    * letters, digits, and combining marks
    * in particular, ASCII LDH
    * a small number of exceptional punctuation characters, 
        for various reasons, such as MIDDLE DOT
but should exclude:
    * anything in the NEVER category
    * a number of historic scripts for which there is no good
        argument currently to require them for IDNs
        
CONTEXT consists of those characters that are required for IDNs
*only* because of certain contextual rules, and which are not
otherwise already specified to be ALWAYS. These include only:
    * join control characters (U+200C ZWNJ, U+200D ZWJ)
    
UNASSIGNED consists of all Unicode code points not assigned
(in any particular version).
    
MAYBE consists of all other Unicode code points not determined
to be ALWAYS or NEVER or CONTEXT or UNASSIGNED, and in particular,
includes:
    * other assigned characters, including the historic scripts
    
If we can get general consensus that this is what we are trying
to accomplish with these values defined by the table for IDNA,
then it is possible to reexamine the proposed specific categories
used in the derivation of the table.

So, moving on to those, in order:

*************************************************************

2.1.1 Category A - Classes of Codepoints

  A: generalCategory(cp) is in {Ll, Lu, Lo, Nd, Lm, Mn, Mc}
  
  Let's give this category an actually meaningful and
  mnemonic label:
  
  Letters_Digits_Marks
  
  And in Unicode regex notation, using formal property
  values, this can be expressed as:
  
  [[:L:][:Nd:][:Mn:][:Mc:]]
  
  Note: "[:L:]" actually includes Ll, Lu, Lo, Lm, *and* Lt.
  As it turns out, all of the Lt and almost all of the Lu
  end up in NEVER because of the effect of full case folding,
  but it is simplest and most comprehensible to simply
  define Letters_Digits_Marks per se as including *all* the
  letters to start with.
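
  As an illustration only (not part of the draft), membership in
  Letters_Digits_Marks can be sketched with Python's standard
  unicodedata module. The function name is hypothetical, and the
  result reflects whatever Unicode version the Python build ships
  with, not necessarily 5.0 or 5.1.

```python
import unicodedata

# General categories behind [[:L:][:Nd:][:Mn:][:Mc:]].
# [:L:] expands to Ll, Lu, Lt, Lo and Lm.
LDM_CATEGORIES = {"Ll", "Lu", "Lt", "Lo", "Lm", "Nd", "Mn", "Mc"}


def is_letter_digit_mark(cp: int) -> bool:
    """Return True if the code point is a letter, decimal digit, or Mn/Mc mark."""
    return unicodedata.category(chr(cp)) in LDM_CATEGORIES
```

  Note that enclosing marks (Me) and letter numbers (Nl) fall
  outside this set, matching the regex above.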
           
*************************************************************

2.1.2 Category B - Normalization

  B: NFKC(cp) != cp
  
  As before, a meaningful and mnemonic label:
  
  Unstable_Under_NFKC
  
  Now, as stated, this is a condition that requires a
  conformant Unicode normalization API to check each
  code point and compute the result of the expression.
  However, it turns out that for this particular relation,
  there is a precomputed value stored in the Unicode
  Character Database files for each version of the
  standard. That is the derived NFKC_QuickCheck property,
  listed in DerivedNormalizationProps.txt in the UCD.
  And in Unicode regex notation, using formal property
  values, the condition we need can be expressed as:
  
  [:NFKC_QuickCheck=NO:]
  
  or
  
  [:NFKC_QC=N:]
  
  What this means is that the IDNA table derivation, for any
  particular version of the Unicode Standard, doesn't actually
  have to be implemented in code that depends on a conformant
  normalization API. It can simply get the results as
  a list from the data file, DerivedNormalizationProps.txt.
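
  For comparison, the direct computation of B: NFKC(cp) != cp can
  be sketched as follows. This is a minimal sketch that assumes
  Python's unicodedata module as the normalization API; the
  function name is hypothetical, and the answers follow the
  Unicode version bundled with Python.

```python
import unicodedata


def unstable_under_nfkc(cp: int) -> bool:
    """Return True if NFKC normalization changes the single code point."""
    ch = chr(cp)
    return unicodedata.normalize("NFKC", ch) != ch
```

  For example, U+00A0 NO-BREAK SPACE and U+FF21 FULLWIDTH LATIN
  CAPITAL LETTER A are unstable, while U+00E9 is stable because
  its canonical decomposition recomposes to itself.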

*************************************************************

2.1.3. Category C - Casefolding

  C: casefold(cp) != cp
  
  A meaningful and mnemonic label:
  
  Unstable_Under_Full_Case_Folding
  
  Computing this category also implies a Unicode API to check
  the case folding status of each character to calculate
  the result of the expression. However, there is a
  simple way to reliably get the correct result. If the
  character is listed in the UCD data file CaseFolding.txt,
  then it is unstable under full case folding -- precisely
  because it has an entry in that file specifying its 
  folding (to lowercase) to something other than itself.
  So computing the exact value of this category,
  Unstable_Under_Full_Case_Folding is simply a matter of
  extracting the list of code points in CaseFolding.txt.
  
  Unicode regex defines POSIX terminology for this as well,
  and for an implementation that supports Unicode regex,
  this is equivalent to the expression:
  
  [:^isCaseFolded:]
  
  ^isCaseFolded means the class of characters that
  are not already case folded, because if you case fold
  them, they turn into some other character (or sequence),
  i.e., it is equivalent to the class of characters that
  are Unstable_Under_Full_Case_Folding.
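
  A minimal sketch of the same test in Python, assuming that
  str.casefold() (which the Python documentation describes as
  implementing the Unicode case folding algorithm) is an
  acceptable stand-in for reading CaseFolding.txt. The function
  name is hypothetical.

```python
def unstable_under_full_case_folding(cp: int) -> bool:
    """Return True if full case folding changes the single code point."""
    ch = chr(cp)
    return ch.casefold() != ch
```

  U+00DF LATIN SMALL LETTER SHARP S is a useful test case: it is
  already lowercase, yet it folds to the sequence "ss", so it
  belongs to this category.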

*************************************************************

2.1.4. Category D - Ignorables

  D: property(cp) is in {Other_Default_Ignorable_Code_Point,
                         Noncharacter_Code_Point}
                         
  A meaningful and mnemonic label:
  
  Ignorables
  
  The definition of Ignorables in 2.1.4 is currently
  problematic in a number of ways, because it leaves out
  ignorable characters that really need to end up in the
  NEVER class. We believe the correct specification,
  expressed in terms of Unicode regex, should be, simply:
  
  [:Default_Ignorable_Code_Point:]
  
  The reason to use Default_Ignorable_Code_Point, rather
  than Other_Default_Ignorable_Code_Point, is that the
  latter is merely a contributory property in the UCD --
  its only function is to contribute to the stability
  of the derivation of Default_Ignorable_Code_Point,
  which itself is the relevant property. Furthermore,
  Default_Ignorable_Code_Point automatically picks up
  control codes, noncharacters, variation selectors,
  and surrogate code points, which also need to be
  treated as Ignorables for the purposes of the IDN
  table derivation, since they need to end up as NEVER.
  It is simpler to cover them all via
  Default_Ignorable_Code_Point than to have to
  define more categories separately for them.
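
  Python's unicodedata module does not expose this property, so
  one option is to read it from the UCD file
  DerivedCoreProperties.txt. The sketch below assumes the usual
  "range ; property # comment" line layout of UCD property files;
  the parser name and the embedded excerpt are illustrative only,
  not a complete or version-specific copy of the file.

```python
def parse_ucd_property(text: str, prop: str) -> set[int]:
    """Collect the code points that a UCD-style property file assigns to prop."""
    code_points: set[int] = set()
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != prop:
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        code_points.update(range(start, end + 1))
    return code_points


# Illustrative excerpt in UCD layout (not the full file).
SAMPLE = """\
00AD          ; Default_Ignorable_Code_Point # Cf       SOFT HYPHEN
200B..200F    ; Default_Ignorable_Code_Point # Cf   [5] ZERO WIDTH SPACE..RIGHT-TO-LEFT MARK
FE00..FE0F    ; Default_Ignorable_Code_Point # Mn  [16] VARIATION SELECTOR-1..VARIATION SELECTOR-16
0041..005A    ; Uppercase # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
"""

ignorables = parse_ucd_property(SAMPLE, "Default_Ignorable_Code_Point")
```

  The same parser shape would apply to other single-property
  listings in that layout, given the matching property name.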
  
*************************************************************

2.1.5. Category E - Historical Scripts

  E: script(cp) in {Cuneiform, Ugaritic, Old_Persian, Gothic,
                    Old_Italic, Cypriot, Linear_B, Phoenician,
                    Kharoshthi, Phags_Pa, Glagolitic, Shavian,
                    Deseret, Osmanya, Ogham}
                    
  A meaningful and mnemonic label:
  
  Historic_Scripts
  
  We think this category definition is fine in general, but
  needs to be extended to cover the additional scripts
  from Unicode 5.1, as well as a few more of the scripts
  already encoded as of Unicode 5.0 that are essentially
  historical (although not ancient) and which have very,
  very limited current use.
  
  The following shows the best list we've been able to get
  consensus on to date, expressed in Unicode Regex. We've
  interspersed comments and subcategorized, to help
  clarify the reasons for each in the list.
  
  /* Really, really long-dead scripts */

  [:script=Cari:] Carian
  [:script=Cprt:] Cypriot
  [:script=Glag:] Glagolitic
  [:script=Goth:] Gothic
  [:script=Khar:] Kharoshthi
  [:script=Ital:] Old Italic
  [:script=Linb:] Linear-B
  [:script=Lyci:] Lycian
  [:script=Lydi:] Lydian
  [:script=Phag:] Phags-pa
  [:script=Phnx:] Phoenician
  [:script=Ugar:] Ugaritic
  [:script=Xpeo:] Old Persian
  [:script=Xsux:] Sumero-Akkadian Cuneiform

  /* Dead, with minor current use, but not appropriate for IDNs */

  [:script=Ogam:] Ogham
  [:script=Runr:] Runic

  /* Recently created, but no significant use */

  [:script=Dsrt:] Deseret
  [:script=Osma:] Osmanya
  [:script=Shaw:] Shavian

  /* Historic, with current use primarily as liturgical scripts */

  [:script=Copt:] Coptic
  [:script=Syrc:] Syriac

  /* Historic minority scripts with small communities, little use */

  [:script=Bugi:] Buginese
  [:script=Buhd:] Buhid
  [:script=Hano:] Hanunoo
  [:script=Rjng:] Rejang
  [:script=Sund:] Sundanese
  [:script=Sylo:] Syloti Nagri
  [:script=Tagb:] Tagbanwa
  [:script=Tglg:] Tagalog

  Keep in mind that if Historic_Scripts as a category is used
  in the table derivation to result in MAYBE, rather than NEVER
  status, then it is appropriate to expand the list to include
  rather more of the lesser used scripts, because this decision
  doesn't amount to an irrevocable prevention of all possible
  future use in IDNs for the script.

*************************************************************

2.1.6. Category F - Blocks of Characters

  F: block(cp) in {Combining_Diacritical_Marks_for_Symbols,
                   Musical_Symbols, Ancient_Greek_Musical_Notation}
                   
  A meaningful and mnemonic label:
  
  Inappropriate_Blocks
  
  This definition seems fine, except that further investigation
  has turned up a reason for including one more recently
  encoded block that contains a combining mark, the Phaistos_Disc
  block. So expressed in Unicode regex, this would be:
  
  [[:block=Combining_Diacritical_Marks_for_Symbols:]
   [:block=Musical_Symbols:]
   [:block=Ancient_Greek_Musical_Notation:]
   [:block=Phaistos_Disc:]]
   
  Note that the only reason for needing this explicit list
  of blocks is because they contain some combining marks
  that otherwise would end up in the ALWAYS class. An
  equivalent way of handling those would be to simply list
  those ranges of combining marks explicitly for
  the Exceptional_NEVER_List (see below).
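
  To see which combining marks these blocks actually contribute,
  the following sketch hard-codes the four block ranges and lists
  the Mn and Mc characters inside them. The helper names are
  hypothetical, and general categories come from the Unicode
  version bundled with Python.

```python
import unicodedata

# Block ranges for the four blocks named above.
INAPPROPRIATE_BLOCKS = {
    "Combining_Diacritical_Marks_for_Symbols": (0x20D0, 0x20FF),
    "Musical_Symbols": (0x1D100, 0x1D1FF),
    "Ancient_Greek_Musical_Notation": (0x1D200, 0x1D24F),
    "Phaistos_Disc": (0x101D0, 0x101FF),
}


def in_inappropriate_block(cp: int) -> bool:
    """Return True if the code point lies in one of the listed blocks."""
    return any(lo <= cp <= hi for lo, hi in INAPPROPRIATE_BLOCKS.values())


def combining_marks_in_blocks() -> set[int]:
    """Mn/Mc characters in the listed blocks: those that would otherwise be ALWAYS."""
    return {
        cp
        for lo, hi in INAPPROPRIATE_BLOCKS.values()
        for cp in range(lo, hi + 1)
        if unicodedata.category(chr(cp)) in {"Mn", "Mc"}
    }
```

  The second function produces the explicit list that could be
  moved into the Exceptional_NEVER_List instead.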

*************************************************************

2.2.1. Category G - ASCII LDH

  G: cp is in {0061..007A, 0030..0039, 002D}
  
  A meaningful and mnemonic label:
  
  ASCII_LDH
  
  Expressed in Unicode regex:
  
  [\u0061-\u007A\u0030-\u0039\u002D]
  
  Note that the only technical reason to define this as a
  category for the table derivation is to get
  U+002D HYPHEN-MINUS in the ALWAYS class, since the
  letters and digits are already covered by the
  handling of Letters_Digits_Marks. U+002D could just
  as simply be included in the Exceptional_ALWAYS_List
  below instead, with the same effect for the table
  derivation.
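
  That observation can be checked mechanically. The sketch below
  (variable names are illustrative) builds ASCII_LDH and removes
  everything whose general category already places it in
  Letters_Digits_Marks.

```python
import unicodedata

# a-z, 0-9 and HYPHEN-MINUS, as in the regex above.
ASCII_LDH = set(range(0x61, 0x7B)) | set(range(0x30, 0x3A)) | {0x2D}

LDM_CATEGORIES = {"Ll", "Lu", "Lt", "Lo", "Lm", "Nd", "Mn", "Mc"}

# Members of ASCII_LDH that Letters_Digits_Marks does not already cover.
not_covered = {
    cp for cp in ASCII_LDH
    if unicodedata.category(chr(cp)) not in LDM_CATEGORIES
}
```

  Only U+002D HYPHEN-MINUS (general category Pd) remains.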
  
*************************************************************

2.2.2. Category H - Exceptions

  H: cp in {00B7, 05F3, 05F4, [3005,] 3007, 303B, 30FB}
  
  The draft currently provides a subtable specifying
  which of these end up with the value ALWAYS and
  which MAYBE (YES). We think it is much more straightforward
  to simply define this as an exceptional inclusion list
  for the ALWAYS class.
  
  If consensus among the idna-update group is that some of 
  these don't actually belong in ALWAYS, then we can simply 
  remove them from this list before the specification is
  final, and they will end up in NEVER by their general
  category as punctuation. 
  
  A meaningful and mnemonic label:
  
  Exceptional_ALWAYS_List
  
  Expressed in Unicode regex:
  
  [\u00B7\u05F3\u05F4\u3007\u30FB]
  
  Note that U+3005 and U+303B don't need to be included
  specifically in this list, as their general category Lm
  already results in the correct derivation for them.
  
  Inclusion of U+3007 IDEOGRAPHIC NUMBER ZERO in ALWAYS
  seems uncontroversial, so that really only leaves
  the two middle dots and the Hebrew geresh/gershayim
  to come to consensus about.
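
  A quick way to see why U+3005 and U+303B drop out of the list
  is to look up the general category of each code point in the
  draft's Category H. This is a sketch using Python's unicodedata
  module; the variable names are illustrative.

```python
import unicodedata

# Code points listed in the draft's Category H.
DRAFT_CATEGORY_H = [0x00B7, 0x05F3, 0x05F4, 0x3005, 0x3007, 0x303B, 0x30FB]

LDM_CATEGORIES = {"Ll", "Lu", "Lt", "Lo", "Lm", "Nd", "Mn", "Mc"}

categories = {cp: unicodedata.category(chr(cp)) for cp in DRAFT_CATEGORY_H}

# Entries that Letters_Digits_Marks already handles by general category.
already_covered = [cp for cp, gc in categories.items() if gc in LDM_CATEGORIES]
```

  The two iteration marks come out as Lm, while U+3007 is Nl and
  so does need the explicit exception.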

*************************************************************

2.2.3 Category I - CJK Subsetting

  I: script(cp) is in {Han}
  
  This category's only use in the draft is in the 3.1.2 step
  in the derivation, but we believe it is not actually
  required for the derivation, and should be omitted.
  
  Instead, what would be useful is an exception category
  comparable to Category H, but focussed on any required
  exceptions for the NEVER class, so:
  
2.2.3 Category I - Exceptions (2)

  A meaningful and mnemonic label:
  
  Exceptional_NEVER_List
  
  Currently, this list has no elements. Its purpose would
  be to stand as a placeholder in the table derivation,
  for the potential (but unlikely) situation in the
  future when an explicit exception could be needed
  in order to keep the NEVER class backwards compatible.
  
  Together, the Exceptional_ALWAYS_List and the
  Exceptional_NEVER_List provide a mechanism for
  keeping both the statement of the table derivation
  and the class values ALWAYS and NEVER completely
  backwards compatible.
  
*************************************************************

2.2.4 Category J - Character Groups Requiring Special Treatment

  J: generalCategory(cp) is in {Cf}
  
  The rationale for this category is to derive the CONTEXT
  value in the table, but it is actually incorrect as stated.
  Almost all of the Cf (format control) characters should
  actually be Ignorables, resulting in a NEVER value for
  them in the table. The actual membership of the class
  of characters that should get the CONTEXT value consists
  of just the two joiner characters.
  
  So a meaningful and mnemonic label:
  
  Join_Controls
  
  And expressed as Unicode regex:
  
  [:Join_Control:]

*************************************************************

2.2.5 Category K - Unassigned codepoints

  K: cp is unassigned
  
  A meaningful and mnemonic label:
  
  Unassigned
  
  And expressed as Unicode regex:
  
  [[:Cn:]-[:Noncharacter_Code_Point:]]
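
  The subtraction matters because noncharacters also carry
  general category Cn. A minimal sketch, with hypothetical
  function names and with "assigned" judged by the Unicode
  version bundled with Python:

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """U+FDD0..U+FDEF plus the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_unassigned(cp: int) -> bool:
    """General category Cn, excluding the noncharacters."""
    return unicodedata.category(chr(cp)) == "Cn" and not is_noncharacter(cp)
```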

*************************************************************

[Proposed new section]

2.2.6 Category L - Modifier Symbols

  These are characters used in some transliteration methods
  and linguistic notations. They are similar to modifier letters,
  but are not currently known to be needed by any customary modern
  orthographies. However, in some cases these characters come to be
  used as part of the normal orthography for a language, and thus
  may change to modifier letters (Lm). To allow for this possibility,
  they need to be called out separately in the table derivation,
  so they can end up in the MAYBE class, instead of NEVER.

  A meaningful and mnemonic label:

  Modifier_Symbols

  And expressed as Unicode regex:

  [:Sk:]

*************************************************************

Now turning to the derivation of the values for the table
itself, rather than having a stepwise algorithm, it is much
simpler (and much easier to evaluate) if all of the above
mnemonic labels are taken as Unicode set names. Then all
of the required values are simply the result of simple
set addition and subtraction operations.

In particular, the table can be derived as follows:

UNASSIGNED = Unassigned - Ignorables

CONTEXT = Join_Controls

NEVER   = ALL
        - Unassigned
        - Letters_Digits_Marks
        - Modifier_Symbols
        + Unstable_Under_NFKC
        + Unstable_Under_Full_Case_Folding
        + Ignorables
        + Inappropriate_Blocks
        + Exceptional_NEVER_List
        - Exceptional_ALWAYS_List
        - ASCII_LDH
        - CONTEXT

ALWAYS  = Letters_Digits_Marks
        - NEVER
        - Historic_Scripts
        + Exceptional_ALWAYS_List
        + ASCII_LDH
        + CONTEXT

MAYBE   = ALL - UNASSIGNED - NEVER - ALWAYS

Notice how closely this formal statement parallels the
informal statement of the intent of the various classes
at the start of our message. We believe this is a
Good Thing (tm), because it makes the formal statement
of the table derivation easy to understand and to
verify.        
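
The set expressions above can be transcribed almost literally.
The following is a sketch only: the function name and the
dictionary keys are illustrative, each step is applied left to
right in the order written above, and the input is a hand-picked
toy universe of twelve code points rather than real UCD data.

```python
def derive_table(s: dict[str, set[int]]) -> dict[str, set[int]]:
    """Apply the set additions and subtractions in the order given above."""
    unassigned = s["Unassigned"] - s["Ignorables"]
    context = set(s["Join_Controls"])

    never = s["ALL"] - s["Unassigned"] - s["Letters_Digits_Marks"] - s["Modifier_Symbols"]
    never |= (s["Unstable_Under_NFKC"] | s["Unstable_Under_Full_Case_Folding"]
              | s["Ignorables"] | s["Inappropriate_Blocks"]
              | s["Exceptional_NEVER_List"])
    never -= s["Exceptional_ALWAYS_List"] | s["ASCII_LDH"] | context

    always = s["Letters_Digits_Marks"] - never - s["Historic_Scripts"]
    always |= s["Exceptional_ALWAYS_List"] | s["ASCII_LDH"] | context

    maybe = s["ALL"] - unassigned - never - always
    return {"UNASSIGNED": unassigned, "NEVER": never,
            "ALWAYS": always, "MAYBE": maybe}


# Toy universe; set membership is filled in by hand for illustration.
toy = {
    "ALL": {0x21, 0x2D, 0x41, 0x61, 0xAA, 0xAD, 0xB7,
            0x02C2, 0x0378, 0x200D, 0x20D0, 0x10330},
    "Unassigned": {0x0378},
    "Letters_Digits_Marks": {0x41, 0x61, 0xAA, 0x20D0, 0x10330},
    "Modifier_Symbols": {0x02C2},
    "Unstable_Under_NFKC": {0xAA},
    "Unstable_Under_Full_Case_Folding": {0x41},
    "Ignorables": {0xAD, 0x200D},
    "Inappropriate_Blocks": {0x20D0},
    "Exceptional_NEVER_List": set(),
    "Exceptional_ALWAYS_List": {0xB7},
    "ASCII_LDH": {0x61, 0x2D},
    "Join_Controls": {0x200D},
    "Historic_Scripts": {0x10330},
}

classes = derive_table(toy)
```

In this toy run the Gothic letter U+10330 and the modifier symbol
U+02C2 land in MAYBE, U+200D reaches ALWAYS only through CONTEXT,
and the four resulting sets are disjoint and cover the universe.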
         
If you derive this way, all the historic scripts end up
in MAYBE (less whatever NEVER characters they contain,
such as uppercase letters, characters unstable under NFKC,
etc.).

Also, the result of the table derivation is that
Latin, Greek, Cyrillic, Han, plus all the other major-use
current official scripts of the world end up in ALWAYS
(again less whatever NEVER characters they contain).

As for the NEVER class itself, we note that defined this
way, it matches very closely the list currently derived
in draft-faltstrom-idnabis-tables-03.txt. However, in
examining the differences in detail, there are several
errors apparent in the tables-03.txt NEVER assignment.
In particular:

  1. 01D6, 01D8, 01DA, 01DC, 01DF, 01E1, 01FB, 022B,
     022D, and 0231 are all listed as NEVER, but should
     not be. These are all precomposed letters with
     two accents, and there appears to be an error
     in the way NFKC(cp) was calculated for tables-03.txt.
     
  2. 0345 and 037A are not listed as NEVER, but should be.
     0345 is unstable under full case folding, and
     NFKC(037A) != 037A -- so the latter also suggests
     an error in the calculations of NFKC(cp).
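
Both points can be spot-checked with a few lines of Python. This
sketch tests only the two stability conditions, using the Unicode
data bundled with Python rather than Unicode 5.0 itself; the
helper and variable names are illustrative.

```python
import unicodedata

# Code points from item 1, listed as NEVER in tables-03.txt.
LISTED_NEVER_IN_ERROR = [0x01D6, 0x01D8, 0x01DA, 0x01DC, 0x01DF,
                         0x01E1, 0x01FB, 0x022B, 0x022D, 0x0231]


def nfkc_stable(cp: int) -> bool:
    """True if NFKC leaves the single code point unchanged."""
    ch = chr(cp)
    return unicodedata.normalize("NFKC", ch) == ch


def fold_stable(cp: int) -> bool:
    """True if full case folding leaves the single code point unchanged."""
    ch = chr(cp)
    return ch.casefold() == ch


# Item 1: code points that pass both stability tests.
stable_on_both = [cp for cp in LISTED_NEVER_IN_ERROR
                  if nfkc_stable(cp) and fold_stable(cp)]

# Item 2: the two code points that should be NEVER.
u0345_folds = not fold_stable(0x0345)
u037a_normalizes = not nfkc_stable(0x037A)
```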

As long as the exception lists are carefully managed
(and ideally, never get changed at all), the four
classes, UNASSIGNED, NEVER, ALWAYS, and MAYBE
create a partition of all Unicode code points. Note
that the order of the set additions and set subtractions
is significant in the above definitions, since set
subtraction is not commutative.

--Ken Whistler & Mark Davis




