New version, draft-faltstrom-idnabis-tables-02.txt, available

Kenneth Whistler kenw at sybase.com
Thu Jun 21 01:02:56 CEST 2007


Patrik suggested:

> On 20 jun 2007, at 00.46, Kenneth Whistler wrote:
> 
> > Appropriate for inclusion.
> 
> To make my (as editor of the document) life easier, when going  
> through codepoint examples, can people (not only you Ken, do not take  
> this personally please) use terminology that maps directly to the  
> tables document?
> 
> For example, I presume you with "Appropriate for inclusion" imply  
> that it is "Appropriate with property value ALWAYS"?

O.k., I can do that where it helps. However,
in that particular note, what I was doing was providing
first the discursive answer, group by group, to Harald's
questions about the other Han characters. I then
resummarized toward the end very explicitly, with exact
lists of characters associated with ALWAYS, NEVER, and MAYBE.

[To keep the context for this together, I have copied that
summary section to the bottom of this note, as well.]

> And when you say "All of the CJK and Kangxi radicals are  
> inappropriate for IDN's
> because of their general category gc=So." you say "it is ok for all  
> CJK and Kangxi radicals to be in NEVER as they do not match rule A  
> (gc=So)".

If you look at the summary descriptions below, I tried to do that
in a way that doesn't require matching things up item by item
against the justification section.

Restating it for you somewhat differently, in hopes of meeting
your criteria for checking on rules:

1. Subset A of Han characters (see below for the list)
   belong in ALWAYS. (i.e. IDN_Permitted=True)
   
   Almost all of those fall out directly from the rules,
   if you omit Rule H's restriction on Han. In other words,
   I consider those to constitute a disproof of Rule H
   as stated.
   
   Two are exceptions to rule derivation:
   
   a. U+3007 IDEOGRAPHIC NUMBER ZERO would be MAYBE NOT
      by the existing rules, but should be ALWAYS.
      
   b. U+303B VERTICAL IDEOGRAPHIC ITERATION MARK would
      be MAYBE YES by the existing rules, but removing
      the restriction implied by Rule H (as required
      to get the other characters to be ALWAYS) should not
      turn this character also to ALWAYS.
      
2. Subset B of Han characters (see below for the list)
   belong in NEVER. (i.e. IDN_Never=True)
   
   All of those follow directly from the application
   of the criteria Script=Han AND NFKC(cp) != cp.
   In other words, they are a proof of the correctness
   of Rule B.
   
3. Subset C of Han characters (see below for the list)
   belong in MAYBE.
   
   If the rules are corrected to get Subsets A and B
   correct for Han, then these fall out automatically as the
   residue that are neither ALWAYS nor NEVER.
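
[Illustrative sketch, not part of the tables document: the
three-way split above can be read as a function of the two
boolean properties used in this note, IDN_Permitted and
IDN_Never. The names Category and category() are hypothetical,
and treating "both True" as an error is an assumption; the note
only implies that the two properties are mutually exclusive.]

```python
from enum import Enum


class Category(Enum):
    ALWAYS = "ALWAYS"
    NEVER = "NEVER"
    MAYBE = "MAYBE"


def category(idn_permitted: bool, idn_never: bool) -> Category:
    """Map the two boolean properties onto ALWAYS / NEVER / MAYBE."""
    # Assumption: a code point cannot be both permitted and excluded.
    if idn_permitted and idn_never:
        raise ValueError("IDN_Permitted and IDN_Never cannot both be True")
    if idn_permitted:
        return Category.ALWAYS
    if idn_never:
        return Category.NEVER
    # Neither property set: the residue described in item 3.
    return Category.MAYBE
```

[With this reading, Subset C needs no list of its own: it is
whatever is left once Subsets A and B are fixed.]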
   
> 
> I.e. I need to understand when you find bugs in the rule set, and  
> when you prove the rules are correct. And I appreciate all help in  
> doing that matching while you discuss.

I hope that helps clarify for you what I intended in
the analysis for the Han set.

--Ken

[Context copied from earlier note follows.]

[Subset A]

Han script characters that should be ALWAYS (i.e. IDN_Permitted=True):

3005         ; IDN_Permitted # Lm         IDEOGRAPHIC ITERATION MARK
3007         ; IDN_Permitted # Nl         IDEOGRAPHIC NUMBER ZERO
3400..4DB5   ; IDN_Permitted # Lo  [6582] CJK UNIFIED IDEOGRAPH-3400..CJK UNIFIED IDEOGRAPH-4DB5
4E00..9FBB   ; IDN_Permitted # Lo [20924] CJK UNIFIED IDEOGRAPH-4E00..CJK UNIFIED IDEOGRAPH-9FBB
FA0E..FA0F   ; IDN_Permitted # Lo     [2] CJK COMPATIBILITY IDEOGRAPH-FA0E..CJK COMPATIBILITY IDEOGRAPH-FA0F
FA11         ; IDN_Permitted # Lo         CJK COMPATIBILITY IDEOGRAPH-FA11
FA13..FA14   ; IDN_Permitted # Lo     [2] CJK COMPATIBILITY IDEOGRAPH-FA13..CJK COMPATIBILITY IDEOGRAPH-FA14
FA1F         ; IDN_Permitted # Lo         CJK COMPATIBILITY IDEOGRAPH-FA1F
FA21         ; IDN_Permitted # Lo         CJK COMPATIBILITY IDEOGRAPH-FA21
FA23..FA24   ; IDN_Permitted # Lo     [2] CJK COMPATIBILITY IDEOGRAPH-FA23..CJK COMPATIBILITY IDEOGRAPH-FA24
FA27..FA29   ; IDN_Permitted # Lo     [3] CJK COMPATIBILITY IDEOGRAPH-FA27..CJK COMPATIBILITY IDEOGRAPH-FA29
20000..2A6D6 ; IDN_Permitted # Lo [42711] CJK UNIFIED IDEOGRAPH-20000..CJK UNIFIED IDEOGRAPH-2A6D6

Summary description of that set: all the CJK unified ideographs + U+3005 (those get
in by the generic rules), and one exceptional addition (U+3007) and one
exceptional removal (U+303B).

[Subset B]

Han script characters that should be NEVER (i.e. IDN_Never=True):

3038..303A   ; IDN_Never # Nl     [3] HANGZHOU NUMERAL TEN..HANGZHOU NUMERAL THIRTY
2E9F         ; IDN_Never # So         CJK RADICAL MOTHER
2EF3         ; IDN_Never # So         CJK RADICAL C-SIMPLIFIED TURTLE
2F00..2FD5   ; IDN_Never # So   [214] KANGXI RADICAL ONE..KANGXI RADICAL FLUTE
F900..FA0D   ; IDN_Never # Lo   [270] CJK COMPATIBILITY IDEOGRAPH-F900..CJK COMPATIBILITY IDEOGRAPH-FA0D
FA10         ; IDN_Never # Lo         CJK COMPATIBILITY IDEOGRAPH-FA10
FA12         ; IDN_Never # Lo         CJK COMPATIBILITY IDEOGRAPH-FA12
FA15..FA1E   ; IDN_Never # Lo    [10] CJK COMPATIBILITY IDEOGRAPH-FA15..CJK COMPATIBILITY IDEOGRAPH-FA1E
FA20         ; IDN_Never # Lo         CJK COMPATIBILITY IDEOGRAPH-FA20
FA22         ; IDN_Never # Lo         CJK COMPATIBILITY IDEOGRAPH-FA22
FA25..FA26   ; IDN_Never # Lo     [2] CJK COMPATIBILITY IDEOGRAPH-FA25..CJK COMPATIBILITY IDEOGRAPH-FA26
FA2A..FA2D   ; IDN_Never # Lo     [4] CJK COMPATIBILITY IDEOGRAPH-FA2A..CJK COMPATIBILITY IDEOGRAPH-FA2D
FA30..FA6A   ; IDN_Never # Lo    [59] CJK COMPATIBILITY IDEOGRAPH-FA30..CJK COMPATIBILITY IDEOGRAPH-FA6A
FA70..FAD9   ; IDN_Never # Lo   [106] CJK COMPATIBILITY IDEOGRAPH-FA70..CJK COMPATIBILITY IDEOGRAPH-FAD9
2F800..2FA1D ; IDN_Never # Lo   [542] CJK COMPATIBILITY IDEOGRAPH-2F800..CJK COMPATIBILITY IDEOGRAPH-2FA1D

Summary description of that set: all the Script=Han characters where NFKC(cp) != cp.
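
[Illustrative sketch: the NFKC(cp) != cp criterion can be
spot-checked against the Subset B ranges listed above with
Python's standard unicodedata module. The ranges are copied from
this note; the helper names are hypothetical. Python has no
built-in Script property, so the sketch assumes the caller only
passes Han code points, and it uses whatever Unicode version the
interpreter ships with rather than the version this note was
written against.]

```python
import unicodedata

# Subset B ranges as listed in this note (inclusive).
SUBSET_B = [
    (0x3038, 0x303A), (0x2E9F, 0x2E9F), (0x2EF3, 0x2EF3),
    (0x2F00, 0x2FD5), (0xF900, 0xFA0D), (0xFA10, 0xFA10),
    (0xFA12, 0xFA12), (0xFA15, 0xFA1E), (0xFA20, 0xFA20),
    (0xFA22, 0xFA22), (0xFA25, 0xFA26), (0xFA2A, 0xFA2D),
    (0xFA30, 0xFA6A), (0xFA70, 0xFAD9), (0x2F800, 0x2FA1D),
]


def changes_under_nfkc(cp: int) -> bool:
    """True if NFKC normalization maps the code point to something else."""
    ch = chr(cp)
    return unicodedata.normalize("NFKC", ch) != ch


def unchanged_in(ranges):
    """Return the code points in the given ranges that NFKC leaves alone."""
    return [
        cp
        for lo, hi in ranges
        for cp in range(lo, hi + 1)
        if not changes_under_nfkc(cp)
    ]
```

[If the criterion holds as described, unchanged_in(SUBSET_B)
should come back empty, while the Subset A entries in the
compatibility block, such as U+FA0E, should not change under
NFKC.]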

[Subset C]

Han script characters that should be MAYBE (i.e. IDN_Permitted=False & IDN_Never=False):

2E80..2E99    ; Han # So  [26] CJK RADICAL REPEAT..CJK RADICAL RAP
2E9B..2E9E    ; Han # So   [4] CJK RADICAL CHOKE..CJK RADICAL DEATH
2EA0..2EF2    ; Han # So  [83] CJK RADICAL CIVILIAN..CJK RADICAL J-SIMPLIFIED TURTLE
3021..3029    ; Han # Nl   [9] HANGZHOU NUMERAL ONE..HANGZHOU NUMERAL NINE
303B          ; Han # Lm       VERTICAL IDEOGRAPHIC ITERATION MARK

Summary description of that set: all the rest of the Script=Han characters not
included in either of the first two sets.



