New version, draft-faltstrom-idnabis-tables-02.txt, available

Wed Jun 20 00:46:42 CEST 2007

Speaking specifically to Harald's followup questions about Han...

Executive summary:

If you remove the effect of Rule H in the current draft,
by postulating Han to be a "stable script", then the net effect of the
current draft would be to sort Han out pretty much as I show below, 
except for U+3007 and U+303B (which I note below as being exceptional).

> Martin Duerst wrote:
> > Okay, let's give it one try.
> >
> > The CJK Unified Ideographs block is stable. ...

> > Is a statement like the above what you are looking for?
> >   
> That's exactly the kind of statement I'm looking for, at least.
> 
> Checking: By "CJK Unified Ideographs", you mean the range 4E00..9FFF, as 
> described in Blocks.txt from the Unicode database version 5.0.0.

That is correct.

>  From my reading of the "Scripts.txt" file, these have the Script 
> property of "Han".
> 
> You are not willing to speak for the following ranges also in the script 
> "Han":

Martin may also weigh in on this, but I would be glad to explain
in detail the status of each of the ranges Harald asked about.

> 
> 2E80..2E99    ; Han # So  [26] CJK RADICAL REPEAT..CJK RADICAL RAP
> 2E9B..2EF3    ; Han # So  [89] CJK RADICAL CHOKE..CJK RADICAL 
> C-SIMPLIFIED TURTLE
> 2F00..2FD5    ; Han # So [214] KANGXI RADICAL ONE..KANGXI RADICAL FLUTE

All of the CJK and Kangxi radicals are inappropriate for IDNs
because of their general category gc=So. They are also unnecessary
for representation of ordinary Japanese, Chinese, or Korean
text. Their inclusion in the standard is to a) enable the
representation of special text such as the radical indices
in dictionaries, and b) enable the description of Han ideographs
via the ideographic description sequence mechanism.

So definitely not ALWAYS. Furthermore, some of them should be
NEVER, because they are unstable under NFKC(cp).
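
For anyone who wants to check which radicals are affected, here is a
minimal sketch using Python's standard unicodedata module. The helper
name nfkc_unstable is my own invention for illustration, and the result
reflects whatever Unicode version the Python build ships with, not
necessarily 5.0.0.

```python
import unicodedata

def nfkc_unstable(first, last):
    """Return the code points in first..last (inclusive) where NFKC(cp) != cp."""
    return [cp for cp in range(first, last + 1)
            if unicodedata.normalize("NFKC", chr(cp)) != chr(cp)]

# Radical ranges as quoted from Scripts.txt in this message.
cjk_radicals = nfkc_unstable(0x2E80, 0x2E99) + nfkc_unstable(0x2E9B, 0x2EF3)
kangxi_radicals = nfkc_unstable(0x2F00, 0x2FD5)

print("Unstable CJK radicals:", [f"U+{cp:04X}" for cp in cjk_radicals])
print("Unstable Kangxi radicals:", len(kangxi_radicals))
```

The expectation is that only a couple of the CJK radicals show up as
unstable, while the Kangxi radicals all do.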

> 3005          ; Han # Lm       IDEOGRAPHIC ITERATION MARK

Appropriate for inclusion. ALWAYS is fine. gc=Lm is not inconsistent
with that.

> 3007          ; Han # Nl       IDEOGRAPHIC NUMBER ZERO

Appropriate for inclusion. This was separately discussed earlier
on the list. The ideographic number zero is used with the Han
ideographs for numbers to spell out numbers in radix 10 in
Chinese and Japanese (as opposed to the traditional Han number
system, which doesn't use a zero). So for completeness, this
character should be allowed. It needs an exception rule, because
gc=Nl are otherwise omitted. ALWAYS is fine.

> 3021..3029    ; Han # Nl   [9] HANGZHOU NUMERAL ONE..HANGZHOU NUMERAL NINE

Not needed. These are shop sign variant forms of ordinary Han
numerals. Already excluded by category. But these don't need to
be NEVER.

> 3038..303A    ; Han # Nl   [3] HANGZHOU NUMERAL TEN..HANGZHOU NUMERAL THIRTY

These are also not needed. However, they need to be NEVER because
they are unstable under NFKC(cp).
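
To see the difference between the two Hangzhou ranges concretely, a
rough sketch along the following lines prints what each numeral maps to
under NFKC. The dictionary name hangzhou is just an illustration, and
the mappings come from the Unicode data in the local Python build.

```python
import unicodedata

# Hangzhou numerals in the two ranges quoted in this message:
# 3021..3029 (ONE..NINE) and 3038..303A (TEN..THIRTY).
code_points = list(range(0x3021, 0x302A)) + list(range(0x3038, 0x303B))

# Map each code point to its NFKC form.
hangzhou = {cp: unicodedata.normalize("NFKC", chr(cp)) for cp in code_points}

for cp, mapped in hangzhou.items():
    if mapped == chr(cp):
        status = "stable"
    else:
        status = "maps to " + " ".join(f"U+{ord(c):04X}" for c in mapped)
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: {status}")
```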

> 303B          ; Han # Lm       VERTICAL IDEOGRAPHIC ITERATION MARK

Not needed. This is a form used only in vertically displayed
Han text, and is omitted for the same reason that the
vertical kana repeat marks 3031..3035 should be omitted.
No need to be NEVER, though.

> 3400..4DB5    ; Han # Lo [6582] CJK UNIFIED IDEOGRAPH-3400..CJK UNIFIED 
> IDEOGRAPH-4DB5
> 20000..2A6D6  ; Han # Lo [42711] CJK UNIFIED IDEOGRAPH-20000..CJK 
> UNIFIED IDEOGRAPH-2A6D6

These all notionally have the same status as the 4E00..9FFF range.
They are mostly rarer than the 4E00..9FFF range, but there are
occasional characters scattered in there which were added for
interoperability with other important East Asian character encodings.
In my opinion, the cleanest solution is to treat all of the
unified ideographs equally in terms of the tables for IDNA.
Then registries could choose to limit what they support to
common-use or country-specific subsets if they want to.

In terms of the IDNA protocol itself, the unified ideographs
all have the same status, and there is nothing unstable about them.

> F900..FA2D    ; Han # Lo [302] CJK COMPATIBILITY IDEOGRAPH-F900..CJK 
> COMPATIBILITY IDEOGRAPH-FA2D
> FA30..FA6A    ; Han # Lo  [59] CJK COMPATIBILITY IDEOGRAPH-FA30..CJK 
> COMPATIBILITY IDEOGRAPH-FA6A
> FA70..FAD9    ; Han # Lo [106] CJK COMPATIBILITY IDEOGRAPH-FA70..CJK 
> COMPATIBILITY IDEOGRAPH-FAD9
> 2F800..2FA1D  ; Han # Lo [542] CJK COMPATIBILITY IDEOGRAPH-2F800..CJK 
> COMPATIBILITY IDEOGRAPH-2FA1D

Almost all of the compatibility ideographs must be NEVER, because
they are unstable under NFKC(cp), by design. They have canonical
mappings to the unified ideograph they are equivalent to.

However, it is very important to realize that there are 12
exceptions in the F900..FA2D range, which are encoded in
the compatibility block, but which are actually *unified*
CJK ideographs, and must be treated as such. Those 12 are stable,
and need to be included in IDNs, because they are important for
mapping to some IBM sets.
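
One way to pick those 12 out mechanically is to look for the code points
in F900..FA2D that have no normalization mapping. This is only a sketch:
it uses NFKC stability as a stand-in for "actually unified", which is my
assumption for the purposes of the example, and the helper name
stable_in_range is hypothetical.

```python
import unicodedata

def stable_in_range(first, last):
    """Return the code points in first..last (inclusive) where NFKC(cp) == cp."""
    return [cp for cp in range(first, last + 1)
            if unicodedata.normalize("NFKC", chr(cp)) == chr(cp)]

# Compatibility-block code points that do not map to another ideograph.
unified_in_compat_block = stable_in_range(0xF900, 0xFA2D)

print(len(unified_in_compat_block))
print(" ".join(f"{cp:04X}" for cp in unified_in_compat_block))
```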

These conclusions were summarized in the draft tables I posted
for the IDN_Permitted and IDN_Never properties. Resummarizing
in terms of the Han script specifically, and in terms of
the category values in draft-faltstrom-idnabis-tables-02.txt,
this would be:

Han script characters that should be ALWAYS (i.e. IDN_Permitted=True):

3005         ; IDN_Permitted # Lm         IDEOGRAPHIC ITERATION MARK
3007         ; IDN_Permitted # Nl         IDEOGRAPHIC NUMBER ZERO
3400..4DB5   ; IDN_Permitted # Lo  [6582] CJK UNIFIED IDEOGRAPH-3400..CJK UNIFIED IDEOGRAPH-4DB5
4E00..9FBB   ; IDN_Permitted # Lo [20924] CJK UNIFIED IDEOGRAPH-4E00..CJK UNIFIED IDEOGRAPH-9FBB
FA0E..FA0F   ; IDN_Permitted # Lo     [2] CJK COMPATIBILITY IDEOGRAPH-FA0E..CJK COMPATIBILITY IDEOGRAPH-FA0F
FA11         ; IDN_Permitted # Lo         CJK COMPATIBILITY IDEOGRAPH-FA11
FA13..FA14   ; IDN_Permitted # Lo     [2] CJK COMPATIBILITY IDEOGRAPH-FA13..CJK COMPATIBILITY IDEOGRAPH-FA14
FA1F         ; IDN_Permitted # Lo         CJK COMPATIBILITY IDEOGRAPH-FA1F
FA21         ; IDN_Permitted # Lo         CJK COMPATIBILITY IDEOGRAPH-FA21
FA23..FA24   ; IDN_Permitted # Lo     [2] CJK COMPATIBILITY IDEOGRAPH-FA23..CJK COMPATIBILITY IDEOGRAPH-FA24
FA27..FA29   ; IDN_Permitted # Lo     [3] CJK COMPATIBILITY IDEOGRAPH-FA27..CJK COMPATIBILITY IDEOGRAPH-FA29
20000..2A6D6 ; IDN_Permitted # Lo [42711] CJK UNIFIED IDEOGRAPH-20000..CJK UNIFIED IDEOGRAPH-2A6D6

Summary description of that set: all the CJK unified ideographs plus U+3005
(those get in by the generic rules), with one exceptional addition (U+3007).
The generic rules would also admit U+303B, which is removed by exception.

Han script characters that should be NEVER (i.e. IDN_Never=True):

2E9F         ; IDN_Never # So         CJK RADICAL MOTHER
2EF3         ; IDN_Never # So         CJK RADICAL C-SIMPLIFIED TURTLE
2F00..2FD5   ; IDN_Never # So   [214] KANGXI RADICAL ONE..KANGXI RADICAL FLUTE
3038..303A   ; IDN_Never # Nl     [3] HANGZHOU NUMERAL TEN..HANGZHOU NUMERAL THIRTY
F900..FA0D   ; IDN_Never # Lo   [270] CJK COMPATIBILITY IDEOGRAPH-F900..CJK COMPATIBILITY IDEOGRAPH-FA0D
FA10         ; IDN_Never # Lo         CJK COMPATIBILITY IDEOGRAPH-FA10
FA12         ; IDN_Never # Lo         CJK COMPATIBILITY IDEOGRAPH-FA12
FA15..FA1E   ; IDN_Never # Lo    [10] CJK COMPATIBILITY IDEOGRAPH-FA15..CJK COMPATIBILITY IDEOGRAPH-FA1E
FA20         ; IDN_Never # Lo         CJK COMPATIBILITY IDEOGRAPH-FA20
FA22         ; IDN_Never # Lo         CJK COMPATIBILITY IDEOGRAPH-FA22
FA25..FA26   ; IDN_Never # Lo     [2] CJK COMPATIBILITY IDEOGRAPH-FA25..CJK COMPATIBILITY IDEOGRAPH-FA26
FA2A..FA2D   ; IDN_Never # Lo     [4] CJK COMPATIBILITY IDEOGRAPH-FA2A..CJK COMPATIBILITY IDEOGRAPH-FA2D
FA30..FA6A   ; IDN_Never # Lo    [59] CJK COMPATIBILITY IDEOGRAPH-FA30..CJK COMPATIBILITY IDEOGRAPH-FA6A
FA70..FAD9   ; IDN_Never # Lo   [106] CJK COMPATIBILITY IDEOGRAPH-FA70..CJK COMPATIBILITY IDEOGRAPH-FAD9
2F800..2FA1D ; IDN_Never # Lo   [542] CJK COMPATIBILITY IDEOGRAPH-2F800..CJK COMPATIBILITY IDEOGRAPH-2FA1D

Summary description of that set: all the Script=Han characters where NFKC(cp) != cp.

Han script characters that should be MAYBE (i.e. IDN_Permitted=False & IDN_Never=False):

2E80..2E99    ; Han # So  [26] CJK RADICAL REPEAT..CJK RADICAL RAP
2E9B..2E9E    ; Han # So   [4] CJK RADICAL CHOKE..CJK RADICAL DEATH
2EA0..2EF2    ; Han # So  [83] CJK RADICAL CIVILIAN..CJK RADICAL J-SIMPLIFIED TURTLE
3021..3029    ; Han # Nl   [9] HANGZHOU NUMERAL ONE..HANGZHOU NUMERAL NINE
303B          ; Han # Lm       VERTICAL IDEOGRAPHIC ITERATION MARK

Summary description of that set: all the rest of the Script=Han characters not
included in either of the first two sets.
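
The three-way split can be approximated in a few lines of Python. This
is a sketch under stated assumptions, not the rule set of the draft:
the "generic rules" are reduced here to "gc is Lo or Lm", the two
exceptions (U+3007, U+303B) are hard-coded, and the names HAN_RANGES and
classify_han are hypothetical. The ranges are the Script=Han ranges
quoted in this message (Unicode 5.0.0 assignments).

```python
import unicodedata
from collections import Counter

# Script=Han ranges as quoted in this message (inclusive bounds).
HAN_RANGES = [
    (0x2E80, 0x2E99), (0x2E9B, 0x2EF3), (0x2F00, 0x2FD5),
    (0x3005, 0x3005), (0x3007, 0x3007), (0x3021, 0x3029),
    (0x3038, 0x303A), (0x303B, 0x303B),
    (0x3400, 0x4DB5), (0x4E00, 0x9FBB),
    (0xF900, 0xFA2D), (0xFA30, 0xFA6A), (0xFA70, 0xFAD9),
    (0x20000, 0x2A6D6), (0x2F800, 0x2FA1D),
]

def classify_han(cp):
    """Classify a Han code point as ALWAYS, NEVER, or MAYBE."""
    ch = chr(cp)
    # NEVER: anything that is not stable under NFKC.
    if unicodedata.normalize("NFKC", ch) != ch:
        return "NEVER"
    if cp == 0x3007:   # exceptional addition (gc=Nl is otherwise omitted)
        return "ALWAYS"
    if cp == 0x303B:   # exceptional removal (gc=Lm would otherwise get in)
        return "MAYBE"
    # Assumed stand-in for the generic rules: letters get in.
    if unicodedata.category(ch) in ("Lo", "Lm"):
        return "ALWAYS"
    return "MAYBE"

counts = Counter(classify_han(cp)
                 for first, last in HAN_RANGES
                 for cp in range(first, last + 1))
print(counts)
```

The per-class counts it prints can be compared against the bracketed
range sizes in the three tables.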

> 
> (some of these may be eliminated by other rules in the current draft.)
> Correct?

Correct. See above. 

Furthermore, I would claim that Han clearly fits in the class of
"stable scripts" by Patrik's definition. There is absolutely NO
chance that future versions of Unicode would do anything to the
characters in the ALWAYS class that could throw them into the NEVER
category, nor is there any chance that any characters in the NEVER
category could ever be anything but NEVER. Note that the NEVER subset
is defined entirely by NFKC(cp) != cp, and that relationship is bound
by the normalization stability guarantee.

--Ken