CJK Incompatibilities (was: Re: Question about the agenda)

Kenneth Whistler kenw at sybase.com
Sat Mar 21 01:49:29 CET 2009


> >Perhaps we are simply reflecting a different
> >interpretation of "conclusions"?
> 
> Not really. The abstract of the JET draft says "[IDNA2008] will 
> cause incompatibilities for Chinese, Japanese and Korean (CJK) 
> scripts and languages." Section 3 of that draft gives a good 
> list of incompatibilities, none of which were listed in your 
> document. It does not seem fair to ask the WG "complete discussions,
> if necessary on IDNA2008 implications" while purposely ignoring
> some of the implications that have been brought to the WG's 
> attention, particularly those from major registries with a 
> lot of IDNA experience who spent the time to write them down 
> in an Internet Draft.

The incompatibilities noted in draft-jet-idnabis-cjk-localmapping-01
are a small subset of the incompatibilities noted and
discussed in:

http://www.unicode.org/reports/tr46/tr46-1.html

which we (the UTC), although not being a major registry,
have also spent the time to write down and bring to
the WG's attention.

To wit:

jet-idnabis-cjk-localmapping-01

3.1 Label separators

This deals with the well-known problem of the processing
conventions for U+3002 IDEOGRAPHIC FULL STOP and the
halfwidth and fullwidth versions as equivalent to "."
for label separators.

That is also accounted for in D-UTR #46.
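
As a rough illustration of the separator convention described above
(a sketch only, not taken from D-UTR #46 or the JET draft; the
function name split_labels is hypothetical), the three CJK dot
characters can be folded to "." before a name is split into labels:

```python
# Characters that IDNA2003 treats as equivalent to "." for label separation:
# U+3002 IDEOGRAPHIC FULL STOP, U+FF0E FULLWIDTH FULL STOP,
# U+FF61 HALFWIDTH IDEOGRAPHIC FULL STOP.
CJK_LABEL_SEPARATORS = "\u3002\uff0e\uff61"

# Translation table mapping each alternate separator to U+002E.
_SEPARATOR_TABLE = str.maketrans({ch: "." for ch in CJK_LABEL_SEPARATORS})


def split_labels(domain):
    """Hypothetical helper: fold the CJK separators to "." and split."""
    return domain.translate(_SEPARATOR_TABLE).split(".")


if __name__ == "__main__":
    print(split_labels("example\u3002com"))
```

This only touches the separators; it says nothing about whether the
resulting labels are themselves valid.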

3.2 Compatibility characters

That deals with the fullwidth letters and digits and
the halfwidth katakana. Those are mapped in IDNA2003.
They are simply DISALLOWED and not mapped in IDNA2008.

The preprocessing mapping in D-UTR #46 accounts for those.

Either IDNA2008 lets casing and NFKC mapping back into
the protocol to eliminate this kind of incompatibility
(which is widespread now -- hence the perceived need
for "local mappings" such as that described in the
JET draft), or

IDNA2008 stands as is, without case and NFKC mapping,
in which case D-UTR #46 will likely turn into the
de facto standard for preprocessing to maintain maximal
compatibility with existing IDNA2003 processing. That
would also eliminate the need for a CJK-specific local
mapping for this particular issue.
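
To make the kind of mapping at issue concrete, here is a minimal
sketch using Python's standard unicodedata module. It approximates
the effect with NFKC plus case folding; it is not the Nameprep
profile and not the D-UTR #46 mapping table, and the function name
fold_compat is hypothetical:

```python
import unicodedata


def fold_compat(label):
    """Hypothetical helper: approximate IDNA2003-style folding.

    NFKC maps fullwidth letters/digits and halfwidth katakana to their
    ordinary forms; casefold() then removes case distinctions. This is
    an approximation, not the actual Nameprep or D-UTR #46 tables.
    """
    return unicodedata.normalize("NFKC", label).casefold()


if __name__ == "__main__":
    # Fullwidth "ABC123" and halfwidth katakana KA.
    for sample in ("\uff21\uff22\uff23\uff11\uff12\uff13", "\uff76"):
        folded = fold_compat(sample)
        print([f"U+{ord(c):04X}" for c in sample], "->",
              [f"U+{ord(c):04X}" for c in folded])
```

Without a step of this sort, the fullwidth and halfwidth forms reach
the protocol unmapped, which is where the DISALLOWED result comes from.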

3.3 Exceptions

U+3005 IDEOGRAPHIC ITERATION MARK (there is a code point error
                                   in the JET draft)
                                   
U+30FB KATAKANA MIDDLE DOT (there is a name error in the JET draft)

Those are CONTEXTO in the current tables document for IDNA,
rather than PVALID, so there are potential incompatibilities
where they might be valid in an IDNA2003 label that would
be disallowed under IDNA2008 A.10 and A.12 CONTEXTO rules
for the two characters, respectively.

I have no idea how those two ended up getting CONTEXTO
designations in the tables document -- I must have been
snoozing when that happened. U+3005 should just get
derived as PVALID by regular category derivation. It is
no more contextually constrained than several other
iteration marks that are PVALID in the table, such
as U+309D HIRAGANA ITERATION MARK. So that is simply
a mistake and an overabundance of misplaced caution for
the tables document. U+3005 --> PVALID and that problem
goes away.

U+30FB KATAKANA MIDDLE DOT needed to have an exception
for the derivation, since it is General_Category=Po
in the Unicode Character Database. But, in my opinion,
the right answer here is to specify that it is simply
PVALID, and to give up on the overspecification of
exactly where it has to occur in a label, which is
causing the incompatibility that the JET draft notes.
If the tables document is changed this way, then this
unnecessary incompatibility also goes away.
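
The General_Category values behind this argument can be checked
against whatever version of the Unicode Character Database ships
with a given Python build. This is a sketch for inspection only; it
does not implement the tables document's derivation, and the helper
name general_category is hypothetical:

```python
import unicodedata


def general_category(code_point):
    """Hypothetical helper: General_Category of a code point, per the
    Unicode data bundled with this Python build."""
    return unicodedata.category(chr(code_point))


if __name__ == "__main__":
    # U+3005 and U+309D are both modifier letters (Lm);
    # U+30FB is other punctuation (Po), hence the need for an exception.
    for cp in (0x3005, 0x309D, 0x30FB):
        print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: "
              f"gc={general_category(cp)}")
```

U+3005 and U+309D share a category, which is the basis for treating
them alike, while U+30FB falls outside the letter categories and so
cannot become PVALID without an explicit exception.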

At that point, there is only the very generic
issue of mapping left (one part of which is the
treatment of label separators, which is technically
outside the context of the definition of the labels
themselves, anyway).

--Ken


