New version, draft-faltstrom-idnabis-tables-02.txt, available

Thu Jun 21 01:23:05 CEST 2007

Patrik asked:

> On 20 jun 2007, at 00.46, Kenneth Whistler wrote:
> 
> >> 3007          ; Han # Nl       IDEOGRAPHIC NUMBER ZERO
> >
> > Appropriate for inclusion. This was separately discussed earlier
> > on the list. The ideographic number zero is used with the Han
> > ideographs for numbers to spell out numbers in radix 10 in
> > Chinese and Japanese (as opposed to the traditional Han number
> > system, which doesn't use a zero). So for completeness, this
> > character should be allowed. It needs an exception rule, because
> > gc=Nl are otherwise omitted. ALWAYS is fine.
> 
> I am not at this stage prepared to add rules that explicitly map to  
> individual codepoints. Is what you say the only way of getting the  
> proper derived value to add such a rule?

Yes.

Well, for many characters you might be able to fiddle with
enough properties to get a derivation in terms of properties,
without reference to code points.

For example: (Script=Han) & (gc=Nl) & (Numeric_Value=0) would
result in a set with exactly one member, namely: {3007}
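
As a rough illustration of that property-only derivation, here is a
minimal Python sketch. Python's standard library exposes gc and
Numeric_Value but not Script, so the Han test is an assumption: it is
a small hand-copied table of the Han ranges that contain gc=Nl
characters, not a full Script lookup. The names HAN_NL_RANGES, is_han
and rule_n are made up for this example.

```python
import unicodedata

# Assumption: hand-copied excerpt of Han-script ranges that contain
# gc=Nl characters (U+3007 and the Hangzhou numerals). This stands in
# for a real Script(cp) lookup, which the standard library lacks.
HAN_NL_RANGES = [(0x3007, 0x3007), (0x3021, 0x3029), (0x3038, 0x303A)]


def is_han(cp: int) -> bool:
    """Stand-in for Script(cp)=Han, valid only for the excerpt above."""
    return any(lo <= cp <= hi for lo, hi in HAN_NL_RANGES)


def rule_n(cp: int) -> bool:
    """(Script(cp)=Han) & (gc(cp)=Nl) & (Numeric_Value(cp)=0)."""
    ch = chr(cp)
    return (
        is_han(cp)
        and unicodedata.category(ch) == "Nl"
        and unicodedata.numeric(ch, None) == 0
    )


# Collect every code point in the excerpt that satisfies the rule.
matches = [
    cp
    for lo, hi in HAN_NL_RANGES
    for cp in range(lo, hi + 1)
    if rule_n(cp)
]
print([f"U+{cp:04X}" for cp in matches])
```

The Hangzhou numerals in the same ranges have non-zero numeric values,
so they fall out at the last test.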

But fiddling with ad hoc combinations of character properties
simply to avoid mentioning exceptional code points strikes me
as being obscurantist at best.

What do you think is clearer:

Rule N: Han numeric letters with value of zero.

  (Script(cp)=Han) & (gc(cp)=Nl) & (Numeric_Value(cp)=0) 

  The rule is intended to include Han numeric letters
  with the value zero, commonly used in non-traditional
  CJK numeric expressions appropriate in identifiers,
  and will not be changed.

or:

U+3007 IDEOGRAPHIC NUMBER ZERO should be included in the
  inclusion set (value ALWAYS).

In other words, what I'm saying is that at a certain point
trying to continue finding rules for exceptions to other
rules is counterproductive, and it is easier to simply
list a small set of exceptional code points and be done
with it.

You have been expressing the concern that if you give up
on the generic rules as the *only* mechanism for defining
the inclusion set, and allow in a few code points as
exceptions to the rules, then the entire game is lost.
I don't think that concern bears up under scrutiny.

The exceptional cases, as far as identifiers and IDN
labels are concerned, were all added to the standard
long ago, and are exceptional in part because they have
some kind of odd legacy status. But they are easy to
identify, and there aren't large numbers of them.

In fact Rule G - ASCII, in the current document, is
*already* simply an exception list provided for legacy
reasons.

In my derivation of the IDN_Permitted data file, besides
the ASCII exception list, all that I am suggesting needs
to be added by exception is the list:

000B7 gc=Po sc=Zyyy MIDDLE DOT
005F3 gc=Po sc=Hebr HEBREW PUNCTUATION GERESH
005F4 gc=Po sc=Hebr HEBREW PUNCTUATION GERSHAYIM
0200C gc=Cf sc=Qaai ZERO WIDTH NON-JOINER
0200D gc=Cf sc=Qaai ZERO WIDTH JOINER
03007 gc=Nl sc=Hani IDEOGRAPHIC NUMBER ZERO
030FB gc=Po sc=Zyyy KATAKANA MIDDLE DOT

And that is certainly a small enough list that we can
examine each case in whatever detail we choose before
committing to a particular exception.
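
The shape of that derivation (generic rules first, then a short
explicit exception list) can be sketched in Python as follows. This
is only an illustration under stated assumptions: the set of general
categories in GENERIC_CATEGORIES is a simplified stand-in for the real
generic rules, the ASCII exception list is left out, and the names
generic_rule and idn_permitted are hypothetical. Only the seven
exception code points come from the list above.

```python
import unicodedata

# The seven exceptional code points listed above.
EXCEPTIONS = frozenset(
    {0x00B7, 0x05F3, 0x05F4, 0x200C, 0x200D, 0x3007, 0x30FB}
)

# Assumption: a simplified stand-in for the generic rules, admitting
# only letters, marks and decimal digits by general category.
GENERIC_CATEGORIES = frozenset({"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"})


def generic_rule(cp: int) -> bool:
    """True if the simplified generic rule alone admits cp."""
    return unicodedata.category(chr(cp)) in GENERIC_CATEGORIES


def idn_permitted(cp: int) -> bool:
    """Generic rule, plus the explicit exception list."""
    return generic_rule(cp) or cp in EXCEPTIONS


for cp in sorted(EXCEPTIONS):
    name = unicodedata.name(chr(cp))
    print(f"U+{cp:04X} generic={generic_rule(cp)} "
          f"permitted={idn_permitted(cp)} {name}")
```

Under this sketch, adding one more exception later means adding one
entry to EXCEPTIONS; callers of idn_permitted do not change.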

Also, I can guarantee you that 100% of the characters
next added to Unicode for Unicode 5.1 will be handled by
the generic rules, and will not require consideration
for exceptional handling. Nobody is going to have to
sit around in committees examining all 2000+ additions
one by one to see if each is a candidate for adding
to that elite list of seven there.

Is that an absolute, 100.0000% guarantee? Well, no.
We're human, and this is a human enterprise, and maybe
*something* will get added in the future that requires
yet another exception. But if and when that happens,
the UTC simply adds it to the IDN_Permitted list and
explains one more exception to the derivation rules
that generate IDN_Permitted. No change whatsoever
would be required for IDNA specifications making use
of IDN_Permitted (in stringprep or whatever else).

--Ken