New version, draft-faltstrom-idnabis-tables-02.txt, available

Vint Cerf vint at google.com
Thu Jun 21 03:53:41 CEST 2007


It may be naïve of me to say so, but I think Ken makes a practical point
that we should go as far as seems clear and understandable with the general
property rules and then deal with a character by character exception list
where clarity and practicality are served.

Vint

 


Vinton G Cerf
Chief Internet Evangelist
Google
Regus Suite 384
13800 Coppermine Road
Herndon, VA 20171
 
+1 703 234-1823
+1 703-234-5822 (f)
 
vint at google.com
www.google.com
 

-----Original Message-----
From: idna-update-bounces at alvestrand.no
[mailto:idna-update-bounces at alvestrand.no] On Behalf Of Kenneth Whistler
Sent: Wednesday, June 20, 2007 7:23 PM
To: patrik at frobbit.se
Cc: idna-update at alvestrand.no; kenw at sybase.com
Subject: Re: New version, draft-faltstrom-idnabis-tables-02.txt, available

Patrik asked:

> On 20 jun 2007, at 00.46, Kenneth Whistler wrote:
> 
> >> 3007          ; Han # Nl       IDEOGRAPHIC NUMBER ZERO
> >
> > Appropriate for inclusion. This was separately discussed earlier on 
> > the list. The ideographic number zero is used with the Han 
> > ideographs for numbers to spell out numbers in radix 10 in Chinese 
> > and Japanese (as opposed to the traditional Han number system, which 
> > doesn't use a zero). So for completeness, this character should be 
> > allowed. It needs an exception rule, because gc=Nl are otherwise 
> > omitted. ALWAYS is fine.
> 
> I am not at this stage prepared to add rules that explicitly map to 
> individual codepoints. Is what you say the only way of getting the 
> proper derived value to add such a rule?

Yes.

Well, for many characters you might be able to fiddle with enough properties
to get a derivation in terms of properties, without reference to code
points.

For example: (Script=Han) & (gc=Nl) & (Numeric_Value=0) would result in a
set with exactly one member, namely: {3007}
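
As a rough way to check a property intersection like the one above, here is a minimal Python sketch. It relies on the standard unicodedata module, which exposes General_Category and Numeric_Value but not the Script property, so the Script=Han condition is left out; that omission is an assumption of this sketch. The function name is hypothetical, and the result depends on the Unicode version bundled with the interpreter.

```python
import sys
import unicodedata


def nl_with_value_zero():
    """Return code points with gc=Nl and Numeric_Value=0.

    Assumption: the Script=Han test is omitted because the standard
    library does not expose the Script property.
    """
    matches = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) != "Nl":
            continue
        # numeric() returns the supplied default (None) when the
        # character has no numeric value assigned.
        if unicodedata.numeric(ch, None) == 0:
            matches.append(cp)
    return matches


if __name__ == "__main__":
    for cp in nl_with_value_zero():
        print(f"{cp:04X} {unicodedata.name(chr(cp), '<unnamed>')}")
```

Whatever the script prints, the intersection is only as stable as the underlying property assignments in a given Unicode version.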

But fiddling with ad hoc combinations of character properties simply to
avoid mentioning exceptional code points strikes me as being obscurantist at
best.

What do you think is clearer:

Rule N: Han numeric letters with value of zero.

  (Script(cp)=Han) & (gc(cp)=Nl) & (Numeric_Value(cp)=0) 
       
  The rule is intended to include Han numeric letters
  with the value zero, commonly used in non-traditional
  CJK numeric expressions appropriate in identifiers,
  and will not be changed.
  
or:

U+3007 IDEOGRAPHIC NUMBER ZERO should be included in the
  inclusion set (value ALWAYS).
  
In other words, what I'm saying is that at a certain point trying to
continue finding rules for exceptions to other rules is counterproductive,
and it is easier to simply list a small set of exceptional code points and
be done with it.

The concern you have been expressing, that the entire game is lost if you
give up on the generic rules as the *only* mechanism for defining the
inclusion set and allow in a few code points as exceptions to the rules,
simply doesn't bear up under scrutiny, I think.

The exceptional cases, as far as identifiers and IDN labels are concerned,
were all added to the standard long ago, and are exceptional in part because
they have some kind of odd legacy status. But they are easy to identify, and
there aren't large numbers of them.

In fact Rule G - ASCII, in the current document, is *already* simply an
exception list provided for legacy reasons.

In my derivation of the IDN_Permitted data file, besides the ASCII exception
list, all that I am suggesting needs to be added by exception is the list:

000B7 gc=Po sc=Zyyy MIDDLE DOT
005F3 gc=Po sc=Hebr HEBREW PUNCTUATION GERESH
005F4 gc=Po sc=Hebr HEBREW PUNCTUATION GERSHAYIM
0200C gc=Cf sc=Qaai ZERO WIDTH NON-JOINER
0200D gc=Cf sc=Qaai ZERO WIDTH JOINER
03007 gc=Nl sc=Hani IDEOGRAPHIC NUMBER ZERO
030FB gc=Po sc=Zyyy KATAKANA MIDDLE DOT

And that is certainly a small enough list that we can examine each case in
whatever detail we choose before committing to a particular exception.
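
The "generic rules plus a short exception list" shape described above could be sketched in Python roughly as follows. The exception set holds the seven code points from the list. The generic rule is only a hypothetical stand-in (a handful of General_Category values), not the actual rules in the draft, and the ASCII exception list (Rule G) is left out for brevity. The names EXCEPTIONS, GENERIC_CATEGORIES, generic_rule and is_permitted are illustrative assumptions.

```python
import unicodedata

# The seven exceptional code points from the list, all with value ALWAYS.
EXCEPTIONS = frozenset(
    {0x00B7, 0x05F3, 0x05F4, 0x200C, 0x200D, 0x3007, 0x30FB}
)

# Hypothetical stand-in for the generic property rules; the real
# derivation in the draft is more detailed than this.
GENERIC_CATEGORIES = frozenset({"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"})


def generic_rule(cp):
    """True if the code point passes the simplified property rule."""
    return unicodedata.category(chr(cp)) in GENERIC_CATEGORIES


def is_permitted(cp):
    """Permitted if listed as an exception or accepted by the rule."""
    return cp in EXCEPTIONS or generic_rule(cp)


if __name__ == "__main__":
    for cp in sorted(EXCEPTIONS):
        name = unicodedata.name(chr(cp))
        print(f"{cp:05X} rule={generic_rule(cp)} "
              f"permitted={is_permitted(cp)} {name}")
```

In this arrangement a future exception is a one-line addition to the set, and the rule function stays untouched.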

Also, I can guarantee you that 100% of the characters next added to Unicode
for Unicode 5.1 will be handled by the generic rules, and will not require
consideration for exceptional handling. Nobody is going to have to sit
around in committees examining all 2000+ additions one by one to see if each
is a candidate for adding to that elite list of seven there.

Is that an absolute, 100.0000% guarantee? Well, no. We're human, and this
is a human enterprise, and maybe *something* will get added in the future
that requires yet another exception. But if and when that happens, the UTC
simply adds it to the
IDN_Permitted list and explains one more exception to the derivation rules
that generate IDN_Permitted. No change whatsoever would be required for IDNA
specifications making use of IDN_Permitted (in stringprep or whatever else).

--Ken



_______________________________________________
Idna-update mailing list
Idna-update at alvestrand.no
http://www.alvestrand.no/mailman/listinfo/idna-update
