New version, draft-faltstrom-idnabis-tables-02.txt, available

Thu Jun 14 12:04:15 CEST 2007

--On Thursday, June 14, 2007 10:24 +0100 Gervase Markham 
<gerv at mozilla.org> wrote:

> Harald Alvestrand wrote:
>> Can you recommend specific scripts that you think should have
>> the  "Stable" status?
>
> No; it's not my area of expertise. I comment merely as an
> implementor, for whom the current list looks concerning.
>
>> The fact that the CJK scripts are in MAYBE YES is probably
>> the biggest  contributor to the sheer number of characters
>> there. But I have no idea  whether there are known issues
>> with them that should be solved first.
>
> Forgive my ignorance, but isn't this what RFC 3743 addresses?

Gerv,

The principle here is extreme caution, at least in the very near 
term.  The one thing we must not do if we are going to have any 
long-term stability at all is to put something into the "ok, 
permitted" category and then learn something significant and 
drop it back into "maybe" or "never".  There isn't any room 
for a "never mind" or "whoops" category.

My own guess about CJK, as a close observer of the processes, 
work, and thinking that produced RFCs 3743 and 4713, is that CJK 
is perfectly safe if the model of 3743 is followed, using the 
tables that China explained and published with 4713 together 
with the Japanese and Korean counterparts to those tables.

But that raises three problems:

(i) like you, "we" are not willing to stand up and say that we 
understand those languages and the script well enough to say 
"this is sufficiently defined and ok".  We don't have the 
language expertise and knowledge, and hence need that assertion 
to come from the relevant community after it has checked the 
3743 model and tables against the new IDNA[bis] model.

(ii) at present, Patrik's table-generating principles are trying 
to work with CJK as a single script because that is the only 
handle that the Unicode properties and organizational structure 
provide.  RFC 3743 is a registration-side overlay that implies 
subsets of that script -- different subsets for Chinese, 
Japanese, and Korean.  We haven't figured out how to talk about 
that yet, e.g., to say "CJK is just fine as long as one follows 
the 3743 model, knows the language context at registration time, 
and has appropriate (linguistically and in terms of 
minimal-confusion) subset tables for use with 3743, but more 
general use of CJK is still 'maybe' at best".

I think such a statement is probably true, but am not expert 
enough to assert it (see (i), above), and, as I said, we haven't 
figured out how to assert that sort of subset/conditional rule 
yet.  Perhaps we just need to say that, while a complete script 
might be "maybe", carefully-thought-out subsets of that script, 
handled so that confusables are controlled, might be just fine. 
Your opinion (and that of others) on that would be welcome.
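
To make the subset idea a little more concrete, here is a minimal 
Python sketch of a registration-side check in which the language 
context is known and the label must stay inside that language's 
table.  Everything in it is hypothetical: the table contents are 
placeholders and are not drawn from RFC 4713 or any real registry 
table, the function name is invented, and the variant handling 
that 3743 also describes is left out entirely.

```python
# Hypothetical per-language subset tables.  The characters are
# placeholders chosen for illustration only; they are NOT taken from
# RFC 4713 or from any real registry table.
LANGUAGE_TABLES = {
    "zh": frozenset("中文国"),
    "ja": frozenset("日本語国"),
}


def subset_registration_ok(label, language):
    """Return True if every character of label is in the subset table
    registered for the given language tag (assumed name and rule)."""
    table = LANGUAGE_TABLES.get(language)
    if table is None:
        # No subset table for this language: the script as a whole is
        # still only "maybe", so this sketch refuses the registration.
        return False
    return len(label) > 0 and all(ch in table for ch in label)


print(subset_registration_ok("中文", "zh"))    # label inside the zh subset
print(subset_registration_ok("日本語", "zh"))  # label outside the zh subset
```

The point of the sketch is only that the decision depends on the 
(language, label) pair rather than on the script as a whole.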

(iii) It is also important to understand these categories in 
terms of how they are used.  From your standpoint as an 
implementer of a browser (or other applications that look up 
these names) there is probably no practical difference between 
"permitted", "maybe yes", and "maybe no".  You are expected to 
look up a string containing any of those characters to see if it 
resolves.  You might want to use the categories as indicators of 
strings that you want to alert the user about in some way, but 
they should not affect what you look up.  Registries are 
expected to stay much closer to the "permitted" list.  See my 
note on this list of Tuesday, June 12, 2007 18:01 -0400 for more 
on this; I won't repeat it here but will happily forward you a 
copy if it slipped past you.
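
As a rough illustration of the distinction in (iii), here is a 
minimal Python sketch.  The category names come from the discussion 
above; the toy table, the function names, the treatment of unlisted 
characters as "never", the refusal to look up "never" characters, 
and the strict "permitted only" registry rule are all assumptions 
made for the example, not anything taken from the draft.

```python
from enum import Enum


class Category(Enum):
    PERMITTED = "permitted"
    MAYBE_YES = "maybe yes"
    MAYBE_NO = "maybe no"
    NEVER = "never"


# Hypothetical toy table; real tables would be generated from Unicode
# properties and would be far larger.
TOY_TABLE = {
    "a": Category.PERMITTED,
    "b": Category.MAYBE_YES,
    "c": Category.MAYBE_NO,
    "!": Category.NEVER,
}


def categories_of(label, table=TOY_TABLE):
    """Set of categories present in label.  Characters missing from
    the table are treated as NEVER (an assumption of this sketch)."""
    return {table.get(ch, Category.NEVER) for ch in label}


def lookup_allowed(label):
    """Lookup side: permitted, maybe yes and maybe no are all looked up."""
    return Category.NEVER not in categories_of(label)


def lookup_should_warn(label):
    """Optional hint for the user; it never changes what is looked up."""
    cats = categories_of(label)
    maybe = {Category.MAYBE_YES, Category.MAYBE_NO}
    return Category.NEVER not in cats and bool(cats & maybe)


def registration_allowed(label):
    """Registry side, simplified here to 'permitted characters only'."""
    return categories_of(label) == {Category.PERMITTED}
```

With this toy table, "abc" would be looked up (perhaps with an alert 
to the user) but would not be accepted for registration, while "aa" 
passes both checks.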

Regards,
    john