Archaic scripts (was: Re: New version: draft-ietf-idna-tables-01.txt)

Thu May 8 02:10:05 CEST 2008

--On Wednesday, 07 May, 2008 13:20 -0400 Andrew Sullivan
<ajs at commandprompt.com> wrote:

> On Wed, May 07, 2008 at 10:08:43AM -0700, Michel Suignard
> wrote:
>> 
>> Some of these plane 1 scripts add a huge number of characters
>> that are symbolic/pictorial in nature, because in many cases
>> they have not been fully deciphered. I really don't see the
>> point in allowing them in an identifier scheme such as IDN.
> 
> If I am right in my understanding of what DISALLOWED means and
> what the goals of IDNA2008 are, then I believe the right
> question is not, "What is the point of allowing them in the
> identifier scheme?" but, "What is the harm in allowing them in
> the identifier scheme?"  That is, the "default" position
> should be "allowed in".

I would have said that the "default" position for something that
is generally identified as a letter should be "allowed in"
(although there are lots of other reasons, such as case folding
and compatibility conditions, for excluding letters).  The
"default" for anything that is not a letter or number is "not
allowed".
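
To make that "default" concrete, here is a rough sketch in Python
that derives a starting position from the Unicode General_Category
alone.  The function name and the returned strings are made up for
illustration, and treating numbers the same way as letters is an
assumption; this is not the algorithm in the tables draft, and it
ignores the other exclusion reasons (case folding, compatibility)
mentioned above.

```python
import unicodedata

def default_position(ch: str) -> str:
    """Hypothetical helper: starting position for a single character."""
    cat = unicodedata.category(ch)  # two-letter value such as 'Ll', 'Lo', 'Nd', 'So'
    if cat.startswith("L"):
        # Generally identified as a letter: default is "allowed in".
        return "allowed in"
    if cat.startswith("N"):
        # Assumption: numbers get the same default as letters.
        return "allowed in"
    # Anything that is not a letter or number: default is "not allowed".
    return "not allowed"

if __name__ == "__main__":
    for ch in ("a", "5", "-", "\U00010400"):  # last one is a Deseret letter
        print(f"U+{ord(ch):04X}", unicodedata.category(ch), default_position(ch))
```

A character that passes this check would still have to survive the
other rules; the sketch only shows where the "default" starts.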

Based on a couple of off-list conversations, let me explain what
would immediately change my mind about "default" inclusion of
archaic scripts.  Suppose that what we are really talking about
isn't "archaic" or "historic" or "dead" scripts (terms that may
have slightly different meanings in the normal world) but scripts
that, for lack of a better term, I'll describe as "uncertain".
An "uncertain" script has the property that there is uncertainty
about some of the characters.  That uncertainty might arise
because there is not a continuous chain of literate readers and
speakers or perhaps because there are residual decoding
uncertainties for other reasons.   While I hope it is unlikely
that we have any scripts that are both contemporary and
uncertain, certainly we have some uncertain historical scripts
(see Michel's comment about Linear B as an example).

A corollary of a script being "uncertain" is that the
identification of its characters may not rest on enough consensus
and certainty for us to distinguish reliably between, e.g.,
letters and symbols or punctuation.

Part of the definition originally used for "MAYBE" was "no
language community has so far come forward and identified or
verified rules".   We saw the archaic scripts as going into
"MAYBE NO" precisely because we never expected to see such a
community, but wanted to keep the pain level at a minimum if one
turned up.   The definition of "uncertain" above is much
narrower than the definitions for "MAYBE", but I think the
situation is much the same: "uncertain" implies that we just do
not know enough about the script, or at least about some
characters of the script, to allow it.

Note that some "historical" scripts are not "uncertain".
Deseret, for example, is almost certainly of only historic
interest.  I'd be a lot more surprised to see serious efforts to
use it in IDNs than, say, Phoenician.  But I don't think there
is any uncertainty at all about Deseret's characters or their
classifications (I suspect there isn't any about what Unicode
calls Phoenician either, but let's leave that aside).

Michel's argument seems to focus on what I am calling
"uncertain".  For example, he wrote: "...a huge number of
characters that are symbolic/pictorial in nature, because in
many cases they have not been fully deciphered...".

My conclusion is that, if we can identify them, there is
probably sufficient justification for excluding "uncertain"
scripts from IDNs, at least until and unless they become
certain.  Doing so raises some other issues that we probably
need to address.   One is that I wish that, if there are
significant numbers of characters that are actually uncertain
("not fully deciphered" or equivalent), Unicode had made up a
General_Category value of, e.g., Ln (for Letter_uNsure),
probably with special stability rules, rather than dropping them
into Lo... or that there was another property that expressed
degree of certainty.  Another is that few, if any, of these
scripts are completely undeciphered.  Some characters are
uncertain, other characters are reasonably or completely
certain.   If there are user communities (a different group than
"native first-language speaker communities") for any of these
scripts, we could easily be asked to permit those characters
that are fully understood even while excluding those that are
not.
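
As a sketch of what such a per-character request would amount to,
the Python fragment below checks a label against a hand-maintained
set of "uncertain" code points.  Unicode has no property that
expresses certainty, so the set is purely hypothetical; the example
uses Private Use code points as stand-ins so that it makes no claim
about any real character, and the function name is invented.

```python
def label_permitted(label: str, uncertain: set[int]) -> bool:
    """Hypothetical check: reject a label containing any 'uncertain' code point."""
    return all(ord(ch) not in uncertain for ch in label)

if __name__ == "__main__":
    # Stand-in data only: U+E001 plays the role of a not fully
    # deciphered character; U+E000 and U+E002 play understood ones.
    uncertain = {0xE001}
    print(label_permitted("\uE000\uE002", uncertain))
    print(label_permitted("\uE000\uE001", uncertain))
```

Excluding the whole script instead would replace the set lookup
with a test on the script of each character, which is simpler to
specify but turns away the well-understood characters too.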

If we are going to go down the "exclude uncertain" path, or even
the "exclude historical" one, I think we need to think carefully
about whether it is appropriate to have a category that is
different from "DISALLOWED" in terms of what it takes to move
things from it to "PROTOCOL_VALID".   I don't know if we do or
don't, but I think we are obligated to ask that question in a
serious way.

Michel, one further observation (to save cluttering the list
with a separate note):

These scripts may be "historical".  They may be "archaic".  They
may belong in a script museum somewhere.   The people who used
them as their primary writing system may be long dead.  But the
scripts themselves are not "extinct".   Indeed,  the very fact
that someone felt motivated to get them into Unicode is strong
evidence against "extinct".

best,
   john