Archaic scripts -- the Battle of Examples
John C Klensin
klensin at jck.com
Sat May 10 21:49:50 CEST 2008
--On Friday, May 09, 2008 1:37 PM -0700 Kenneth Whistler
<kenw at sybase.com> wrote:
> John said:
>> > IMO, inclusion of cuneiform (and many other long-dead,
>> > ancient scripts for archaic writing systems) in IDNs is
>> > just silly.
>> We are having a battle of examples. You (and Michel) are
>> picking examples that no one expects to see in IDNs and that,
>> on a script by script basis, no one cares about seeing in
>> IDNs. Cuneiform scripts or Linear-B (or, for that matter,
>> Linear-A) are clearly in that category.
> Then I fail to see why we are having this argument. If
> no one cares about seeing these in IDNs, then why have
> we withdrawn the historic script exclusion clause from
> the table derivation?
> If people here want to make cases that *some* of the historic
> scripts that were listed in that clause (and in Table 4
> of UAX #31) are more problematical and that there is reason
> to *include* them in the listing of PVALID code points,
> then let's have that discussion, instead.
We may be getting close to converging (or at least understanding
why we are having this discussion). At least I hope so.
Let's assume that there are three classes of scripts that are
candidates for having their letters and numbers be
Protocol-Valid:
(1) Clearly "yes". This group consists of contemporary scripts
with large numbers of active users and speakers of the relevant
languages.
(2) Clearly "no". These scripts have been used by no one other
than scholarly investigators for thousands of years and, for
substantive reasons (such as the examples about cuneiform and
clay tablets), are exceedingly unlikely ever to be used (by
anyone else) in a meaningful way again.
(3) A gray area that is neither "clearly yes" nor "clearly no".
Now, I believe we have two separate substantive issues. The
first is what should be in that gray area and the second is what
to do about it.
The second of these is tied up with how significant we expect
moving a script (a whole script, or at least the letters and
digits in it) from DISALLOWED to Protocol-Valid to be. If we
conclude that it should be (and can be) very easy, then it
really doesn't make much difference what we do about the first.
Put differently, if making the move is easy, then we should
disallow all of the "clearly no" scripts _and_ all of the gray
area scripts until and unless someone comes forward to make a
strong case for permitting them.
If we conclude that such moves are a fairly big deal, then the
answer is different. It then makes much more sense to identify
the gray area carefully and then to allow all of the scripts in
it while recommending that zone administrators be very careful
about decisions to permit them.
I'm clearly a member of the "big deal" camp. I'm anxious to see
DISALLOWED characters prohibited at lookup time. I believe that,
even were the WG to conclude that they should not be, at least
some implementations would check and prohibit them anyway as a
means of protecting users from evil that might dwell in unknown
(or known to be problematic) territory. I recognize the many
comments that have been made about how often updates occur in
practice and how long they take. Even if one ignores those
systems issues, getting critical mass for an IETF revision of
IDNA rules and tables to accommodate a script that was
previously classified as archaic and excluded is inevitably
going to be problematic. I think users will press for (and
install) updates for characters that are relevant to them and,
conversely, if a particular user can't read or display Klingon,
she may not care very much if a Klingon URI is not resolvable.
And, following reasoning I've used elsewhere, I believe that, if
some national entity says "support that script in IDNs or don't
bother thinking about doing any further business in our
country", some vendors will update their
applications to permit the relevant script in a big hurry...
even if IDNA says "no, it is archaic and prohibited" (creating
interoperability problems). But I don't see an easy and rapid
transition if applications are rejecting labels that contain
DISALLOWED characters and we suddenly decide that the script
containing those labels should be allowed after all.
So, to me, I want to see the letters and digits of any scripts
about which there is a plausible scenario for a demand for IDNA
use (and no evidence of harm) put into the "gray area" list and
then handled as "Protocol-Valid but registries advised to take
caution". _Or_ I want to see an extra category-group created
such that these scripts are prohibited for registration but
permitted (not explicitly checked for and prohibited) on lookup.
That category is essentially the old "MAYBE NO", with different
criteria. It differs from "Protocol-Valid but registries
advised..." in that it would contain a clear prohibition on
registration (and hence more guidance for application/resolver
implementers about what to expect) rather than general advice
for zone administrators to exercise caution but do whatever they
want. I assume no one wants to discuss that new, MAYBE NO-like,
category, but I would be happy to think more about it if I'm
wrong and people do.
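To make the distinction concrete, here is a rough sketch of how
the MAYBE NO-like category would differ from "Protocol-Valid but
registries advised to take caution" at the two enforcement
points. The category names and functions are my own illustrative
inventions, not taken from any draft or specification:

```python
# Hypothetical categories for illustration only.
PVALID = "PVALID"                  # ordinary Protocol-Valid
PVALID_CAUTION = "PVALID-CAUTION"  # valid, but registries advised to be careful
MAYBE_NO = "MAYBE-NO"              # prohibited for registration, allowed on lookup
DISALLOWED = "DISALLOWED"          # prohibited, and checked/rejected at lookup

def registry_may_register(category: str) -> bool:
    """Registration-time rule: MAYBE-NO labels are clearly refused,
    giving application/resolver implementers firm guidance."""
    return category in (PVALID, PVALID_CAUTION)

def resolver_may_look_up(category: str) -> bool:
    """Lookup-time rule: only DISALLOWED labels are rejected by
    applications; MAYBE-NO passes through unchecked."""
    return category != DISALLOWED

# A MAYBE-NO script can later be promoted without breaking deployed
# resolvers, because those resolvers never rejected it at lookup time.
assert not registry_may_register(MAYBE_NO)
assert resolver_may_look_up(MAYBE_NO)
```

The point of the sketch is the asymmetry: the registration check
is strict while the lookup check stays permissive, so a later
reclassification requires no change to deployed lookup code.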
Now, given those constraints, what do I think belongs in the
gray area? The answer, to me, isn't fine-tuning Table 4 of UAX
#31, if only because we would need some clear rules about
criteria for future decisions (and if we had those rules, there
would be no gray area); nor is it a standing IETF committee to
negotiate with UTC about that table (if only because we don't do
standing committees well). At the moment (subject to
refinement and more understanding and persuasion) my criterion
for the gray area would put a script there if it met either of
the following criteria:
* Active use within the last several hundred years
(as distinct from no use for thousands). Again, we are
agreed that use that is strictly scholarly-historic
doesn't count, but, as a protective measure against
needing to make many moves from DISALLOWED to
Protocol-Valid and the possibility that the Internet and
communication within communities of interest might
actually reverse old trends, I'd like to say "jury still
out" (or "can't be sure yet") on scripts that have
fallen out of use in the last few hundred years.
* An identifiable active advocacy community that can make
a claim for IDNA use now.
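The two tests above can be sketched as a small either-of check.
Everything here is illustrative: the data model, the attribute
names, and the 300-year threshold (one reading of "last few
hundred years") are my own assumptions, not proposed rules:

```python
from dataclasses import dataclass

@dataclass
class Script:
    name: str
    years_since_active_use: int  # non-scholarly use; scholarly use doesn't count
    has_active_advocacy: bool    # identifiable community claiming IDNA use now

def in_gray_area(s: Script) -> bool:
    # Either criterion suffices: use within roughly the last few
    # hundred years, or an active advocacy community today.
    return s.years_since_active_use <= 300 or s.has_active_advocacy

# Runic saw some non-scholarly use into recent centuries; cuneiform
# has had none for thousands of years.
runic = Script("Runic", 200, False)
cuneiform = Script("Cuneiform", 2000, False)
assert in_gray_area(runic)
assert not in_gray_area(cuneiform)
```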
FWIW, something like Runic would end up in the gray area under
either rule. Cuneiform and Linear B would not, again under
either of them.
Two closing comments:
-- The African scripts that I'm concerned about aren't the ones
that are identified and already coded into Unicode. They are
pre-colonial scripts for existing languages that are in the
middle of controversies about whether they are real writing
systems or interpretative pictographs, with a lot of issues
about cultural identity and political correctness contaminating
those discussions. Now, perhaps we don't need to worry about
them. If they really are writing systems and are coded into
Unicode in the future, their code points are UNASSIGNED today
and perhaps we can count on the politics to keep them out of
Table 4 when they are coded. Or perhaps not.
-- If one moved that "few hundred year" criterion forward to,
say, now, then there are several languages and the associated
single-language scripts that ought to be excluded on the grounds
that they are close enough to "extinct" (a few dozen living
speakers, all of them elderly, and even fewer people reading and
writing them) to be irrelevant to IDNs. I don't think it wise
to go there but, if our criteria are "extinct" (by some
definition) and hence not relevant to or appropriate for IDNs, I
don't see much way to avoid it if we are going to be consistent.