Draft on IDN Tables in XML

Wed Mar 14 18:34:50 CET 2012

Ram,

Agreed.  Indeed, we quite deliberately made absolutely no
attempt to define a machine-readable format when the idea of
these tables was introduced.  One reason was that we saw no
chance of getting consensus on a format, partially for the
reasons Ram identifies.  But the more important reason was that
we didn't want to encourage anyone to casually import a table
without studying it and, indeed, having enough script (and
language?) expertise around to be sure that they understood it.  

I think many of us can remember the degree of upset and concern
that arose when one registry decided to allow a script that they
didn't understand well because they knew that a language that
used the script was used in that country.  It is not clear
whether importing a table that was equally poorly understood would
make that situation worse, but it seems to me it would be very
unlikely to make it better.

Jaap's recollection is correct and a machine-readable form would
be useful in some cases.  However, the risk of mindless import
of tables actually creates a tradeoff between "...Having a
machine readable format that allows the tables to be imported
and repurposed aids this greatly" and the possibility of real
harm.  

My sense would be that, especially if the intent is to capture
not just character lists but variant concepts, we would be much
better off with a list of characters and explanations.   I
believe that the character lists and explanations that we would
be likely to see fall into three categories:

(1) Rather short lists, under a hundred or so characters.   This
would be the case for most alphabetic scripts, including Latin.
It is not clear that a standardized table format helps with that
case because anyone trying to import a table is really going to
need considerable explanation about what is going on (e.g., the
ability to represent "oe" as a variant of o-dieresis does not
imply that importing that relationship would be wise). 

(2) Lists that involve a great deal of complexity because of
concerns about look-alike characters, characters that are
sometimes used interchangeably, or distinctly different uses
(and rendering styles) for the same script.   All three of those
situations have been illustrated in the ASIWG work on Arabic
(some of which is incorporated into the Arabic variant
information project team), but I have every confidence that
there are other scripts out there with one or more of those
issues (some would argue that the Japanese - Chinese situation
poses problems not much different from the [Western] Arabic -
Perso-Arabic one).  The complexity of those situations cannot be
easily represented in a simple table, whether in XML or
otherwise.

(3) Chinese (Han) script.   The tables there are large enough to
make automatic import useful, but the CDNC table requires a
specialized format because of the paired preferred variants and,
as we know, neither Japanese nor Korean requires variant
treatment (although they may have other issues).  It would make
lots of sense to have a standardized machine-readable format for
the Chinese use of Han script, but that involves a different set
of issues than such a format for other scripts and may or may
not be the correct format for a Japanese or Korean table.  
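
To make the contrast between case (1) and case (3) concrete, here
is a minimal sketch in Python.  Everything in it is an assumption
made for illustration: the names (LATIN_REPERTOIRE, HanEntry,
HAN_TABLE, preferred_forms) are hypothetical, the two Han rows are
a toy sample, and the layout is not the CDNC format or any
published one.  It shows only that a short alphabetic list fits in
a flat set, while a Han table with paired preferred variants needs
a richer record for each code point.

```python
from dataclasses import dataclass

# Case (1): a short alphabetic repertoire fits in a flat set.
# Illustrative subset only, not any registry's actual policy.
LATIN_REPERTOIRE = frozenset("abcdefghijklmnopqrstuvwxyz0123456789-")


@dataclass(frozen=True)
class HanEntry:
    """One row of a hypothetical Han table with paired preferred variants."""
    preferred_simplified: str
    preferred_traditional: str
    other_variants: frozenset = frozenset()


# Case (3): each code point carries a simplified/traditional pair.
# Two toy rows (U+56FD and U+570B); real tables run to thousands.
HAN_TABLE = {
    "\u56fd": HanEntry("\u56fd", "\u570b"),
    "\u570b": HanEntry("\u56fd", "\u570b"),
}


def in_repertoire(label, repertoire):
    """True if every character of the label is in the flat repertoire."""
    return all(ch in repertoire for ch in label)


def preferred_forms(label, table):
    """Return (simplified, traditional) forms of the label, or None
    if any character is missing from the table."""
    if any(ch not in table for ch in label):
        return None
    simplified = "".join(table[ch].preferred_simplified for ch in label)
    traditional = "".join(table[ch].preferred_traditional for ch in label)
    return simplified, traditional
```

Note that nothing in either structure records why a relationship
exists or whether adopting it would be wise, which is exactly the
explanation a registry would still have to read and understand.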

As a general comment, it seems to me that people on this mailing
list and the technical and script communities have repeatedly
told ICANN, its staff, and various "constituencies" that Han
script is fundamentally different from the collection of
alphabet-phonetic scripts and that trying to treat them as the
same just yields one problem after another (for one group or the
other).  That advice has been consistently ignored in ICANN's
efforts, including this one.  I don't know if the reason is
determined ignorance (which I certainly would not expect from
Kim) or that the advice is politically inconvenient but, from my
point of view, every decision made on the basis that they were
really the same just makes ICANN look silly and puts the
predictability and stability of the Internet at risk.  

Certainly it would be more convenient if they were really the
same --same variant issues, same order of magnitude of relevant
characters, same relationships of "character" to sounds and
meanings, and so on-- but the odds of everyone who thinks they
would like one being awarded a pony and the wherewithal to keep
it are much higher.

best,
   john

--On Wednesday, March 14, 2012 10:45 -0400 Ram Mohan
<rmohan at afilias.info> wrote:

> Kim,
> 
> I am not certain registries would want to use an
> automated/machine-readable mechanism for importing tables from
> other registry IDN implementations. Implementations vary
> widely from one registry to another, and each IDN implementation
> often requires hand-verification of allowed code points
> combined with business policies.
> 
> 
> 
> Let's try an example:
> 
> CDNC regularly publishes and updates the set of valid
> codepoints in the Han script, along with contextual and
> variant generation rules/guidelines. Publishing that set of
> rules in this rich format would be useful; however, registries
> will still need to manually verify both the codepoints and the
> rules for variant generation, and create business rules for
> allocation and delegation of applied for strings.
> 
> 
> 
> Feels like a nice to have, not a must have.
> 
> 
> 
> -Ram
> 
> 
> 
> --------------------------------------------------------------
> 
> Ram Mohan
> 
> (o) +1.215.706.5700 x103  (m) +1.215.431.0958  (f)
> +1.215.706.5701
> 
> rmohan at afilias.info | Skype: gliderpilot30 | Twitter @rmohan123
> 
> --------------------------------------------------------------
> 
> 
> 
> *From:* Kim Davies [mailto:kim.davies at icann.org]
> *Sent:* Monday, March 12, 2012 9:24 AM
> *To:* Jaap Akkerhuis
> *Cc:* idna-update at alvestrand.no; Dillon, Chris; Abdulrahman I.
> ALGhadir; vip at icann.org
> *Subject:* Re: Draft on IDN Tables in XML
> 
> 
> 
> Hi Jaap,
> 
> 
> 
> On Mar 12, 2012, at 3:34 AM, Jaap Akkerhuis wrote:
> 
> 
> As far as I know, the idea of the tables has always been to
> provide a public central place where the registries could list
> which characters they support and which they don't in their
> registration policies. It is a "for your information only"
> registry and meant for human consumption.
> 
> And Kim, do correct me if this isn't the case anymore.
> 
> 
> 
> It is certainly correct that the notion of tables is to
> publicly share registry policy as it relates to code points
> that are accepted for registration. I think, however, it is a
> bit beyond merely informational, as a key driver in publishing
> them has been to allow sharing and re-use by other registries.
> Having a machine readable format that allows the tables to be
> imported and repurposed aids this greatly.
> 
> 
> 
> One of the reasons I feel this is a good initiative is an
> increasing number of tables appear to be published as PDF
> files with various contextual rules described in paragraphs of
> normative text. It would be nice to reverse this trend and
> have a format rich enough that it can express most if not all
> registry policies in a common way using a set of agreeable
> primitives.
> 
> 
> 
> kim