UTC Agenda Item: IDNA proposal

Thu Nov 23 02:03:37 CET 2006

--On Thursday, 23 November, 2006 13:20 +1300 Sam Vilain
<sam.vilain at catalyst.net.nz> wrote:

> One other question - how does the table for CJK/Han characters
> compare with the tables referred to in RFC3743 and RFC4713?

Consistent as far as we know, and I've asked the relevant folks
to cross-check it.  This won't go forward until any
incompatibilities in that area have been either resolved or
very clearly identified and explained.  I'm predicting
"resolved".

> I really think that you need to get linguists on the case here
> from around the globe, make sure they really understand the
> homograph issue, and get them to approve the tables for
> individual languages, along with providing a list of example
> words for that language demonstrating a representative portion
> (or, where possible, complete coverage) of the characters that
> are necessary.

If only because I'm in a more convenient time zone than Patrik
and because I seem to have fallen into the role of the person
who gets blamed anyway :-(...

I, and I think we, agree.  The problem is that, at least
sometimes (I'd suggest "often" from my experience), when one
encounters the serious linguists who study a particular language
and its writing system from the perspective of use in an area
where that language dominates, they aren't very interested in
character-coding issues in particular and, often, computers in
general.

That situation often leads to one in which their first
encounters with this work lead to reactions that, at their most
extreme and adjusted for local cultural conventions, come out
sounding a lot like "Unicode is a complete botch and whoever
advised on my language was an incompetent fool".  Now that
reaction is, in my experience, almost never justified.  But it
requires some time and effort to get past and often leads to
places where no one really wants to go (e.g., demands to decide
which of a pair of "identical" characters that are assigned
separate code points should be excluded or insistence on
normalizations that are not supported by Unicode (often because
they are extremely language-specific)).  

The reason that most of the Indic scripts have been
_temporarily_ excluded is that the Indian Government and several
collections of the right sort of linguists are examining this
work and, unlike some of the generalizations above, doing so
with relatively full understanding of the constraints involved.
We would like them to advise us (and, ideally, UTC) on the right
set of handling requirements to meet their needs and those of the
DNS, so that rules can be established based on their advice,
rather than making up rules and then trying to fit either the
rules or their advice into each other in a style that would make
Procrustean beds feel comfortable and flexible.

> Ideally this would come down to each country to arrange, but
> how many of them are aware of this discussion group?  Ok, it's
> not being held four light years away on display in the bottom
> of a locked filing cabinet stuck in a disused lavatory with a
> sign on the door saying "Beware of the Leopard".  But,
> considering the language, social and socio-economic barriers
> involved, it may as well be.  How will the innovators in those
> countries feel when they eventually start looking at using
> their own script for domain names, but are told it won't work
> with all standards conformant browsers on the planet, because
> the people drafting the standards did not take the time to ask
> them how their script works?  This is I18N, not I13N¹...

For many, but certainly not all, languages, we do have some
contact with relevant parties via ccTLD administrations.  And we
are working on it (the more we know before some of us go to
ICANN in a week and a half and start talking with some of
_those_ people, the better off we will all be).

     regards,
        john