Casefolding Sigma (was: Re: IDNAbis Preprocessing Draft)

John C Klensin klensin at jck.com
Mon Jan 28 21:23:37 CET 2008



--On Monday, 28 January, 2008 11:25 -0800 Michel Suignard
<michelsu at windows.microsoft.com> wrote:

> John,
> 
>> From: John C Klensin
>> 
>>> --On Saturday, 26 January, 2008 18:57 +0900 Martin Duerst
>>> <duerst at it.aoyama.ac.jp> wrote:
>> 
>> 
>>> Are such short codes being considered at all? In parallel
>>> with full names? Or did you mean that Cypros is looking
>>> at getting the equivalent of Cs in Greek for their TLD?
>> 
>> Or the equivalent of Cypros in Greek characters for their
>> TLD. If you want to get seriously depressed (or amused,
>> depending...) you should examine the range of proposals for
>> IDN TLD names and how to formulate them that have been
>> circulated around ICANN.
> 
> Having written one of these proposals, I beg to disagree
> somehow. Using the iana list of the active cctlds, I took the
> country names in their native writing systems and was
> reasonably successful. The issues I had were:
> - lack of capital forms, but that is inherent of the lowercase
>   bias taken by IDN,
> - Cyprus, issue of the ending sigma, which has been discussed
>   a lot here
> - Maldives, because it ends with a combining mark
> - Sri Lanka, needs ZWJ to be correctly represented,
> - Myanmar, imperfect representation with Unicode 3.2 and
>   therefore IDNA2003
> 
> Except for the Cyprus case, it is my understanding that the
> work as currently proposed in the new IDN200x would fix these
> issues.

Michel, 

It does and it doesn't.  First, among the various proposals,
yours is probably the most plausible from a technical
standpoint.  It may be among the least plausible from a
political one because countries really don't like being told
what to call themselves.  Some are, of course, more vocal on
that point than others and some are happier with the 3166-1
names (which are sometimes not very good translations to either
English or French of what they are called in their own
languages) than others.

As soon as we get to "names of things", especially countries,
even that inherent lower-case bias (to use your term) turns out
to be unacceptable to some people.  For people who want to
insist on orthographic correctness of those names, writing a
country name entirely in lower-case, or getting a lower-case
form back after an upper-case one has been provided on input,
has got to be at least as problematic as whatever is done about
the far rarer issue of position-sensitive forms such as the
final sigma.
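
To make the comparison concrete, here is a minimal sketch of what
a purely client-side mapping does to the Greek name for Cyprus.
It assumes that full Unicode case folding (Python's str.casefold)
is a fair stand-in for the mapping step; the real IDNA2003
preprocessing does more than this, and the fold_label helper is
hypothetical, not part of any IDNA library.

```python
def fold_label(label: str) -> str:
    """Hypothetical client-side mapping: Unicode case folding only."""
    return label.casefold()


# "Kypros" in Greek, written with escapes so the code points are
# unambiguous: capital kappa, upsilon with tonos, pi, rho, omicron,
# final sigma (U+03C2).
typed = "\u039a\u03cd\u03c0\u03c1\u03bf\u03c2"

folded = fold_label(typed)

# Matching works: the all-capitals spelling folds to the same string.
assert fold_label(typed.upper()) == folded

# But the folded form is not what was typed: the capital is gone, and
# the final sigma (U+03C2) has become a medial sigma (U+03C3).
assert folded != typed
assert "\u03c2" not in folded
assert folded.endswith("\u03c3")
```

Both losses come from the same step, and neither form can be
recovered from the folded string alone, which is why the two
complaints are hard to treat differently.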

I think we all need to accept the fact that there are some
things we can't do by substituting client-side mapping for
server-side matching.  Once we do that, we need to accept the
consequence that some of the edge cases (and even some
mainstream cases, like capitalization of titles) cannot be
handled in an optimal way.  Many of those sub-optimal cases will
subject us to accusations that we are trying to fit people's
languages into a Procrustean bed whose dimensions are dictated
by the technology we have chosen.

Unfortunately that accusation is true. We ultimately cannot
satisfy all linguistic requirements within the framework of the
DNS and IDNA (or any other entirely-client-side solution).   For
situations in which we need to be uncompromising about those
requirements, we need to move "above", or otherwise outside of,
the DNS.  Otherwise, this is all a matter of tradeoffs and
compromises: trying to arrive at solutions that work well in the
long term, acceptably in the short term, and that don't cut any
plausible language-based set of mnemonics out entirely.  After
that, it is about education -- both about how the system works
and about what it cannot be expected to do -- not technology.

best,
   john



