Tonus (was: Re: Casefolding Sigma (was: Re: IDNAbis Preprocessing Draft))

Wed Jan 30 13:09:50 CET 2008

Dear John,

Thank you for your answer, although I see an attempt on your side to
equate two mistakes: the tonus issue, which made our life hard, and the
casefolding of sigma as it was in IDNA2003, which made life easier.

As you can understand, an outside viewer would propose to fix the mistake
that makes life hard and leave alone the mistake that makes life easy. Instead,
you propose to make life even harder. I am afraid this is not easily
acceptable.

To present to you some facts, today we have ~9,500 Greek IDNs in our
database. Of these IDNs 3,144 have the final sigma and they actually depend
on it.

My name in Greek is Βαγγέλης Σεγρεδάκης and I would like for myself the
domain name Βαγγέλης-Σεγρεδάκης.gr. You propose I should instead register
the domain name Βαγγέλησ-Σεγρεδάκησ.gr which is my name misspelled, or even
worse Βαγγέλη-Σεγρεδάκη.gr, since the final sigma will not be available for
registration anymore. Most male names have the final sigma though.

Could you please explain to me if you personally would accept to register
the domain Joh-C-Klensi.com if some person or a committee of a protocol
decided that a special character should be abandoned for simplicity?

I would welcome your help to solve this before it becomes a major issue.

Best Regards,

Vaggelis Segredakis

-----Original Message-----
From: John C Klensin [mailto:klensin at jck.com] 
Sent: Monday, January 28, 2008 9:10 PM
To: Vaggelis Segredakis; patrik at frobbit.se
Cc: idna-update at alvestrand.no
Subject: Re: Tonus (was: Re: Casefolding Sigma (was: Re: IDNAbis
Preprocessing Draft))

Dear Vaggelis,

First of all, thanks for adding some local perspective and
sanity to this discussion.  I don't know if Patrik's views on
this will be the same as mine, but, since we have been working
on the IDNAbis definitions together, let me try to give at least
one view of the situation.

Some decisions were made in IDNA2003 that, in retrospect, were
very unfortunate.   As Paul Hoffman indicated in a recent note,
it is not useful to try to assign blame for that situation, but
we should accept that we have learned a great deal in the last
several years, including learning about problems we could not
have understood without some IDN deployment experience.   We are
trying to fix those where we can in IDNA200X, but some of the
possible fixes are constrained by the earlier decisions and the
problems that would occur if we tried to make an incompatible
change (or moved to a different prefix with different coding
rules).  There is some discussion on this subject in the version
of draft-klensin-idnabis-issues-06 that was posted a short time
ago and there will be more in a note to the list that I'm now
working on.

Specific comments on these particular issues below.

--On Monday, 28 January, 2008 17:12 +0200 Vaggelis Segredakis
<segred at ics.forth.gr> wrote:

> 
> Dear Patrik,
> 
> I would like to comment on the emails of this list on the
> issue of the casefolding sigma.
> 
> In the initial protocol for .IDN domain names some of your
> colleagues chose to implement solutions that are against the
> normal use of the Greek language in the domain name
> identifiers. In Greek a word in lower case letters has the
> sign "tonos" on the accented letter of the word (e.g.
> δοκιμή). However, when this word is written in capital
> letters we do not use tonos anymore (δοκιμή becomes
> ΔΟΚΙΜΗ). Let's put this example in the xn-- form of a
> domain name: you get "xn--jxalpdlp.gr" for the domain name
> "δοκιμή.gr" but you get "xn--pxagfdlp.gr" for the domain
> name "ΔΟΚΙΜΗ.gr".

The various mapping rules that contributed to the Nameprep/
Stringprep tables were intended to prevent situations in which
lower and upper case strings --strings that users of the
relevant script consider the same-- produce different punycode
strings.  The discussion above shows that the
model failed for Greek.   There is controversy about how
frequently such differences arise, but it is clear that there
are examples in several languages.  The problem with applying
such mappings the way that Nameprep anticipated is that they
need to be 100% right or users end up feeling that their
languages have been mishandled or are astonished by the results. 
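
The Greek example can be reproduced with a minimal sketch, assuming
Python's built-in "idna" codec, which implements the IDNA2003
operations (Nameprep plus Punycode).  The variable names are
illustrative only; labels are encoded without the ".gr" suffix.

```python
# Sketch: Python's built-in "idna" codec follows IDNA2003 (Nameprep + Punycode).
import encodings.idna

lower_with_tonos = "δοκιμή"   # lower case, with tonos
upper_no_tonos = "ΔΟΚΙΜΗ"     # upper case, written without tonos

# Nameprep case-folds the upper-case form but cannot restore the tonos.
folded = encodings.idna.nameprep(upper_no_tonos)

# The two spellings therefore end up as two different A-labels.
a_label_lower = lower_with_tonos.encode("idna")
a_label_upper = upper_no_tonos.encode("idna")

print(folded)          # δοκιμη
print(a_label_lower)   # b'xn--jxalpdlp'
print(a_label_upper)   # b'xn--pxagfdlp'
```

The folded form lacks the tonos, so the upper-case spelling lands on
a different A-label than the lower-case one.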

Unicode actually anticipated some of the problems by specifying
that case mapping is to be used only for comparison and not for
replacing characters with their case-mapped "equivalents".
Unfortunately, there is no way to reflect that distinction in
IDNA since using case-mapped comparison, as Unicode apparently
intends, would have required DNS server modifications for IDNs.

The approach taken to this problem, and to many that are
functionally similar to it, in IDNA200X is that, if we cannot
get the mappings right in every case, we should do no mapping at
all, transferring responsibility for such mappings outside the
protocol, to "preprocessing" or the user interface.  The
language describing that step has been revised in the new
version of draft-klensin-idnabis-issues, but the original
treatment of it as a user interface issue reflected the hope
that, in the long term, the mappings would be dealt with as
close to the original point of creation of a reference to the
domain name as possible.

Now the specific implications of this with regard to your
examples above is that, in IDNA200X, there simply is no such
thing as a domain name label with characters in upper case.
Consequently, "ΔΟΚΙΜΗ.gr" becomes invalid as a domain
name, although you are free, if you see it on input, to map it
into either "δοκιμή.gr" and "xn--jxalpdlp.gr" or into
"δοκιμη.gr" and "xn--pxagfdlp.gr" as suits your needs.
That is obviously not an ideal solution from your standpoint or
that of anyone else.  But it appears that is the best that it is
possible to do in this difficult situation... and that the right
solution is to educate all users of scripts that differentiate
cases that IDNs exist only in lower case and that they should
get used to not seeing or typing, e.g.,
ΔΟΚΙΜΗ.gr, because there is no way to guarantee that the
same thing will happen to it on all systems.  In IDNA2003, there
was also no such thing as an upper-case IDN domain name label as
stored in the DNS, but the fact was partially hidden from the
user.  That model produced results that were consistent but, as
in the case you cite, sometimes wrong.

> As a consequence of this fact, in order to implement the Greek
> language in the domain name space we had to use the solution
> of bundles and in many cases DNAME. You are welcome to check
> my presentation "IDNs in Greece"
> (http://www.icann.org/meetings/lisbon/presentation-idns-greece
> -27mar07.pdf) for our solution.

I have just skimmed this and will read it more carefully soon.
Unfortunately I (and I suspect others) had not seen it before or
the document drafts would have responded to the issues earlier.
But another reality that we (led by others) have discovered in
the last several years is that there are some issues that cannot
be dealt with effectively in a protocol, especially a
client-side one, at least without introducing information that
is not available in the DNS.   The only way to deal with them
effectively appears to be some type of restrictions on what a
registry permits to be registered.  The fact that we first
discovered problems of that type with Han-derived (CJK) scripts
is a historical accident.  The fact that the communities
involved could cooperate to produce the JET spec (RFC 3743) and
subsequent work is a testimonial to the fact that, with an
understanding of the issues, inspired leadership, and a
willingness to cooperate, representatives of the languages that
share a script can come together and develop principles that
work for everyone, or at least everyone who is willing to
cooperate. 

The solution of using variant bundles seems quite reasonable and
appropriate to me --A Good Thing and a demonstration of
exceptional responsibility on the part of the registry-- rather
than something you should be concerned about.   It is a bit of
extra work, but it definitely provides users additional
assurances against the consequences of ambiguity.
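
As a rough illustration of the bundling idea, here is a hypothetical
sketch in Python.  It is not the .gr registry's actual procedure
(that is described in the presentation cited above); the helper names
strip_tonos and bundle are assumptions made for this example, and the
built-in "idna" codec is used for the IDNA2003 encoding.

```python
import unicodedata

def strip_tonos(label):
    """Remove combining marks (such as the tonos) from a label.

    Assumption: dropping every combining mark is acceptable here; a real
    registry policy might treat other marks (e.g. dialytika) differently.
    """
    decomposed = unicodedata.normalize("NFD", label)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

def bundle(label):
    """Map each variant of `label` to its IDNA2003 A-label (hypothetical bundle)."""
    variants = {label, strip_tonos(label)}
    return {variant: variant.encode("idna") for variant in sorted(variants)}

for variant, a_label in bundle("δοκιμή").items():
    print(variant, a_label)
```

Registering (or aliasing with DNAME) every A-label in such a bundle
means that either spelling reaches the same registrant.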

> Now let's come to this newly brought issue of the final sigma.
> Before this discussion we knew that the final sigma was
> bundled with the small letter sigma. If you tried a Greek IDN
> with the final sigma in each and every position where you had
> a small letter sigma, it was equivalent to the same domain with
> a small sigma in every place. If you started from the xn--
> form on a browser, you would never get a final sigma but
> instead in all the positions you would have the small sigma.
> Not the best solution but it was an acceptable one since it
> allows the use of the final sigma as it is used in the Greek
> language and still does not create any phishing problem if you
> use it instead of the normal small sigma - they are
> interchangeable.

Yes, although it does mean that, unless the implementers of
software that displays IDNs make adjustments, if a string
containing a final sigma goes into the DNS, one containing a
small sigma comes out, so 
   input-label is not equal to ToUnicode(ToASCII(input-label))

It is not obvious on first glance, but the exact same properties
of Nameprep that caused the difficulties you encountered with
tonus are the ones that permitted that treatment of final sigma.
At least without making many, many special cases for individual
characters, it is hard, if not impossible, to get both right.
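
The final-sigma behaviour can be observed with a short sketch, again
assuming Python's built-in IDNA2003 "idna" codec; the sample label is
chosen only for illustration.

```python
import encodings.idna

name = "βαγγέλης"   # ends in final sigma (U+03C2)

# The same Nameprep case-mapping table that folds capitals also maps
# final sigma (U+03C2) to small sigma (U+03C3).
prepped = encodings.idna.nameprep(name)

# ToASCII followed by ToUnicode, expressed with the codec.
round_trip = name.encode("idna").decode("idna")

print(prepped)              # βαγγέλησ
print(round_trip == name)   # False
```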

> Some people on this list propose this should change. Can you
> please clarify your proposal on this issue and be as kind as
> to explain to us Greeks why the previous solution creates
> problems to your protocol?

The discussion was initiated by people who believed that final
sigma should be treated as a separate character from small sigma
so that the typographic form would be unambiguously preserved.
That proposal was made by people who claimed to have great
knowledge of Greek and to represent the entire linguistic
community on the subject.   Of course, that would subject you to
the same problem you have with tonus plus the problem that a
given string, resolved under IDNA2003, would produce a different
punycode string than if it were resolved under IDNA200X.   You
could resolve that problem only with variant (presumably
bundling) techniques and/or DNAMEs and would have to apply those
techniques retroactively, to all existing registrations
containing sigmas.

As to why the earlier treatment was a problem: it is not a
problem for sigma itself, but keeping it would introduce the
notion of specific character mappings into the new protocol, a
process that works well when the mapping is correct and one is
sufficiently familiar with the relevant script but that can be
astonishing or just plain wrong in other cases.   In Greek,
with the tonus and sigma cases, IDNA2003's
mappings demonstrate both cases.  You believe (I presume
correctly) that IDNA2003 got sigma right and tonus wrong.   The
people who have been speaking for you and the linguistic
community claim that the handling of sigma was wrong.  I presume
that, were the discussion to continue long enough, we would find
someone insisting that the handling of tonus was correct as it
appears in IDNA2003.

It appears that, globally, the right thing to do is to move
toward having only one official, interchange, pair of forms of
an IDN label.  Those forms should basically be the ones for which,
using IDNA2003's terminology for the operations,

    label == ToUnicode(ToASCII(label)) 
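
That condition can be written as a small predicate.  This is only a
sketch using the IDNA2003 operations provided by Python's built-in
"idna" codec; the function name is_round_trip_stable is an assumption
made for this example, and the rules eventually adopted for IDNA200X
may classify some labels differently.

```python
def is_round_trip_stable(label):
    """True if label == ToUnicode(ToASCII(label)) under IDNA2003 rules."""
    try:
        return label.encode("idna").decode("idna") == label
    except UnicodeError:
        # Labels the codec rejects cannot be interchange forms.
        return False

candidates = ["δοκιμή", "δοκιμη", "ΔΟΚΙΜΗ", "βαγγέλης", "βαγγέλησ"]
stable = [c for c in candidates if is_round_trip_stable(c)]
print(stable)   # ['δοκιμή', 'δοκιμη', 'βαγγέλησ']
```

Upper-case input and, under IDNA2003, labels containing final sigma
fail the test because the mapping step rewrites them.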

We need to get ourselves and the users used to those forms and,
in the interim (and as appropriate, for much longer) perform
whatever mappings are needed outside the protocol to maximize
compatibility with existing IDNA2003 labels and what the user
would construe as good sense.  IDNA200X gives more flexibility to do
that in a linguistically-correct way, not less.  For example, I
believe that IDNA2003 permits only one form of output and
display as the result of 
    ToUnicode(label-in-punycode-encoding)

while IDNA200X permits the display of an A-label that is
converted to a U-label in any form that the relevant software
would permit to be mapped into the U-label.  We encourage that
only if great care is exercised about where and how it is done
because the mapping-dependent display forms may not be usable in
other contexts.  But it should be done where it makes sense.

> I thought that the IDNs would be implemented to solve the
> language barrier problems in the use of internet. Instead I
> find out that the correct use of a language is not a priority
> if this happens to create exceptions in the protocol you are
> trying to propose. I am afraid that we are facing a problem
> there as science should make life easier to the people,
> instead of requiring people to adapt their language to
> protocols, which has been the case for many non-latin-alphabet
> people.

First of all, because they are client-side extensions to the
DNS, IDNs may be able to solve script barrier problems.  Actual
language barrier problems require solutions elsewhere.  The
distinction between the two is important.  Second, we know (at
least now) that the only optimal way to solve many of these
problems requires fundamental DNS protocol and server changes so
that information can be preserved in the labels while matching
labels that are not bit-string-identical on the server.  That
is, of course, exactly what is done for lower and upper case
comparisons in ASCII labels.  Because of likely deployment
problems and other complications, the community has rejected
solutions that depend on DNS server changes several times.  I
believe that, given the opportunity, such solutions would be
rejected again.   Finally, while we could, I think, do better
(or more consistently) than what we are doing now at the cost of
making significant incompatible changes between IDNA2003 and
IDNA200X (probably requiring a new prefix; see the discussion in
the new version of the "issues" document) we do not believe that
the community is willing to accept the costs of such changes.
Perhaps we are wrong about the latter.

> If there seems to be something that needs some straightening out,
> it is this tonos hyphenation problem which is very serious for
> us and not this final sigma issue. I would welcome your
> proposals on this serious issue.

See above.  Of course, there may be other ways to handle this
that would work better for all concerned.  If you have such
suggestions, please make them.

best regards,
   john