idna-bis and Eszett

Wed Nov 28 12:08:18 CET 2007

At 19:36 07/11/27, John C Klensin wrote:
>(I've changed the subject line because the Sharp S / Eszett
>character has been mangled well beyond recognition as it has
>moved back and forth among mail systems -- something that should
>be a warning to all of us.)

Yes, my mailer is definitely one of the culprits (Japanese
version of Eudora on Windows, works for iso-2022-jp and
occasionally for the Japanese subset of utf-8).
I didn't even care to fix it because I assume everybody
on this list understands what's going on.

>--On Tuesday, 27 November, 2007 13:04 +0900 Martin Duerst
><duerst at it.aoyama.ac.jp> wrote:
>
>>>> Yes, that is a problem with the IRI spec.
>>... 
>> What I'm surprised is the lack of understanding and
>> responsibility when proposing making potentially wide-reaching
>> changes to a spec. What idnabis does is to change the rules
>> for non-ASCII domain names. Up to now, a sharp s in a domain
>> name was mapped to 'ss'. With idnabis, such a sharp s is
>> simply 'a user interface issue'.
>
>Martin,
>
>I'm not inclined to worry about whether this is a "problem with
>the IRI spec" or elsewhere.

Knowing how often you worry (and in many cases for good reasons)
about something when others don't, I'm almost relieved :-).

>But we clearly have a problem, or
>perhaps several interconnected ones.  I believe that the
>following statements are all true:
>
>(1) By using NFKC at both registration and lookup time, IDNA2003
>permits a large number of mappings to occur.   If end users have
>become dependent on those mappings for export and interchange,
>they are, to a greater or lesser extent, in trouble.

"they are in trouble" has to be qualified. They are in trouble
if idnabis goes the way currently laid out, and they are in
trouble because they trusted that the IETF wouldn't willy-nilly
change their specifications and protocols.

With respect to the 'K' part of NFKC, I have to personally agree
that it was a bad idea in the first place to use it (I remember
how I argued in detail against it in the design team and/or in
the WG), but I think we have to very clearly distinguish between
new protocol design and changes to an existing protocol.

Also, with respect to the 'K' part, apart from the full-width/
half-width collapsing applying to East Asian locales
(for which the 'K' part was put into IDNA2003), I think the cases
where it actually gets used are few and far between, although
I think Erik van der Poel or Mark Davis or somebody else from
Google found a case or two of an 'fi' ligature in a domain name/
URI/IRI.
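
To make the 'K' part concrete, here is a minimal sketch using Python's
standard unicodedata module. It only shows the NFKC step in isolation,
not the rest of Nameprep; the helper name nfkc and the sample strings
are made up for illustration.

```python
import unicodedata


def nfkc(label: str) -> str:
    """Apply Unicode NFKC normalization to a single label."""
    return unicodedata.normalize("NFKC", label)


# Full-width Latin letters, as typed in many East Asian input methods.
fullwidth = "\uff45\uff58\uff41\uff4d\uff50\uff4c\uff45"
# U+FB01 LATIN SMALL LIGATURE FI followed by "sh".
ligature = "\ufb01sh"
# U+FF76 HALFWIDTH KATAKANA LETTER KA.
halfwidth_ka = "\uff76"

print(nfkc(fullwidth))     # compatibility mapping to plain ASCII letters
print(nfkc(ligature))      # the ligature is split into "f" + "i"
print(nfkc(halfwidth_ka))  # mapped to the ordinary (full-width) katakana
```

The full-width/half-width collapsing is the case that matters in
practice; the ligature is the kind of rare case mentioned above.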

>(2) In addition to the NFKC mappings, there are a few mappings
>that are moderately specific to IDNA, including the one of Eszett.  They
>raise all of the issues of (1), but have the further
>disadvantage (for compatibility purposes) of being fairly easy
>to type on keyboards designed for the countries/languages that
>use those characters (the characters mapped out by NFKC are
>typically harder to type).

Yes, typically, with the exception of full-width Latin characters
in East-Asian locales.

The mappings that you don't mention are NFC mappings (well, they
are a subset of NFKC mappings, so maybe you subsumed them there,
and the character sequences to be mapped are also in general
difficult to type), and, more importantly, case mappings.
Upper-case is not at all difficult to type, quite to the contrary.
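
A small sketch of the difference between these kinds of mappings, again
with Python's unicodedata module. Note the assumption: str.casefold()
implements current Unicode full case folding, which is close to, but
not the same table as, the case mapping Nameprep specifies. The
variable names are only for illustration.

```python
import unicodedata

# NFC: "u" followed by COMBINING DIAERESIS composes to a single U+00FC.
decomposed = "u\u0308"
composed = unicodedata.normalize("NFC", decomposed)

# Eszett is untouched by NFKC; it is case folding that turns it into "ss".
eszett = "\u00df"
nfkc_eszett = unicodedata.normalize("NFKC", eszett)
folded_eszett = eszett.casefold()

# Plain upper case, which is trivially easy to type, is folded too.
folded_upper = "EXAMPLE".casefold()

print(len(decomposed), len(composed))
print(nfkc_eszett == eszett, folded_eszett)
print(folded_upper)
```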

>(3) These mappings have been a source of user confusion and some
>confusion in systems trying to use IDNs as if they were
>ordinary domain names.   For example, one cannot get a character
>that is mapped out back from a reverse lookup.  While that
>raises no issues if, as specified, one compares only the ACE
>forms, users who attempt a visual comparison will be in more or
>less trouble, depending on how much they understand the writing
>system.

I definitely understand that some users in Germany were confused
about how the eszett worked (or didn't, depending on how you look
at it). But I doubt that we reduce the confusion if we change the
rules now.

>(4) The larger registry operators who are handling IDNs are
>increasingly refusing to accept registrations in raw form,
>permitting only the ACE form or ToUnicode(ToASCII(string)) to be
>registered.  As far as they are concerned, there is no such
>thing as a label containing Eszett, only labels containing the
>"ss" sequence.

I think that makes perfect sense. Just accepting an eszett
without showing the user that it will always be mapped to "ss"
would be pretending that something existed that in fact doesn't.
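
The ToUnicode(ToASCII(string)) round trip can be tried out with
Python's built-in "idna" codec, which follows IDNA2003 (RFC 3490 with
Nameprep). This is only a sketch: the codec approximates the two
operations per label, and the helper name round_trip and the
".example" names are made up for illustration.

```python
def round_trip(name: str) -> str:
    """Approximate ToUnicode(ToASCII(name)) with the IDNA2003 codec."""
    ace = name.encode("idna")   # roughly ToASCII, applied label by label
    return ace.decode("idna")   # roughly ToUnicode, applied label by label


original = "stra\u00dfe.example"      # contains an eszett
ace_form = original.encode("idna")
restored = round_trip(original)

print(ace_form)   # what actually goes into the DNS
print(restored)   # what a registry would show back to the registrant
print("\u00df" in restored)
```

The eszett does not survive the round trip, which is exactly what a
registry accepting only the round-tripped form makes visible to the
user.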

>(5)  There have been some user complaints and confusion about
>IDN mapping to and from the ACE form losing information.

I'm sure that there have been some user complaints about every
single big or small aspect of IDNA. This is not an area where
you can make everybody happy, whichever way you turn things.
So statements of the form "There have been complaints..."
just are met with a "so what?" from me. What we would need
is a quantification of these complaints, as well as a quantification
of the expected number of complaints that we would get with a different
way of doing things, plus a quantification of complaints that
we will get because we change things.

>(6) There have been some complaints that Eszett cannot be
>actually stored in an IDN, i.e., preserved in conversions to and
>from the ACE form.

Apart from my comments above on "There have been some complaints...",
I can very well understand this. I remember explaining in detail
to Paul Hoffman, when I once visited him in Santa Cruz, that people
wouldn't be happy with mapping eszett to ss.

>(7) It appears that the standard orthographic rules about
>whether it is appropriate or desirable to replace Eszett with
>"ss" vary among German-speaking countries,

Yes indeed. Switzerland doesn't use eszett, because around the
start of the last century, it didn't fit on typewriters at the
same time as French accents.

>so there is less
>guidance from common practice than might appear at first glance.

The fact that some letters aren't much used in some countries
doesn't really count as an argument for whether they should be
available in IDNs or not. If we went by that rule, we would
not have many characters in domain names, probably not even
LDHs :-(.

>So...
>
>> Independent of whether mapping in idna2003 was a good idea
>> or not, what the above change does is to just leave some
>> domain names float in the air.
>
>Well, it is not "independent", because the mappings have turned
>out to be problematic.

"There have been complaints..." or "have turned out to be problematic"
is not enough to just change everything. What we had better be damn sure
about is that the new thing is significantly better, and that the
transition will be rather painless.

As we all know, domain names are not words. For good reasons, we try
to make a wide range of characters available in IDNs, but we know
that there are limits. As an example, an apostrophe isn't allowed
in US-ASCII domain names, and won't be in IDNs, even though
quite a few people might want a domain name like O'hare.com.
At some point, people will have to accept that there are some
limitations. At some point, people will actually accept these
limitations; people are in many ways much better at getting used
to things than computers.
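
As a rough illustration of the apostrophe example, here is a sketch of
an LDH (letter-digit-hyphen) label check in Python. The pattern and the
function name is_ldh_label are illustrative assumptions, written from
the usual hostname rules (1 to 63 characters, no leading or trailing
hyphen), not taken from any particular implementation.

```python
import re

# Letters, digits and hyphens only; 1 to 63 characters;
# no hyphen at the start or at the end of the label.
LDH_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_ldh_label(label: str) -> bool:
    """Return True if the label uses only the LDH repertoire."""
    return LDH_LABEL.fullmatch(label) is not None


print(is_ldh_label("ohare"))
print(is_ldh_label("O'hare"))  # the apostrophe is outside LDH
```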

Changing the rules in mid-game means that people get to complain
twice, and have to get used to things twice. Also, in addition
to getting the impression that they are using a system that's
somewhat suboptimal (and whichever way we go, there will be
some suboptimal parts/aspects), they also get the impression
that they are using a system that's constantly changing.
The latter is in many ways more of a problem, because it
undermines people's trust.

>And, strictly speaking, no domain names
>are up in the air, only external presentation forms of domain
>names.  But, semantic technicalities aside,

IDNA2003 clearly said what would work on a browser or a similar
client, independent of whether you call that presentation form
or not. Calling that a semantic technicality is useless hairsplitting.

>this situation
>represents a real and significant tradeoff and nothing is cast
>in stone.

Okay. That sounds a lot better than most of what I have read so
far in this thread.

>If it is better to map Eszett -> "ss" in the
>protocol than to reject it (at the protocol level) that can
>certainly be done.

I think that the first thing that we need is a reassessment of
the value of keeping the status quo versus 'improvements'.
In particular, how 'broken' does something have to be to
justify us 'fixing' it, and how do we assess whether a fix
actually will make users happier overall?

>In making that decision, we do need to
>understand that there is a slippery slope between the mappings
>of (2) above and those of (1) and between either of those and
>alternate label separators, which actually introduced a
>conceptual bug into IDNA.

For the record, I never liked alternate label separators either.
If you carefully read the IRI spec, you might actually note
that it doesn't allow them :-).

>They are calling my plane; more later.

Looking forward to it.

Regards,   Martin.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst at it.aoyama.ac.jp