idna-bis and Eszett

Tue Nov 27 19:47:35 CET 2007

Just a reminder that whatever we do, some form of normalization is required, for the reasons I laid out in the idnaprep document I posted to this list a while ago. Extract:
<<
Note that although the restricted character repertoires are stable through normalization, the normalization step is still necessary. There are three reasons for this requirement:

1. combining sequences that are made of elements of the restricted repertoires normalize into composite characters (for example, <U+0061, U+0301> (LATIN SMALL LETTER A followed by COMBINING ACUTE ACCENT) becomes U+00E1 LATIN SMALL LETTER A WITH ACUTE),
2. combining mark re-ordering,
3. Hangul Jamos/syllables composition.
The restricted repertoires are designed in a way that all versions of the Unicode normalization form C starting from Unicode 5.0 will provide the same result for those repertoires.

The restricted repertoires of an idnaprep profile cannot contain any character that changes value when normalized to normalization form C by itself. Additions to the restricted repertoires in idnaprep profiles for future Unicode versions MUST NOT include any character that changes value when normalized using NFC.

...

Note that other string preparations use the Unicode normalization form KC (NFKC), which maps many "compatibility characters" to their equivalent character. However, because the restricted repertoires are stable through normalization (i.e. NFKC(cp)=cp), in other words they exclude compatibility characters that could be mapped by NFKC, it is unnecessary to use NFKC for the normalization of the string in the context of idnaprep. This choice of NFC is also consistent with the recommendation for Internationalized Resource Identifiers (IRIs) [RFC3987] (Duerst, M. and M. Suignard, "Internationalized Resource Identifiers (IRIs)," January 2005.). However, an application may still use NFKC to filter user input before applying idnaprep.
>>
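
To make the three reasons in the extract concrete, here is a minimal Python sketch using the standard unicodedata module. The helper names (nfc, nfkc_stable) and the sample characters are my own choices for illustration and are not taken from the idnaprep document; the results also depend on the Unicode version shipped with the interpreter.

```python
import unicodedata


def nfc(s: str) -> str:
    """Return s in Unicode normalization form C."""
    return unicodedata.normalize("NFC", s)


def nfkc_stable(ch: str) -> bool:
    """True if NFKC leaves ch unchanged, i.e. NFKC(cp) = cp (hypothetical helper)."""
    return unicodedata.normalize("NFKC", ch) == ch


# Reason 1: a combining sequence composes into a single character.
composed = nfc("a\u0301")            # a + COMBINING ACUTE ACCENT

# Reason 2: combining marks are put into canonical order, so two
# different input orders end up as the same string.
order_a = nfc("q\u0301\u0323")       # acute first, then dot below
order_b = nfc("q\u0323\u0301")       # dot below first, then acute

# Reason 3: conjoining Hangul jamos compose into one syllable.
syllable = nfc("\u1112\u1161\u11ab")

print(composed == "\u00e1")          # LATIN SMALL LETTER A WITH ACUTE
print(order_a == order_b)
print(syllable == "\ud55c")          # a single precomposed Hangul syllable
print(nfkc_stable("a"), nfkc_stable("\ufb01"))  # the fi ligature is a compatibility character
```

A repertoire check along the lines of nfkc_stable is one way an implementation could verify that a candidate character is safe to admit, under the assumption that the repertoire is tested one code point at a time.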

In my opinion this process should be part of a revision of IDN. However, I agree with John's seven statements below, especially the fourth one. We should really give preeminence to the ToUnicode(ToASCII(string)) form, because it is the value that gets registered through a reversible Punycode transform, and UIs should favor that form. I have been convinced [reluctantly] that we could get away without including case folding in IDN processing, as long as the case folding is clearly described as an optional pre-processing step in the same document so that implementers do it consistently.
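
As an illustration of what the ToUnicode(ToASCII(string)) form looks like in practice, here is a small sketch using Python's standard encodings.idna module, which implements the IDNA2003 operations (not idnabis). The function name registered_form and the sample labels are my own assumptions for the example.

```python
from encodings import idna


def registered_form(label: str) -> str:
    """Return ToUnicode(ToASCII(label)) for a single label, using IDNA2003 rules."""
    ace = idna.ToASCII(label)        # bytes: nameprep, then Punycode if needed
    return idna.ToUnicode(ace)       # back to a Unicode label


for raw in ("Bücher", "Straße"):
    print(raw, idna.ToASCII(raw), registered_form(raw))
```

The mixed-case input comes back lower-cased, and the Eszett comes back as "ss", so the raw form cannot be recovered from the round-tripped form. Only the round-tripped form survives the Punycode transform unchanged.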

I don't think we can prevent IRI strings from containing 'unprocessed' host names, but we should strongly advise implementers to allow only ToUnicode(ToASCII(string)) in host names when there is strong evidence that such an entity is used in an IRI (as Martin pointed out, they may appear in various parts of an IRI). But I don't think there is a magic solution to this issue. As long as some processing happens, either in the IDN protocol or at a higher level, advertisers and brand owners will be tempted to put out the more familiar names (mixed case and all) and we will have to cope with it.

I would also like to see a new version of IRI reflect any new consensus on idnabis as the IRI spec currently has specific terms concerning IDNs which could be invalidated by idnabis as it stands today.

Michel

-----Original Message-----
From: idna-update-bounces at alvestrand.no [mailto:idna-update-bounces at alvestrand.no] On Behalf Of John C Klensin
Sent: Tuesday, November 27, 2007 2:36 AM
To: Martin Duerst; Harald Tveit Alvestrand; Thomas Roessler
Cc: Paul Hoffman; idna-update at alvestrand.no
Subject: Re: idna-bis and Eszett

(I've changed the subject line because the Sharp S / Eszett
character has been mangled well beyond recognition as it has
moved back and forth among mail systems -- something that should
be a warning to all of us.)

--On Tuesday, 27 November, 2007 13:04 +0900 Martin Duerst
<duerst at it.aoyama.ac.jp> wrote:

>>> Yes, that is a problem with the IRI spec.
>...
> What I'm surprised by is the lack of understanding and
> responsibility when proposing making potentially wide-reaching
> changes to a spec. What idnabis does is to change the rules
> for non-ASCII domain names. Up to now, a sharp s in a domain
> name was mapped to 'ss'. With idnabis, such a sharp s is
> simply 'a user interface issue'.

Martin,

I'm not inclined to worry about whether this is a "problem with
the IRI spec" or elsewhere.  But we clearly have a problem, or
perhaps several interconnected ones.  I believe that the
following statements are all true:

(1) By using NFKC at both registration and lookup time, IDNA2003
permits a large number of mappings to occur.   If end users have
become dependent on those mappings for export and interchange,
they are, to a greater or lesser extent, in trouble.

(2) In addition to the NFKC mappings, there are a few mappings
that are moderately specific to IDNA, including the one of Eszett.  They
raise all of the issues of (1), but have the further
disadvantage (for compatibility purposes) of being fairly easy
to type on keyboards designed for the countries/ languages that
use those characters (the characters mapped out by NFKC are
typically harder to type).

(3) These mappings have been a source of user confusion and some
confusion in systems trying to use IDNs as if they were
ordinary domain names.   For example, one cannot get a character
that is mapped out back from a reverse lookup.  While that
raises no issues if, as specified, one compares only the ACE
forms, users who attempt a visual comparison will be in more or
less trouble, depending on how much they understand the writing
system.

(4) The larger registry operators who are handling IDNs are
increasingly refusing to accept registrations in raw form,
permitting only the ACE form or ToUnicode(ToASCII(string)) to be
registered.  As far as they are concerned, there is no such
thing as a label containing Eszett, only labels containing the
"ss" sequence.

(5)  There have been some user complaints and confusion about
IDN mapping to and from the ACE form losing information.

(6) There have been some complaints that Eszett cannot be
actually stored in an IDN, i.e., preserved in conversions to and
from the ACE form.

(7) It appears that the standard orthographic rules about
whether it is appropriate or desirable to replace Eszett with
"ss" vary among German-speaking countries, so there is less
guidance from common practice than might appear at first glance.
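
The distinction between the mappings of (1) and those of (2) can be sketched in Python as follows. This is only an approximation: it uses str.casefold() (Unicode full case folding) as a stand-in for the Nameprep mapping table, and the function name mapping_sources is hypothetical.

```python
import unicodedata


def mapping_sources(ch: str) -> tuple[bool, bool]:
    """Report whether ch is changed by NFKC and whether it is changed by case folding."""
    changed_by_nfkc = unicodedata.normalize("NFKC", ch) != ch
    changed_by_folding = ch.casefold() != ch
    return changed_by_nfkc, changed_by_folding


# FULLWIDTH LATIN SMALL LETTER A: a compatibility character, mapped by NFKC.
print(mapping_sources("\uff41"))
# Eszett: untouched by NFKC, but case folding maps it to "ss".
print(mapping_sources("\u00df"))
# Plain ASCII "a": neither step changes it.
print(mapping_sources("a"))
```

Under these assumptions, Eszett falls into the second group: NFKC alone would leave it in place, and it is the folding step that replaces it.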

So...

> Independent of whether mapping in idna2003 was a good idea
> or not, what the above change does is to just leave some
> domain names float in the air.

Well, it is not "independent", because the mappings have turned
out to be problematic.   And, strictly speaking, no domain names
are up in the air, only external presentation forms of domain
names.  But, semantic technicalities aside, this situation
represents a real and significant tradeoff and nothing is cast
in stone.   If it is better to map Eszett -> "ss" in the
protocol than to reject it (at the protocol level) that can
certainly be done.  In making that decision, we do need to
understand that there is a slippery slope between the mappings
of (2) above and those of (1), and between either of those and
alternate label separators, which actually introduced a
conceptual bug into IDNA.

They are calling my plane; more later.

regards,
    john
