Casefolding Sigma (was: Re: IDNAbis Preprocessing Draft)

Fri Jan 25 10:38:47 CET 2008

--On Friday, 25 January, 2008 17:48 +0900 Martin Duerst
<duerst at it.aoyama.ac.jp> wrote:

> [I'm sorry for the delay in answering some of the messages
> due to other urgent duties (term-final exams).]
> 
> 
> At 04:51 08/01/23, John C Klensin wrote:
> 
>> Nothing prevents their "using final sigma".  Under IDNA2003,
>> they can "use" it, but it is case-mapped to lower case sigma.
>> Once that mapping occurs, they can't get it back, but I
>> suppose that is different from not being able to use it.
> 
> Whether it's different or not depends on the definition of use,
> but I'm sure that users will see it that way: They can't use it
> in the way they usually would.
> 
> You have, in earlier mail, pointed out that this kind of
> behavior is confusing to users, and I think you also have
> given that confusion as one of the reasons for the need of
> work on IDNA.
> 
> 
>> Under
>> IDNA200X, as proposed, their localized software can map it to
>> lower case sigma before getting to IDNA and can, if desired,
>> map any sigma character in a final position into the final
>> form on the way from the ACE form to the native character one.
> 
> At first sight, that may make sense. But we currently have
> absolutely *zero* experience with what's subsumed here under
> the term 'localized software'. In very general terms, it's a
> user interface issue, and the IETF as such has been known to
> not be specialized in such issues.
> 
> Also, it creates a totally unnecessary difference between how
> IDNs have to be dealt with and how other text is dealt with.
> Any such difference will mean confusion to users and additional
> work for implementers.

All true.

Unfortunately, the bottom line is that, without throwing the
IDNA concept away and insisting on IDN-aware DNS servers with
special code in them, there are only three possible ways to
treat a character like this and two of them aren't very
different:

	(i) We do exactly what IDNA2003 does, which involves
	introducing a special case into the protocol that maps
	final sigma into ordinary lower-case sigma. 

	(ii) We do what the combination of the Unicode lowercase
	rules and IDNA200X's general "no mapping" rules predict,
	which is to prohibit the character and permit it to be
	mapped to sigma in preprocessing software.

	(iii) We treat it as a completely separate character
	from sigma, no more related to sigma than alpha is to
	omega.
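
As a rough illustration of (i), the sketch below leans on Python's
built-in "idna" codec, which follows IDNA2003 (RFC 3490 with Nameprep).
The label itself is just a made-up example, and the codec stands in for
whatever an actual application would use.

```python
# Illustration only: Python's built-in "idna" codec follows IDNA2003,
# so it exhibits the final-sigma mapping described in option (i).
FINAL_SIGMA = "\u03c2"  # GREEK SMALL LETTER FINAL SIGMA
SIGMA = "\u03c3"        # GREEK SMALL LETTER SIGMA

# Hypothetical label, spelled once with final sigma and once without.
with_final = "κοσμο" + FINAL_SIGMA
with_plain = "κοσμο" + SIGMA

ace_final = with_final.encode("idna")
ace_plain = with_plain.encode("idna")

# Both spellings collapse to the same A-label ...
same_ace = ace_final == ace_plain

# ... so converting back can only ever give the ordinary sigma.
recovered = ace_final.decode("idna")
final_form_lost = FINAL_SIGMA not in recovered
```

Under (ii) the same collapse would happen, only in preprocessing
software ahead of the protocol rather than inside it.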

In practice, (i) and (ii) are nearly the same in that a
conversion from a string that contains final sigma will lead to
an A-label that does not contain the information that the form
was present.  Consequently, conversion from that A-label back to
a U-label will _never_ produce the final form in either case.
They differ because (i) introduces a rather nasty special case
in which we start specifying special processing rules (rather
than classification rules) for a few (very few) individual
characters and we break the rule that U-labels and A-labels can
be exactly recovered from each other.  That rule has been, so
far, one of the major confusion-reducing advantages of the
IDNA200X model relative to the IDNA2003 one.
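
The recoverability point can be sketched as follows, assuming Python's
"punycode" codec for the encoding step and str.casefold() as a stand-in
for a mapping step. The helper names to_ace and to_unicode are
hypothetical, and no validity checks are attempted.

```python
ACE_PREFIX = "xn--"

def to_ace(u_label: str) -> str:
    """Encode a label with Punycode only; no mapping is applied."""
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")

def to_unicode(a_label: str) -> str:
    """Decode a label produced by to_ace()."""
    if not a_label.startswith(ACE_PREFIX):
        raise ValueError("not an ACE-prefixed label")
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")

original = "κοσμος"  # hypothetical label ending in final sigma

# With a mapping step in front (casefold is an assumed stand-in),
# the string the user typed cannot be recovered from the A-label.
mapped_round_trip = to_unicode(to_ace(original.casefold()))
mapping_is_lossy = mapped_round_trip != original

# With no mapping at all, what goes in is exactly what comes back.
plain_round_trip = to_unicode(to_ace(original))
no_mapping_is_exact = plain_round_trip == original
```

The second case only applies if the final form is allowed in a label at
all, which is the situation of (iii); under (ii) it would be rejected
before encoding, so every string that is actually a valid U-label still
round-trips.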

The third option is certainly feasible, but, at least absent
some registry restrictions that are very specific to Greek, I
can see all sorts of ways to create user astonishment with it --
any user who even minimally reads Greek knows that sigma and
final sigma are a lot more closely related than alpha and omega
(or even alpha and beta), even though some might argue about
whether they were actually the same character or not.  Such
registry restrictions are not hard to write and any of three
different versions would do the job, but there is no way to
require that a registry use them.  Treating final sigma as a
separate character would also require special-case treatment in
any general-purpose preprocessing approach because the "don't
remap valid characters" principle would apply and it would
therefore have to be treated as an exception to the standard
Unicode lowercase rules.
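
A minimal sketch of what that exception might look like in
preprocessing code, assuming Python's str.casefold() as the generic
mapping (an assumption; real preprocessing rules could well differ).
Both function names are hypothetical.

```python
FINAL_SIGMA = "\u03c2"  # GREEK SMALL LETTER FINAL SIGMA

def preprocess_generic(label: str) -> str:
    """Generic mapping: fold everything, final sigma included."""
    return label.casefold()

def preprocess_with_exception(label: str) -> str:
    """Same mapping, except final sigma is left alone, as it would
    have to be if (iii) made it a valid character in its own right."""
    return "".join(
        ch if ch == FINAL_SIGMA else ch.casefold()
        for ch in label
    )

sample = "Κοσμος"  # hypothetical input: capital kappa, final sigma
generic = preprocess_generic(sample)
excepted = preprocess_with_exception(sample)
```

Here the generic version turns the final sigma into an ordinary sigma
while the second keeps it, and that one-character carve-out is the
special case in question.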

    john