Casefolding Sigma (was: Re: IDNAbis PreprocessingDraft)

Fri Jan 25 11:45:22 CET 2008

Hello John,

At 18:38 08/01/25, John C Klensin wrote:
>
>--On Friday, 25 January, 2008 17:48 +0900 Martin Duerst
><duerst at it.aoyama.ac.jp> wrote:

>> Also, it creates a totally unnecessary difference between how
>> IDNs have to be dealt with and how other text is dealt with.
>> Any such difference will mean confusion to users and additional
>> work for implementers.
>
>All true.
>
>Unfortunately, the bottom line is that, without throwing the
>IDNA concept away and insisting on IDN-aware DNS servers with
>special code in them, there are only three possible ways to
>treat a character like this and two of them aren't very
>different:
>
>       (i) We do exactly what IDNA2003 does, which involves
>       introducing a special case into the protocol that maps
>       final sigma into ordinary lower-case sigma. 
>       
>       (ii) We do what the combination of the Unicode lowercase
>       rules and IDNA200X's general "no mapping" rules predict,
>       which is to prohibit the character and permit it to be
>       mapped to sigma in preprocessing software.
>       
>       (iii) We treat it as a completely separate character
>       from sigma, no more related to sigma than alpha is to
>       omega.

What about (iii'): We treat it as a completely separate character
from lower-case sigma, but permit upper-case sigma to be mapped
to it in the right circumstances (with the predicted 'localized
software') and we permit lower-case sigma to be mapped to it
in the right circumstances (via explicit case-by-case DNS mapping).
I'm very sure we can't prohibit the latter; I'm not totally sure
about how much latitude is planned for 'localized software',
so I'm not sure whether the current IDNA200X architecture would
permit the former (but I guess we could fix that).

>In practice, (i) and (ii) are nearly the same in that a
>conversion from a string that contains final sigma will lead to
>an A-label that does not contain the information that the form
>was present.

Yes, except that in the case of (i), this is guaranteed,
whereas for (ii), it's anybody's guess. 'localized
software' may mean that it works for Greeks at home, but
not somewhere in an Internet cafe when on a trip. 'localized
software' may also mean that it works in some applications/
on some OSes, but not on others. 'localized software' may
also mean that it works when directly typing into a browser,
but not when including in plain text from which it is extracted
by a script.
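
To make the guaranteed behaviour of (i) concrete, here is a minimal
sketch in Python. It assumes Python's built-in "idna" codec, which
implements IDNA2003 (RFC 3490 with nameprep); the helper name
round_trip and the sample label are only illustrative.

```python
# Sketch only: relies on Python's built-in "idna" codec (IDNA2003 + nameprep).
# round_trip() and the sample labels are illustrative, not from any spec.

def round_trip(label: str) -> str:
    """Convert a label to its ACE form and back again."""
    ace = label.encode("idna")   # ToASCII, including the nameprep mapping
    return ace.decode("idna")    # ToUnicode

with_final = "\u03bf\u03b4\u03bf\u03c2"  # omicron delta omicron FINAL SIGMA
with_plain = "\u03bf\u03b4\u03bf\u03c3"  # same letters, ending in ordinary sigma

# Both spellings are mapped to the same ACE form ...
same_ace = with_final.encode("idna") == with_plain.encode("idna")

# ... so the final form cannot be recovered when converting back.
recovered = round_trip(with_final)
```

With the IDNA2003 mapping, same_ace is true on every conforming
implementation, and recovered ends in ordinary sigma. Under (ii),
whether with_final is accepted at all depends on whatever
preprocessing the particular application happens to apply.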

>Consequently, conversion from that A-label to a
>U-label will _never_ contain the final form for either case.
>They differ because (i) introduces a rather nasty special case
>in which we start specifying special processing rules (rather
>than classification rules) for a few (very few) individual
>characters

Well, in IDNA2003, these were quite a few. I'd agree that
if we end up with just very few of these, from an engineering
viewpoint it doesn't look good. But there are other viewpoints
to consider.

>and we break the rule that U-labels and A-labels can
>be exactly recovered from each other.  That rule has been, so
>far, one of the major confusion-reducing advantages of the
>IDNA200X model relative to the IDNA2003 one.

In an earlier mail, you wrote that the fact that these can't
be recovered from each other was a (major) source of confusion.
I could agree with this statement, although the main place
where I have personally seen this confusion is in programming,
where I think it can be solved, not on the end user side.

However, I definitely cannot agree with the way the statement
is worded above. It may be true that in a world where there
are only U-labels and A-labels, confusion would be reduced
with IDNA200X. But then we have 'localized software', which
introduces another layer, with something that I'll call
L-labels here for convenience.

My assumption would be that there is very, very little chance to
actually reduce overall confusion by doing all of the following:
a) Increasing the number of kinds of labels from 2 to 3,
b) Moving the confusion from a lower layer to an upper layer
   (which still has to be addressed by programmers, and is still,
   or all the more, visible to end users), and
c) Changing the mapping from a single, standardized one (nameprep)
   to just some suggestions or not even that.

If you have good reasons to think that my assumptions are wrong,
I'd like to hear them.

>The third option is certainly feasible, but, at least absent
>some registry restrictions that are very specific to Greek, I
>can see all sorts of ways to create user astonishment with it --

Yes indeed. Absent some registry restrictions specific to a
script, there can be user astonishment in a lot of scripts.

>any user who even minimally reads Greek knows that sigma and
>final sigma are a lot more closely related than alpha and omega
>(or even alpha and beta), even though some might argue about
>whether they were actually the same character or not.  Such
>registry restrictions are not hard to write and any of three
>different versions would do the job, but there is no way to
>require that a registry use them.

Yes. But all of the major browsers already come with mechanisms
that allow policing of registries to a higher degree than I'd
personally really want, and these mechanisms could kick in
here, too.

>Treating final sigma as a
>separate character would also require special-case treatment in
>any general-purpose preprocessing approach because the "don't
>remap valid characters" principle would apply and it would
>therefore have to be treated as an exception to the standard
>Unicode lowercase rules.

Is "don't remap valid characters" part of IDNA200X as it is
currently intended? That wouldn't be a problem, because the
main exception for mapping would be the context-sensitive
mapping of upper-case Sigma to final or non-final lower
case sigma. But as I understand it, in IDNA200X, upper-case
Sigma is not a 'valid character', so there wouldn't be any
need for special rules. Or did I get something wrong here?
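
As a rough illustration of the context-sensitive mapping I mean,
here is a small Python sketch. It assumes CPython 3.3 or later,
where str.lower() applies the Unicode Final_Sigma condition and
str.casefold() applies Unicode case folding; the sample words are
only illustrative.

```python
# Sketch only: assumes CPython >= 3.3, whose str.lower() applies the
# Unicode Final_Sigma condition. The sample words are illustrative.

FINAL_SIGMA = "\u03c2"  # GREEK SMALL LETTER FINAL SIGMA
SMALL_SIGMA = "\u03c3"  # GREEK SMALL LETTER SIGMA

upper_word = "\u039f\u0394\u039f\u03a3"  # capital sigma at the end of the word
lowered = upper_word.lower()             # context-sensitive lowercasing
folded = upper_word.casefold()           # case folding, no context

ends_in_final = lowered.endswith(FINAL_SIGMA)      # final position: final sigma
folds_to_plain = folded.endswith(SMALL_SIGMA)      # folding: ordinary sigma

initial = "\u03a3\u039f".lower()                   # capital sigma at the start
starts_with_plain = initial.startswith(SMALL_SIGMA)
```

Lowercasing chooses between the two lower-case forms by position,
whereas case folding always yields ordinary sigma. The former is
what 'localized software' would have to do for (iii'); the latter
is essentially what nameprep does.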

Regards,   Martin.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst at it.aoyama.ac.jp