Casefolding Sigma (was: Re: IDNAbis Preprocessing Draft)

John C Klensin klensin at jck.com
Fri Jan 25 18:05:05 CET 2008


Hi Martin,


--On Friday, 25 January, 2008 19:45 +0900 Martin Duerst
<duerst at it.aoyama.ac.jp> wrote:
 
>> Unfortunately, the bottom line is that, without throwing the
>> IDNA concept away and insisting on IDN-aware DNS servers with
>> special code in them, there are only three possible ways to
>> treat a character like this and two of them aren't very
>> different:
>> 
>>       (i) We do exactly what IDNA2003 does, which involves
>>       introducing a special case into the protocol that maps
>>       final sigma into ordinary lower-case sigma. 
>>       
>>       (ii) We do what the combination of the Unicode lowercase
>>       rules and IDNA200X's general "no mapping" rules predict,
>>       which is to prohibit the character and permit it to be
>>       mapped to sigma in preprocessing software.
>>       
>>       (iii) We treat it as a completely separate character
>>       from sigma, no more related to sigma than alpha is to
>>       omega.
> 
> What about (iii'): We treat it as a completely separate
> character from lower-case sigma, but permit upper-case sigma
> to be mapped to it in the right circumstances (with the
> predicted 'localized software') and we permit lower-case sigma
> to be mapped to it in the right circumstances (via explicit
> case-by-case DNS mapping). I'm very sure we can't prohibit the
> latter; I'm not totally sure about how much latitude there is
> planned for 'localized software', so I'm not sure whether the
> current IDNA200X architecture would permit the former (but I
> guess we could fix that).

It may be the hour and lack of sleep, but I'm not sure exactly
what you are proposing here and the details are very important.

It seems to me that we need to distinguish between issues of
transition from IDNA2003 to IDNA200X and longer-term issues in
an IDNA200X environment.

As a transition issue, treating final sigma as a separate,
distinguishable character is very nearly a showstopper.  Since
IDNA2003 maps it to sigma and IDNA200X would map it to itself,
we would have different valid interpretations of the same string
and different ACE forms as a result.  That is bad enough that I
don't see how one could do it without a prefix change, and I
think it would be very hard to justify the rather considerable
costs of a prefix change on the basis of this one character.
Even if one contemplated doing that, we'd have to figure out a
way to promise the community that we would never discover
another character that justified treatment different from the
one it was given initially and that would therefore require yet
another prefix change.  The odds of that being necessary are
very low, partly because IDNA200X gets all of the mappings out
of the protocol, but they certainly would not be zero once this
door was opened.
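
To make the divergence concrete, here is a rough Python sketch.
It is not real IDNA processing: Unicode case folding stands in
for the Nameprep step, "no mapping at all" stands in for
IDNA200X, and the label is my own example.

    # Approximation only: casefold() plays the role of the IDNA2003
    # mapping of final sigma; IDNA200X is modeled as "map nothing".
    label = "νικος"          # ends in U+03C2, GREEK SMALL LETTER FINAL SIGMA

    idna2003_style = label.casefold()   # final sigma folds to U+03C3
    idna200x_style = label              # left exactly as typed

    ace_2003 = "xn--" + idna2003_style.encode("punycode").decode("ascii")
    ace_200x = "xn--" + idna200x_style.encode("punycode").decode("ascii")

    assert ace_2003 != ace_200x   # same input string, two different ACE forms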

Put differently, if one looks at the transition issues, because
of decisions made for IDNA2003 we are no longer in a position to
have a meaningful discussion of final sigma as a distinct
character -- the window on that closed in 2002 (unless one is
willing to advocate for a prefix change).

Now, let's ignore that for a while and pretend that we are
dealing with a blank slate with no backward-compatibility
constraints.

Let me use the term "is coded as" for the string that is
actually represented in ACE form after processing through
punycode and whatever other processing the standard specifies
prior to that.

If something is coded as lower-case sigma, it is perfectly
reasonable under IDNA200X for local presentation software to
display it as final sigma if it is in an end-of-word position
(however one defines that in the hyphen-separated and other
cases).  Perhaps it is reasonable under IDNA2003 as well, but,
as I read that standard, it appears to anticipate that the
results of applying ToUnicode to an ACE form will be displayed
the way they come out and not as any of the possible forms that
could produce the same ACE form.

In IDNA200X, it would be reasonable for that localized software
to map an uppercase sigma that appeared in a final position to
final sigma.  It would not be reasonable to map a lowercase
sigma in the same position to final sigma because lowercase
sigma is a "real" character (i.e., one that can be coded into an
A-label).  Probably that is ok from a Greek standpoint because
having a lowercase sigma in that position at all would be an
oddity -- either a mistake or someone trying to make a specific
point that should be preserved.
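
As a purely illustrative sketch of what such localized software
might do (the word-boundary rule used here, end of label or
before a hyphen, is my own simplification, not anything the
drafts specify):

    import re

    def display_greek(u_label):
        # Presentation side: show an ordinary lowercase sigma (U+03C3)
        # as final sigma (U+03C2) when it ends the label or a
        # hyphen-separated word.  The coded form is not changed.
        return re.sub("σ(?=$|-)", "ς", u_label)

    def map_input_sigma(user_input):
        # Input side: an uppercase sigma (U+03A3) in a final position
        # becomes final sigma, elsewhere ordinary sigma.  Lowercase
        # sigmas already present are deliberately left alone.
        mapped = re.sub("Σ(?=$|-)", "ς", user_input)
        return mapped.replace("Σ", "σ")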

But I don't know what it would mean to map lowercase sigma to
final sigma "under the right circumstances (via explicit
case-by-case DNS mapping)".  The only type of explicit
case-by-case DNS mapping I can think of would be to have the
label containing the lowercase sigma associated with a CNAME or
DNAME record that pointed to an otherwise-identical label
containing a final-form sigma.  That type of mapping certainly
cannot be prohibited.  It would cause some interesting problems
with email and some other protocols because of restrictions on
the use of aliases.  It would violate the current policy against
CNAME entries in many domains (including almost all TLDs).  If
one made rules requiring that type of mapping, there would be no
way to enforce them.  But, otherwise, one could do it.
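
Purely as an illustration of that aliasing idea, and assuming
for the sake of argument that a label containing final sigma
could be registered at all, something like the following
(hypothetical names, with the Punycode codec standing in for
full IDNA processing) is what I have in mind:

    def ace(u_label):
        # Approximate A-label: ACE prefix plus the Punycode encoding of
        # the (already lowercase) label; no validity checks are done.
        return "xn--" + u_label.encode("punycode").decode("ascii")

    with_sigma = "νικοσ"     # ends in ordinary sigma, U+03C3
    with_final = "νικος"     # ends in final sigma, U+03C2

    # One zone-file-style line: the ordinary-sigma spelling is an alias
    # for the otherwise-identical final-sigma spelling.
    print(ace(with_sigma) + ".example.  IN CNAME  " + ace(with_final) + ".example.")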

>> In practice, (i) and (ii) are nearly the same in that a
>> conversion from a string that contains final sigma will lead
>> to an A-label that does not contain the information that the
>> form was present.
> 
> Yes, except that in the case of (i), this is guaranteed,
> whereas for (ii), it's everybody's guess. 'localized
> software' may mean that it works for Greeks at home, but
> not somewhere in an Internet cafe when on a trip. 'localized
> software' may also mean that it works in some applications/
> on some OSes, but not on others. 'localized software' may
> also mean that it works when directly typing into a browser,
> but not when including in plain text from which it is extracted
> by a script.

This is, of course, the argument either for incorporating
comprehensive mapping into the protocol (or requiring it in pre-
and post-processing) or for treating every character as itself
(even obvious case relationships) and letting either the user or
registries (presumably by variant techniques and/or DNS aliases)
sort things out.   The counterarguments are that many mappings
that may seem obvious and correct to someone who is familiar
with a given script may be extremely obscure to someone who is
not and, more important, that a mistake in a mapping may be
impossible to correct later.  

This discussion about final sigma is a perfect example of the
latter.  The decision was made a half-dozen years ago to map it
to lowercase sigma in Stringprep.  Presumably there were good
reasons at the time; Ken's response to my naive question about
final-form characters explains several of them.  But, today, we
are largely stuck with that decision.  Even if there were
consensus that we should be doing something different, making a
change that preserves the information that the original
character was a final-form sigma would now require either:

	* accepting the fact that any labels written with
	final-form sigma will be resolved differently when
	processed by IDNA2003- and IDNA200X-conformant
	applications.
	
	* deciding that this issue is so important that we need
	to change the prefix, making things unambiguous but
	either invalidating all existing registrations or, more
	likely, requiring every application that calls a
	resolver to support both IDNA200X and IDNA2003 code, to
	do double lookups, and to have a rule about how to sort
	things out when both interpretations resolve (a sketch
	of what that dual-lookup logic might look like follows
	this list).
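
Just to make the cost of that second option concrete, every
application would need something like the following hypothetical
sketch, in which ace_200x and ace_2003 are the two ACE forms
produced by the two sets of rules and the tie-breaking rule
("the IDNA200X form wins if it resolves") is invented for the
example:

    import socket

    def resolve_both(ace_200x, ace_2003):
        # Try the IDNA200X interpretation first, then the IDNA2003 one.
        # Returning the first form that resolves is itself the kind of
        # tie-breaking rule that would have to be standardized.
        for candidate in (ace_200x, ace_2003):
            try:
                return candidate, socket.getaddrinfo(candidate, None)
            except socket.gaierror:
                continue
        return None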

To me, neither of those options seems plausible.  That is why
I'm resisting a change to the protocol-level handling of final
sigma even while I wish we had handled it differently the first
time around.  In addition, noticing what is happening here is
one of the reasons I'm so strongly supportive of having MAYBE
categories: caution is in order when making decisions one cannot
un-make, and it seems unlikely to me that we can guarantee
getting everything right all at once.  And final sigma seems to
prove that point.

>> Consequently, conversion from that A-label to a
>> U-label will _never_ contain the final form for either case.
>> They differ because (i) introduces a rather nasty special case
>> in which we start specifying special processing rules (rather
>> than classification rules) for a few (very few) individual
>> characters
> 
> Well, in IDNA2003, these were quite a few.

Actually, there aren't very many.  There are many cases in which
we map one character into another, the vast majority of which
are determined by NFKC.  There is a much smaller number in which
we map one character into another using case-mapping rules, but
in those cases one uppercase letter maps to one lowercase
letter.  The number of cases in which we map a lowercase letter
to another lowercase letter, or to a string of characters that
would look arbitrary to someone unfamiliar with how the script
is used with a particular language or set of languages, is very
small; I believe it is on the order of two.
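
For anyone who wants concrete instances of those categories, a
few lines of Python against the built-in Unicode tables (close
to, but not identical to, the Stringprep tables) illustrate
them; the particular characters are my examples, not an
exhaustive list:

    import unicodedata

    # NFKC-driven mapping: the ligature U+FB01 becomes the two letters "fi".
    assert unicodedata.normalize("NFKC", "\ufb01") == "fi"

    # Ordinary case mapping: one uppercase letter to one lowercase letter.
    assert "Á".lower() == "á"    # U+00C1 to U+00E1

    # The rarer kind: a lowercase letter mapped to another lowercase
    # letter, or to a multi-character string.
    assert "ς".casefold() == "σ"     # final sigma to ordinary sigma
    assert "ß".casefold() == "ss"    # sharp s to the two-letter string "ss"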

> I'd agree that
> if we end up with just very few of these, from an engineering
> viewpoint it doesn't look good. But there are other viewpoints
> to consider.

I'm really not worried about the "engineering viewpoint".  I am
very concerned about getting ourselves back into a situation in
which registrants, registrars, and registries can predict what
is valid in an IDN, and how it will be interpreted, only by
running some procedure on a computer, a procedure that they can
only hope is error-free.

>> and we break the rule that U-labels and A-labels can
>> be exactly recovered from each other.  That rule has been, so
>> far, one of the major confusion-reducing advantages of the
>> IDNA200X model relative to the IDNA2003 one.
> 
> In an earlier mail, you wrote that the fact that these can't
> be recovered from each other was a (major) source of confusion.
> I could agree with this statement, although the main places
> where I have personally seen this confusion is for programming,
> where I think it can be solved, not on the end user side.

We have certainly seen it on the registration side.  One could
quibble about whether would-be registrants are end users, but
they are certainly similar to them in many ways.  And, from the
standpoint of DNS registration, they are the customers.

Let me respond to the rest of your note at another time, unless
others do so first... I have some other commitments I need to
deal with today.

    john



