Casefolding Sigma (was: Re: IDNAbis Preprocessing Draft)

John C Klensin klensin at jck.com
Tue Jan 22 20:51:25 CET 2008



--On Tuesday, 22 January, 2008 19:01 +0000 Michael Everson
<everson at evertype.com> wrote:

> At 13:42 -0500 2008-01-22, John C Klensin wrote:
> 
>> By contrast, final form sigma is not about character confusion
>> in any way.  It is about:
>> 
>>	(i) Whether final form characters are fundamentally
>>	different characters than the base forms of the same
>>	characters?
> 
> Capital B and small b are different characters. Capital Sigma,
> small sigma, and small final sigma are different characters.

Michael, they are "different characters" because we, or you, or
someone else has decided that they are different.  Certainly
they are typographically different, but the DNS is not about
typography.

They are not inherently different.  Modern character coding
terminology insists on calling the script in which capital B and
small b occur "Latin", but case distinctions are, if I recall, a
relatively recent addition to the language of that name.  They
may look different on the printed page (more different in some
fonts than in others), but they are not different when typed,
certainly less different than either one is from "c", because
the shift-key activity is precisely equivalent to the
"presentation variation marker" option I discussed in my note.
Whether the data stream from that keyboard is transmitted as a
bit in a scan code or as separate codes for the shift and the
key selected is just a matter of convention and system
convenience.

Similarly, whether capital B and small b are collated together
(treated as interchangeable), one is placed before the other, or
all of the capital letters are followed by all of the small
ones (or vice versa) is a matter of convention, not a function
of the characters themselves.   It is those collation and
matching operations that are important for the DNS, not
typography or any other context in which one can say "they are
different characters" without any chance of ambiguity.
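
As a rough illustration of "a matter of convention", the Python sketch
below sorts and compares the same strings under two different
conventions.  The helper name labels_match is hypothetical, and Unicode
case folding is used only as a stand-in for whatever matching rule a
protocol chooses; it is not the DNS or IDNA algorithm.

```python
# Two ordering conventions over the same characters.
words = ["b", "B", "a", "C"]

# Convention 1: raw code point order puts every capital before every small letter.
by_code_point = sorted(words)

# Convention 2: compare on a case-folded key, so B and b collate together.
by_folded = sorted(words, key=str.casefold)


def labels_match(x: str, y: str) -> bool:
    """Hypothetical matching rule: equal after Unicode case folding."""
    return x.casefold() == y.casefold()


print(by_code_point)            # ['B', 'C', 'a', 'b']
print(by_folded)                # ['a', 'b', 'B', 'C']
print(labels_match("B", "b"))   # True
print(labels_match("b", "c"))   # False
```

Nothing about "B" or "b" changes between the two sorts; only the chosen
comparison key does.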

>>	(ii) Whether it is necessary and desirable to encode
>>	typographic variations in the DNS for IDNs.   Note that
>>	a "yes" answer to this question puts one on a slippery
>>	slope toward needing to encode glyphs and fonts, rather
>>	than characters.
> 
> Small final sigma is not a "typographic" variation of small
> sigma. They are not freely interchangeable. So your question
> is not appropriate, it seems to me.

How do you feel about U+FB51?  Is it a variation on ALEF WASLA
and hence a compatibility character?  Or is Unicode in error?
It seems to me that you can't have it both ways.

And, again, for IDN purposes, we aren't asserting that these are
the same character.  I'll leave that argument to you, the UTC,
and whoever else makes up the community of philosophers of
characters and coding systems.   Our concern is what has to be
treated as equivalent for DNS matching purposes.

>>	(iii) What the general rules should be for presentation
>>	variations of characters that are normally
>>	position-sensitive and whether Greek final sigma is a
>>	sufficiently special case that it should be treated
>>	differently from all other final forms or
>>	context-sensitive presentation forms more generally.
> 
> The idea that Greek users should not be allowed to use final
> sigma is shocking to me. There must be a technical solution
> found for them.

Nothing prevents their "using final sigma".  Under IDNA2003,
they can "use" it, but it is case-mapped to lower case sigma.
Once that mapping occurs, they can't get it back, but I suppose
that is different from not being able to use it.  Under
IDNA200X, as proposed, their localized software can map it to
lower case sigma before getting to IDNA and can, if desired, map
any sigma character in a final position into the final form on
the way from the ACE form to the native character one.  
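
The local mapping described above could be sketched in Python roughly as
follows.  The function names to_lookup_form and to_display_form are
hypothetical, and "final position" is assumed here to mean only the last
character of the label; real localized software might well apply a
different rule.

```python
FINAL_SIGMA = "\u03c2"  # ς GREEK SMALL LETTER FINAL SIGMA
SMALL_SIGMA = "\u03c3"  # σ GREEK SMALL LETTER SIGMA


def to_lookup_form(label: str) -> str:
    """Before handing a label to IDNA: replace every final sigma with sigma."""
    return label.replace(FINAL_SIGMA, SMALL_SIGMA)


def to_display_form(label: str) -> str:
    """After decoding from the ACE form: show a trailing sigma as final sigma.

    Assumption: only the last character of the label counts as "final".
    """
    if label.endswith(SMALL_SIGMA):
        return label[:-1] + FINAL_SIGMA
    return label


print(to_lookup_form("λογος"))    # λογοσ
print(to_display_form("λογοσ"))   # λογος
print(to_display_form("αβσψω"))   # αβσψω (no trailing sigma, unchanged)
```

The round trip is lossy by design: a final sigma typed in the middle of
a label does not come back, because the display step only looks at the
end of the label.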

What they are prevented from doing is to have a label, say,
αβςψω and expect to distinguish it from αβσψω.  In
IDNA2003, ToASCII of either one is going to yield xn--mxac5cuaf
and ToUnicode(xn--mxac5cuaf) is going to yield αβσψω.   And
I have deliberately used final sigma in the middle of the string
in that example because there is no restriction in IDNA (either
version) that enforces "final".
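
The IDNA2003 behaviour in this example can be checked with Python's
built-in "idna" codec, which implements RFC 3490.  This is only a quick
sketch using that codec as a convenient stand-in for ToASCII and
ToUnicode on a single label; the variable names are illustrative.

```python
# The two labels differ only in the third character: final sigma vs. sigma.
with_final_sigma = "αβςψω"
with_small_sigma = "αβσψω"

# Encoding a single label with the "idna" codec applies nameprep
# (including the case-folding map) and then Punycode.
ace_final = with_final_sigma.encode("idna")
ace_small = with_small_sigma.encode("idna")

# Decoding the ACE form gives back only the mapped (small sigma) spelling.
round_trip = ace_final.decode("idna")

print(ace_final)                       # b'xn--mxac5cuaf'
print(ace_final == ace_small)          # True
print(round_trip == with_small_sigma)  # True
```

Once encoded, nothing in the ACE form records which sigma was typed,
which is why the final form cannot be recovered afterwards.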

If you are convinced that it is really a different character, no
more related to sigma than alpha is to omega, then the IDNA2003
case-mapping action was a mistake.  I suggest that horse is long
out of the barn but, if a strong enough argument could be made
now for changing the definitions incompatibly and treating it
(and, I presume, all other final forms) as "different", then the
two strings above would become different (i.e., would not map to
the same label).   I haven't done a systematic survey, but I
suspect that as many users of Greek would find it surprising if
those strings were construed as different (given other mappings
that occur) as would be surprised the other way.  I suspect that
most Greek users would find the αβςψω form exceedingly
strange, but, for that problem and in a DNS/IDN context, one is
probably much better off with a ban on the final form in the
protocol than treating it as a separate, "different", character.

    john






More information about the Idna-update mailing list