Casefolding Sigma (was: Re: IDNAbis Preprocessing Draft)

Fri Jan 25 10:10:11 CET 2008

Hello Ken,

Many thanks for forwarding the mail from Mark.

At 04:37 08/01/23, Kenneth Whistler wrote:
>Forwarding a response by Mark Davis to the question
>posed by Martin Duerst about this issue. Apologies
>if forwarding through my not-very-capable 8859-1
>email client manages to trash the UTF-8 that Mark
>used in his email.

My Japanese mailer will produce some additional damage :-(.

>--Ken
>
>------------- Begin Forwarded Message -------------
>
>
>The reason for doing this comes from the goal of having case-insensitive
>input. This kind of case folding is similar to ess-zed. Casefolding first
>establishes an equivalence class between characters. In simple cases, this
>is simple:
>
>{a, A}, {b, B}....
>
>There are a handful of more complicated cases
>
>{$B%^%
(B $B%^#
(B $B%[!W(B}, {$B%F]
(B ss, sS, Ss, SS}, {$B%H%"(B}, {$B%H!<(B, i\u0307}}
>
>It then picks a single representative element of each of the equivalence
>classes to map each of the elements to. That makes it a true folding,
>whereby if X is a case-insensitive variant of Y, then fold(X) = fold(Y).

This kind of case folding is very appropriate for applications such
as search. But it is not necessarily, or not at all, appropriate for other
operations. For example, asking for a string to be lower-cased shouldn't
change any of the lower-case letters.
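
The difference between folding and lower-casing can be sketched in a
few lines. This is only an illustration using Python's built-in
str.casefold() and str.lower(), not anything specified for IDNA; the
helper name same_fold is hypothetical.

```python
# Illustration only: str.casefold() applies Unicode full case folding,
# which is context-free.

def same_fold(x, y):
    """True if x and y fall into the same case-folding equivalence class."""
    return x.casefold() == y.casefold()

# Every member of an equivalence class folds to one representative.
sigma_class = ["σ", "ς", "Σ"]
eszett_class = ["ß", "ss", "sS", "Ss", "SS"]
sigma_folds = {s.casefold() for s in sigma_class}
eszett_folds = {s.casefold() for s in eszett_class}
print(sigma_folds)
print(eszett_folds)

# Lower-casing leaves lower-case letters alone; folding does not.
lower_keeps_final_sigma = "ς".lower() == "ς"
fold_keeps_final_sigma = "ς".casefold() == "ς"
print(lower_keeps_final_sigma, fold_keeps_final_sigma)
```

Folding is the right tool for comparison (same_fold), but the last two
lines show why it is a poor substitute for lower-casing: it rewrites a
letter that was already lower-case.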

>Now, the way this is done, one of the two sigmas has to be chosen as the
>representative.

Yes. Please note the antecedent, "the way this is done". This
doesn't serve as a justification.

>It is, of course, possible to have a folding that is
>context-sensitive, and picks the right form of the sigma according to
>position in the word. And rules for doing that are in the Unicode Standard.
>It is simpler, of course, to be context-free, and that course was chosen in
>IDNA2003. There are a couple of further hiccoughs as well. Where interior
>hyphens are used to separate words, the correct form of the sigma can be
>chosen -- but where we have wordsruntogether.com, it would not be.

I guess we don't even know how Greeks would write their words in this
case (keeping the final sigmas final, or changing them to non-final),
or whether they would run them together at all.
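
The hyphen versus run-together point can be sketched as follows. The
helper fold_with_final_sigma is hypothetical and deliberately crude: it
treats only "-" as a word boundary, whereas the Final_Sigma rule in the
Unicode Standard is defined in terms of cased and case-ignorable
characters.

```python
def fold_with_final_sigma(label):
    """Case-fold, then restore final sigma at the end of each
    hyphen-separated word (a rough stand-in for a context-sensitive fold)."""
    words = label.casefold().split("-")
    restored = []
    for word in words:
        # A lone sigma has no preceding letter, so it stays non-final.
        if len(word) > 1 and word.endswith("σ"):
            word = word[:-1] + "ς"
        restored.append(word)
    return "-".join(restored)

hyphenated = fold_with_final_sigma("ΟΔΟΣ-ΟΔΟΣ")
run_together = fold_with_final_sigma("ΟΔΟΣΟΔΟΣ")
print(hyphenated)    # both words end in final sigma
print(run_together)  # only the very last sigma is final
```

In the run-together label the sketch has no way of knowing where the
first word ends, so the interior sigma comes out in its non-final form.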

>It would be possible to allow for both {σ, ς} and {ß, ss} in U and A labels
>in IDNAbis. However, that does preclude case-insensitive preprocessing

I can't read the characters, but if my guess is right, it wouldn't
completely preclude case-insensitive processing; it would only preclude
it for a final sigma. Of course, this is also an undesired inconsistency,
but if we are going to propose "localized software" for this job, as
the current mainstream seems to suggest, that localized software should
be able to take care of the problem. If the Greeks want to allow
final sigma in the middle of a word, we actually have all the tools
to do that, namely bundling (to reserve the other alternative to
make sure there are no spoofers) and CNAME records or whatever they
are called (sorry for not checking) to map from the non-final to the
final sigma where necessary.
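
As a rough idea of what bundling would have to enumerate, the sketch
below lists every label that differs from a given one only in the choice
between σ and ς. The function name sigma_variants is hypothetical and
this is not any registry's actual procedure; the DNS side (mapping one
variant to another) is not shown.

```python
from itertools import product

def sigma_variants(label):
    """Return all labels that differ from `label` only in sigma form."""
    # Each sigma position may be final or non-final; other characters are fixed.
    choices = [("σ", "ς") if ch in "σς" else (ch,) for ch in label]
    return {"".join(combo) for combo in product(*choices)}

bundle = sigma_variants("οδος")
print(sorted(bundle))
```

A registry that bundles would reserve the whole set for one registrant,
so that nobody else can register the other spelling.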

>and
>compatibility with IDNA2003 for these characters.

This is the only argument for not allowing final sigma that I have
seen so far. But I'm not convinced; keeping the bad decisions and
making some more potentially very dangerous ones doesn't seem to me
the right way to go.

>The other two main problematic cases are ess-zed and dotless i. Note that
>there is, in Unicode 5.1, a capital letter ess-zed. However, reflecting the
>fact that this is in extremely uncommon usage, and at the request of the
>German national body, this is not the standard uppercase, which remains SS.

Yes, I agree. But the structure of ess-zed is exactly the same as for
final sigma, with the only distinction that the mapping is to a pair
of letters and not a single letter. Semantically, the ess-zed is more
important, because it marks a distinction in pronunciation, and cannot
be reconstructed (for most cases, a dictionary would do the job, but
there are some where even a dictionary approach isn't good enough).
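
The loss of information can be seen in a small sketch, again using
Python's built-in case operations purely as an illustration. The word
pair Maße/Masse is an example chosen here, not one taken from the
quoted mail.

```python
# Standard uppercase of ess-zed is the two-letter string "SS".
upper_of_eszett = "ß".upper()

# U+1E9E LATIN CAPITAL LETTER SHARP S still lower-cases to ess-zed.
lower_of_capital_eszett = "\u1e9e".lower()

# Two different German words collapse to the same folded string,
# and nothing in the result says which one was meant.
folded = {word: word.casefold() for word in ("Maße", "Masse")}

print(upper_of_eszett, lower_of_capital_eszett)
print(folded)
```

Once both words have been folded to the same string, going back to the
original spelling needs outside knowledge such as a dictionary.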

>The dotless i (ı) and dotted capital I (İ) are special because it is
>impossible to do a language-insensitive folding of these without erasing an
>important distinction in Turkish (it would be the equivalent of folding "a"
>to "e" for other Latin-based orthographies). So case folding effectively
>treats the dotless i and dotted capital I as uncased.

Yes, that one is a really tough one.
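
For reference, this is how a language-insensitive full case folding
behaves for the four i-like letters, sketched with Python's
str.casefold(). The helper codepoints is hypothetical and only there to
make the combining dot visible.

```python
def codepoints(s):
    """Render a string as a list of U+XXXX code point labels."""
    return ["U+%04X" % ord(c) for c in s]

letters = ["I", "i", "ı", "İ"]
folds = {ch: ch.casefold() for ch in letters}

for ch in letters:
    print(codepoints(ch), "->", codepoints(folds[ch]))

# Plain I and i fall together; dotless i and dotted capital I
# do not fold to plain "i".
collapsed_with_plain_i = [ch for ch in letters if folds[ch] == "i"]
print(collapsed_with_plain_i)
```

Dotless i stays as it is and dotted capital I becomes i followed by
U+0307, so neither is merged with the plain i/I pair.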

Regards,    Martin.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp      mailto:duerst at it.aoyama.ac.jp