Casefolding Sigma (was: Re: IDNAbis Preprocessing Draft)

Tue Jan 22 20:37:20 CET 2008

Forwarding a response by Mark Davis to the question
posed by Martin Duerst about this issue. Apologies
if forwarding through my not-very-capable 8859-1
email client manages to trash the UTF-8 that Mark
used in his email.

--Ken

------------- Begin Forwarded Message -------------

The reason for doing this comes from the goal of having case-insensitive
input. The case folding of sigma is similar to that of ess-zed. Casefolding
first establishes equivalence classes of characters. In simple cases, this
is straightforward:

{a, A}, {b, B}....

There are a handful of more complicated cases:

{σ, ς, Σ}, {ß, ss, sS, Ss, SS}, {ı}, {İ, i\u0307}

It then picks a single representative element of each of the equivalence
classes to map each of the elements to. That makes it a true folding,
whereby if X is a case-insensitive variant of Y, then fold(X) = fold(Y).
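
As a rough illustration of the fold(X) = fold(Y) property, here is a
minimal sketch using Python's built-in str.casefold(), which applies
Unicode full case folding. The fold() helper and the classes list are
names made up for this example; this is not the IDNA preprocessing itself.

```python
# Sketch only: str.casefold() stands in for the folding described above.
classes = [
    ["a", "A"],
    ["\u03c3", "\u03c2", "\u03a3"],      # small sigma, final sigma, capital sigma
    ["\u00df", "ss", "sS", "Ss", "SS"],  # ess-zed and its ASCII variants
]


def fold(s: str) -> str:
    """Map a string to the representative of its equivalence class."""
    return s.casefold()


# Every member of a class should land on the same representative.
folded = [{fold(x) for x in members} for members in classes]

for members, reps in zip(classes, folded):
    print(members, "->", sorted(reps))
```

Each class collapses to exactly one representative: "a", the non-final
small sigma, and "ss" respectively.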

Now, the way this is done, one of the two sigmas has to be chosen as the
representative. It is, of course, possible to have a folding that is
context-sensitive, and picks the right form of the sigma according to
position in the word. And rules for doing that are in the Unicode Standard.
It is simpler, of course, to be context-free, and that course was chosen in
IDNA2003. There are a couple of further hiccoughs as well. Where interior
hyphens are used to separate words, the correct form of the sigma can be
chosen -- but where we have wordsruntogether.com, it would not be.
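
To make the difference concrete, the sketch below contrasts a
context-sensitive lowercasing with a context-free folding. It relies on
Python's str.lower(), which applies a final-sigma rule when lowercasing
capital sigma, and on str.casefold(), which does not look at context. The
Greek word is just sample data and the variable names are invented here.

```python
# Sketch only: Python's string methods, not the IDNA2003 mapping tables.
upper = "\u039f\u0394\u039f\u03a3"  # capital omicron, delta, omicron, sigma

# Context-sensitive: the trailing capital sigma becomes final sigma U+03C2.
lowered = upper.lower()

# Context-free: every sigma becomes the non-final small sigma U+03C3.
folded = upper.casefold()

# With a hyphen between the words, both word-final sigmas are detected.
hyphenated = (upper + "-" + upper).lower()

# Run together, only the very last sigma is seen as final.
run_together = (upper + upper).lower()

for name, value in [("lowered", lowered), ("folded", folded),
                    ("hyphenated", hyphenated), ("run_together", run_together)]:
    print(name, [hex(ord(ch)) for ch in value])
```

The run-together case is where the context-sensitive rule gives the wrong
answer for the first word, since nothing in the string marks the word
boundary.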

It would be possible to allow for both {σ, ς} and {ß, ss} in U and A labels
in IDNAbis. However, that does preclude case-insensitive preprocessing and
compatibility with IDNA2003 for these characters.
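
For reference, the IDNA2003 behavior in question can be observed with
Python's standard "idna" codec, which implements the IDNA2003 ToASCII
operation including the nameprep mapping. This is only a sketch of the
existing behavior; the sample labels are made up for the example.

```python
# Sketch only: the stdlib "idna" codec follows IDNA2003 (RFC 3490).
sigma_forms = [
    "\u03bf\u03b4\u03bf\u03c3",  # ends in non-final small sigma
    "\u03bf\u03b4\u03bf\u03c2",  # ends in final small sigma
    "\u039f\u0394\u039f\u03a3",  # all capitals
]

# Under IDNA2003 all three spellings map to the same A-label.
encoded = {label.encode("idna") for label in sigma_forms}

# Ess-zed is mapped to "ss", so the result is a plain ASCII label.
eszett = "stra\u00dfe".encode("idna")

print(encoded)
print(eszett)
```

Allowing final sigma and ess-zed as distinct characters would mean these
inputs no longer all resolve to the same label.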

The other two main problematic cases are ess-zed and dotless i. Note that
there is, in Unicode 5.1, a capital letter ess-zed. However, reflecting the
fact that this is in extremely uncommon usage, and at the request of the
German national body, this is not the standard uppercase, which remains SS.
The dotless i (ı) and dotted capital I (İ) are special because it is
impossible to do a language-insensitive folding of these without erasing an
important distinction in Turkish (it would be the equivalent of folding "a"
to "e" for other Latin-based orthographies). So case folding effectively
treats the dotless i and dotted capital I as uncased.
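
These two cases can be checked with Python's string methods, which follow
the Unicode default (language-insensitive) case mappings. This is a sketch
for illustration, with variable names chosen here.

```python
# Sketch only: default Unicode case mappings as exposed by Python.
eszett = "\u00df"            # small ess-zed
capital_eszett = "\u1e9e"    # capital ess-zed, added in Unicode 5.1
dotless_i = "\u0131"         # small dotless i
dotted_capital_i = "\u0130"  # capital I with dot above

# The standard uppercase of ess-zed remains "SS", not the capital letter.
upper_of_eszett = eszett.upper()

# The capital ess-zed still folds into the same class as "ss".
folded_capital_eszett = capital_eszett.casefold()

# Dotless i folds to itself, so it never merges with plain "i".
folded_dotless_i = dotless_i.casefold()

# Dotted capital I folds to "i" plus combining dot above, not to plain "i".
folded_dotted_i = dotted_capital_i.casefold()

print(upper_of_eszett, folded_capital_eszett)
print([hex(ord(ch)) for ch in folded_dotless_i])
print([hex(ord(ch)) for ch in folded_dotted_i])
```

Neither Turkish letter ends up in the same class as the plain {i, I}
pair, which is how the Turkish distinction survives the default folding.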

On Jan 21, 2008 6:47 PM, Martin Duerst <duerst at it.aoyama.ac.jp> wrote:

> I'm sure this has already been discussed, probably in several
> places, but thinking from a simple user perspective, why should
> final small sigma be disallowed? After all, writing a word ending
> in sigma with a non-final sigma would look really strange, or
> wouldn't it? And likewise writing a word containing a sigma in
> the middle with a final sigma would look really strange, or
> wouldn't it? So in my view, it would be better to address this
> e.g. at the registry level rather than to produce bad typography.
>
> Regards,   Martin.

-- 
Mark

------------- End Forwarded Message -------------