AW: sharp s (Eszett)

Kenneth Whistler kenw at sybase.com
Mon Mar 10 21:55:21 CET 2008


Jelte Jansen said:

> But basically, if I understand this right, the problem here is the old
> rule 'when the sharp s symbol is not available, use ss'. Which, even if
> it is wrong, is what we were taught in school. You may blame the Dutch
> school system in my case, but whether or not it's true, it's apparently
> engraved in more minds. And I'm guessing the source of the lowercasing
> to 'ss' in CaseFold.txt. Which would be weird, since by definition the
> sharp s is available. Please correct me if I'm wrong here.

I think you are mixing up fallback and casefolding.

Fallback is the kind of issue you have when some character
(or glyph) is not available for use (or rendering), and you
need to fall back to something which *is* available. What you
choose for fallback depends on what you have available, but
you generally choose something which is mnemonic in some
way and is related to what is missing.
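
To make the contrast concrete, here is a minimal sketch of a fallback
in Python 3 (3.7 or later, for str.isascii). The table FALLBACKS and
the function ascii_fallback are hypothetical names invented for this
illustration, and the table entries are just one plausible choice,
not taken from any standard.

```python
# Hypothetical fallback table: which replacement to use is a local
# choice that depends on what the target repertoire has available.
FALLBACKS = {
    "\u00df": "ss",  # ß
    "\u00e4": "ae",  # ä
    "\u00f6": "oe",  # ö
    "\u00fc": "ue",  # ü
}


def ascii_fallback(text: str) -> str:
    """Replace non-ASCII characters with a mnemonic ASCII stand-in.

    Characters with no entry in the table become "?".
    ASCII characters, including their case, pass through unchanged.
    """
    out = []
    for ch in text:
        if ch.isascii():
            out.append(ch)
        else:
            out.append(FALLBACKS.get(ch, "?"))
    return "".join(out)


result = ascii_fallback("Stra\u00dfe")
print(result)
```

Note that the capital "S" at the start survives: a fallback only
substitutes what is unavailable and does not touch case.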

Casefolding -- at least the casefolding formally defined in
the Unicode data file, CaseFolding.txt -- is another matter
entirely. That file defines formal equivalence classes based
on case mapping relationships. For most simple case mappings
of the type a --uc--> A and A --lc--> a, you have a symmetric
relationship in the case mappings and a simple case pair.
That case pair then is identified as an equivalence
class {a, A}, and for the purposes of casefolding, one
element of each equivalence class (the lowercase of the pair)
is taken as the "folding" for all elements of the class.

The complication sets in when you have *non*-symmetric case
mappings, as for German sharp s:

   s --uc--> S,  S --lc--> s
   ß --uc--> SS, SS --lc--> ss
   ss --uc--> SS
   
For full casefolding, that creates an equivalence class
{ss, ß, SS}, because ß and ss both uppercase to SS, and SS
lowercases to ss. The "ss" is taken as the "folding" for
all elements of that class.

So this determination has nothing to do with fallback, per se,
but results from asymmetric case mapping.
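
The asymmetry can be observed in any environment that implements the
full case mappings. As a small sketch, assuming Python 3.3 or later
(where str.casefold() is available and applies full case folding),
the variable names below are only for illustration:

```python
sharp_s = "\u00df"  # U+00DF LATIN SMALL LETTER SHARP S

# Asymmetric case mapping: uppercasing expands to two letters,
# and lowercasing the result does not come back to sharp s.
upper_of_sharp_s = sharp_s.upper()
lower_of_caps = upper_of_sharp_s.lower()

# Every member of the equivalence class has the same full folding.
members = ["ss", sharp_s, "SS"]
foldings = {m: m.casefold() for m in members}
all_fold_to_ss = all(f == "ss" for f in foldings.values())

print(upper_of_sharp_s, lower_of_caps, foldings, all_fold_to_ss)
```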


> So if IDNAbis would make an exception for the sharp S, and allow it as
> a separate symbol, would there be people running to their lawyers
> because they think it's equivalent to ss too ...

It *is* equivalent to ss, too. It just depends on what level
and type of equivalence you are talking about. They aren't
equivalent for spelling, obviously -- but they *are* equivalent
for some types of searching and sorting.
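
As a rough illustration of the two levels, again assuming Python 3.3
or later: the helper caseless_equal is a hypothetical name, and it
ignores normalization for brevity, so it is a simplification of a
complete caseless match.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings by their full case foldings."""
    return a.casefold() == b.casefold()


# Not equivalent as spellings: a plain comparison tells them apart.
same_spelling = "Stra\u00dfe" == "Strasse"

# Equivalent for a caseless search: both fold to the same string.
same_for_search = caseless_equal("Stra\u00dfe", "STRASSE")

print(same_spelling, same_for_search)
```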

> (even besides the backwards
> compatibility problem)? And wouldn't every other protocol that needs
> normalization need this exception?

No. By the way, normalization and casefolding are not the same
at all. U+00DF LATIN SMALL LETTER SHARP S is completely
unaffected by *all* forms of Unicode normalization.
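
This is easy to check with an implementation of the normalization
forms. The sketch below assumes Python 3 and its standard
unicodedata module; the variable names are illustrative only.

```python
import unicodedata

sharp_s = "\u00df"  # U+00DF LATIN SMALL LETTER SHARP S

# Apply each of the four Unicode normalization forms.
forms = ["NFC", "NFD", "NFKC", "NFKD"]
normalized = {form: unicodedata.normalize(form, sharp_s) for form in forms}
unchanged_by_normalization = all(v == sharp_s for v in normalized.values())

# Casefolding is a different operation and does change the character.
changed_by_casefolding = sharp_s.casefold() != sharp_s

print(unchanged_by_normalization, changed_by_casefolding)
```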

> In which case it would probably be a
> better mission to try and get the casefolding entry out of Unicode.

I think not. I don't think there is anything wrong at all with
the casefolding treatment of ß in the Unicode Character Database,
and I doubt that the Unicode Technical Committee (UTC) would
countenance a change to it, because of the significant (negative)
impact that would have on implementations using those data files.

--Ken


