AW: sharp s (Eszett)

John C Klensin klensin at jck.com
Mon Mar 10 23:41:58 CET 2008



--On Monday, 10 March, 2008 13:55 -0700 Kenneth Whistler
<kenw at sybase.com> wrote:

> 
> Jelte Jansen said:
> 
>> But basically, if I understand this right, the problem here
>> is the old rule 'when the sharp s symbol is not available,
>> use ss'. Which, even if it is wrong, is what we were taught
>> in school. You may blame the dutch school system in my case,
>> but whether or not it's true, it's apparently engraved in
>> more minds. And I'm guessing the source of the lowercasing to
>> 'ss' in CaseFold.txt. Which would be weird, since by
>> definition the sharp s is available. Please correct me if I'm
>> wrong here.
> 
> I think you are mixing up fallback and casefolding.
> 
> Fallback is the kind of issue you have when some character
> (or glyph) is not available for use (or rendering), and you
> need to fall back to something which *is* available. What you
> choose for fallback depends on what you have available, but
> you generally choose something which is mnemonic in some
> way and is related to what is missing.
> 
> Casefolding -- at least the casefolding formally defined in
> the Unicode data file, CaseFolding.txt -- is another matter
> entirely. That file defines formal equivalence classes based
> on case mapping relationships. For most simple case mappings
> of the type a --uc--> A and A --lc--> a, you have a symmetric
> relationship in the case mappings and a simple case pair.
> That case pair then is identified as an equivalence
> class {a, A}, and for the purposes of casefolding, one
> element of each equivalence class (the lowercase of the pair)
> is taken as the "folding" for all elements of the class.
> 
> The complication sets in when you have *non*-symmetric case
> mappings, as for German sharp s:
> 
>    s --uc--> S,  S --lc--> s
>    ß --uc--> SS, SS --lc--> ss
>    ss --uc--> SS

But Ken, if I correctly understand what has been said on the
list, and what Duden and other authorities say about German,
were it not for fallback issues, there would be no relationship
between Eszett and the "ss" sequence.  If that relationship
did not exist, then the above would be

    s --uc--> S,  S --lc--> s (irrelevant)

    ß --uc--> uppercase ß (which Unicode does not have before
              5.1 and which is dubious from a historical
              orthography standpoint), or
    ß --uc--> ß (a caseless assumption), or
    ß --uc--> (some sort of fault)

and, of course, the above operations with ß are either
reversible and symmetric or involve more special cases.

   SS --lc--> ss  and  ss --uc--> SS (also irrelevant)

From that point of view, the problem here isn't with Eszett.  It
is with the assumption (quite natural, but possibly wrong in a
few odd cases) that, if a script has case distinctions, all of
its characters have case distinctions, _and_ with the imposition
of a fallback on (or confusion of a fallback with) Eszett.

Change either of those things and ignore the introduction of an
upper-case Eszett in 5.1, and one ends up with
    
   ß --uc--> ß   and
   ß --lc--> ß

which looks a little strange but is perfectly natural and, I
think, consistent with what happens when one applies uc or lc
operations to characters that don't have case.
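
To make that concrete, here is a minimal sketch in Python.  It
assumes that Python's built-in str.upper() and str.lower() apply
the Unicode full case mappings; that choice of implementation is
my assumption for illustration, and the helper round_trips() is
a hypothetical name.  It shows caseless characters passing
through uc and lc unchanged, and contrasts that with how ß is
treated under the mapping as currently defined.

```python
# Characters without case are left unchanged by uc and lc.
caseless = ["1", "-", "\u4e2d"]  # digit, punctuation, CJK ideograph
for ch in caseless:
    assert ch.upper() == ch and ch.lower() == ch


def round_trips(ch: str) -> bool:
    """Return True if lc(uc(ch)) gives back the original character."""
    return ch.upper().lower() == ch


sharp_s = "\u00df"  # ß

# lc already treats ß like a caseless character ...
print(sharp_s.lower() == sharp_s)

# ... but uc, as currently defined, maps it to the two-letter sequence.
print(sharp_s.upper())

# An ordinary cased letter round-trips; ß does not, because
# lc("SS") is "ss" rather than ß.
print(round_trips("s"), round_trips(sharp_s))
```

Under the caseless treatment described above, both directions
would be the identity and the round trip would hold trivially.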

> For full casefolding, that creates an equivalence class
> {ss, ß, SS}, and the "ss" is taken as the "folding" for
> all elements of that class.

Only because of the introduction of the fallback into that
model, unless I'm missing something.
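
For reference, a small sketch of the equivalence class Ken
describes, again assuming Python, whose str.casefold() is
documented to implement Unicode full casefolding.  The function
name same_ignoring_case() is hypothetical.  It shows all three
members of {ss, ß, SS} folding to "ss", whereas a plain
lowercase comparison keeps ß distinct.

```python
def same_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings under full casefolding."""
    return a.casefold() == b.casefold()


# The three members of the equivalence class from the quoted text.
members = ["ss", "\u00df", "SS"]

# Collect the folded form of each member; a single shared folding
# means the set collapses to one element.
folded = {m.casefold() for m in members}
print(folded)

# Lowercasing alone does not merge ß with "ss"; casefolding does.
lower_equal = "Stra\u00dfe".lower() == "STRASSE".lower()
fold_equal = same_ignoring_case("Stra\u00dfe", "STRASSE")
print(lower_equal, fold_equal)
```

Whether that merge is desirable is exactly the question; the
sketch only shows what the current definition produces.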

> So this determination has nothing to do with fallback, per se,
> but results from asymmetric case mapping.

Again, that is because someone decided to make ß --uc--> SS,
but that is basically a fallback for the absence of a character.

I am not suggesting that what is happening is wrong --that is a
separate issue-- only that we are in this state of affairs
because of a fallback situation, a case mapping (to upper case)
that is historically reasonable according to some authorities
and dubious according to others, and a casefolding operation
that is defined in a specific way that almost certainly works
for the vast majority of cases but that does not work perfectly
for this one.

     john


