Casing Stability (was: Re: IDNAbis Preprocessing Draft)

Tue Jan 22 01:05:18 CET 2008

Harald asked, in response to Mark's draft:

> > ... Unicode has since stabilized
> > case folding, so that this won't happen in the future. That is, case
> > pairs will be assigned in the same version of Unicode -- so any newly
> > assigned character will either have a casefolding in that version of
> > Unicode, or it will never have a casefolding in the future.

> Out of curiosity: what will Unicode do if you discover a case variant
> that you didn't know existed? (the one that's been talked about a bit is
> an upper-case esszett...)

There are 3 potential situations:

1. A new uppercase is claimed for a character already
   encoded in lowercase only in the standard.

   Result: No problem. Encode the uppercase and add a
           case mapping. This does not violate casing
           stability because it does not change the
           casefolding of the *existing* lowercase character.
           This situation is in fact not uncommon:
           IPA-based orthographies in Africa, for
           example, invent uppercase forms for letters
           that didn't originally have them.

2. A new lowercase is claimed for a character already
   encoded in uppercase only in the standard.

   Result: Encoding a new lowercase for this would be
           prohibited by the casing stability policy.
           This is also an exceedingly odd and rare
           kind of problem to have, because there are
           very few orthographies that start with
           just uppercase letters, and then later innovate
           by adding lowercase pairs, and even fewer that
           would have bizarre uppercase characters that
           wouldn't already have lowercase pairs present
           in the standard. Historically, sure: Latin
           monumental capitals were the original, and Latin
           minuscule forms were a later development out of
           the manuscript tradition. But all that kind of
           development long predates the encoding of the
           bicameral scripts in IT character encoding
           standards. The only recent example we know of
           is the Sencoten orthography, and precisely
           for casing stability, lowercase characters
           were added for 023A, 023E, 023F, despite their
           non-attestation, to prevent a future
           problem should the Sencoten ever decide to add
           lowercase to their orthography.

3. Uppercase esszett.

   Result: This one isn't hypothetical -- it has already
           happened in Unicode 5.1. But it is also a unique case,
           because the lowercase esszett is itself unique.

   Unicode 5.1 UnicodeData.txt entry:

   1E9E;LATIN CAPITAL LETTER SHARP S;Lu;0;L;;;;;N;;;;00DF;

   So it is an uppercase letter (gc=Lu), and it has a lowercase
   mapping to U+00DF LATIN SMALL LETTER SHARP S.

   But the entry for U+00DF itself does not change:

   00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;German;;;

   That is a lowercase letter (gc=Ll), but it does *not* have
   an uppercase mapping to U+1E9E. Why not? Because it already
   has a *special* uppercase mapping to "SS" in SpecialCasing.txt
   (fields: code; lowercase; titlecase; uppercase):

   00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S

   And the entries in CaseFolding.txt are:

   00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
   1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S

   So the new character, U+1E9E, has a *simple* case folding
   to U+00DF, but U+00DF itself has a *full* case folding to
   "ss" <U+0073, U+0073>, and that must stay stable.

   The net of that is that any default casing operation will
   never uppercase U+00DF to U+1E9E -- only a special,
   tailored casing operation that knew what it was doing would
   do that. And all the sharp s's have full casefoldings
   to "ss".
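
As a rough illustration of the net effect (a sketch, not part of the
Unicode data files): Python 3's built-in string methods apply the
default full case mappings, so the behavior described in item 3 can be
checked directly. This assumes a Python whose Unicode database already
includes U+1E9E (Unicode 5.1 or later). The helper name
simple_mappings and the constant names are made up for this example.

```python
# Illustrative sketch only; helper and constant names are hypothetical.

# The two UnicodeData.txt lines quoted in item 3, copied verbatim.
CAPITAL_LINE = "1E9E;LATIN CAPITAL LETTER SHARP S;Lu;0;L;;;;;N;;;;00DF;"
SMALL_LINE = "00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;German;;;"


def simple_mappings(line):
    """Return (general category, simple uppercase, simple lowercase)."""
    fields = line.split(";")
    # Field 2 is the general category; fields 12 and 13 hold the
    # simple uppercase and simple lowercase mappings (empty if none).
    return fields[2], fields[12], fields[13]


SMALL = "\u00DF"    # LATIN SMALL LETTER SHARP S
CAPITAL = "\u1E9E"  # LATIN CAPITAL LETTER SHARP S

# U+1E9E lowercases to U+00DF; U+00DF has no simple uppercase at all.
print(simple_mappings(CAPITAL_LINE))
print(simple_mappings(SMALL_LINE))

# Default (untailored) casing: U+00DF uppercases to "SS", never to U+1E9E.
print(SMALL.upper() == "SS")
print(CAPITAL.lower() == SMALL)

# Full case folding sends both sharp s's to "ss".
print(SMALL.casefold(), CAPITAL.casefold())
```

Getting U+1E9E out of an uppercasing operation would take an explicit,
tailored step on top of this (for example, replacing U+00DF by hand
before calling upper()); nothing in the default mappings does it.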

--Ken