Casing Stability (was: Re: IDNAbis Preprocessing Draft)
Kenneth Whistler
kenw at sybase.com
Tue Jan 22 01:05:18 CET 2008
Harald asked, in response to Mark's draft:
> > ... Unicode has since stabilized
> > case folding, so that this won't happen in the future. That is, case
> > pairs will be assigned in the same version of Unicode -- so any newly
> > assigned character will either have a casefolding in that version of
> > Unicode, or it will never have a casefolding in the future.
> Out of curiosity: what will Unicode do if you discover a case variant
> that you didn't know existed? (the one that's been talked about a bit is
> an upper-case esszett...)
There are 3 potential situations:
1. A new uppercase is claimed for a character already
encoded in lowercase only in the standard.
Result: No problem. Encode the uppercase and add a
case mapping. This does not violate casing
stability because it does not change the
casefolding of the *existing* lowercase character.
This is actually also a not-uncommon kind of
situation, as orthographies based on IPA
in Africa, for example, invent uppercase for
letters that didn't originally have them.
2. A new lowercase is claimed for a character already
encoded in uppercase only in the standard.
Result: Encoding a new lowercase for this would be
prohibited by the casing stability policy.
This is also an exceedingly odd and rare
kind of problem to have, because there are
very few orthographies that start with
just uppercase letters, and then later innovate
by adding lowercase pairs, and even fewer that
would have bizarre uppercase characters that
wouldn't already have lowercase pairs present
in the standard. Historically, sure, since
Latin monumental capitals were the original,
and Latin minuscule forms were a later development
out of manuscript tradition. But all that kind
of development long predates all the encoding of
the bicameral scripts in IT character encoding
standards. The only recent example we know
is the Sencoten orthography, and for the sake of
casing stability, lowercase characters
were added for 023A, 023E, 023F, despite their
non-attestation, precisely to prevent a future
problem should the Sencoten ever decide to add
lowercase to their orthography.
3. Uppercase eszett.
Result: This one isn't hypothetical -- it has already
happened in Unicode 5.1. But it is also a unique case,
because the lowercase eszett is a unique case.
Unicode 5.1 UnicodeData.txt entry:
1E9E;LATIN CAPITAL LETTER SHARP S;Lu;0;L;;;;;N;;;;00DF;
So it is an uppercase letter (gc=Lu), and it has a lowercase
mapping to U+00DF LATIN SMALL LETTER SHARP S.
But the entry for U+00DF itself does not change:
00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;German;;;
That is a lowercase letter (gc=Ll), but it does *not* have
an uppercase mapping to U+1E9E. Why not? Because it already
has a *special* uppercase mapping to "SS", in SpecialCasing.txt:
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
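To make the asymmetry between the two UnicodeData.txt records concrete, here is a minimal Python sketch that splits the two lines quoted above and reads their simple case mapping fields. The helper name parse_record and the dictionary layout are hypothetical, chosen only for illustration; the field positions (general category at index 2, simple uppercase at 12, simple lowercase at 13) follow the usual UnicodeData.txt layout.

```python
# Field positions in a UnicodeData.txt record (0-based, semicolon-separated).
GENERAL_CATEGORY = 2
SIMPLE_UPPERCASE = 12
SIMPLE_LOWERCASE = 13

# The two Unicode 5.1 records quoted in the message.
RECORDS = [
    "1E9E;LATIN CAPITAL LETTER SHARP S;Lu;0;L;;;;;N;;;;00DF;",
    "00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;German;;;",
]


def parse_record(line):
    """Hypothetical helper: pull out the casing-related fields of one record."""
    fields = line.split(";")
    return {
        "code": fields[0],
        "gc": fields[GENERAL_CATEGORY],
        # An empty field means "no simple mapping"; represent that as None.
        "upper": fields[SIMPLE_UPPERCASE] or None,
        "lower": fields[SIMPLE_LOWERCASE] or None,
    }


parsed = {rec["code"]: rec for rec in map(parse_record, RECORDS)}

for code, rec in parsed.items():
    print(code, rec["gc"], "upper:", rec["upper"], "lower:", rec["lower"])
```

U+1E9E points down to 00DF, while 00DF has nothing in its simple uppercase field, so the pair is deliberately one-directional.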
And the entries in CaseFolding.txt are:
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
So the new character, U+1E9E, has a *simple* case folding
to U+00DF, but U+00DF itself has a *full* case folding to
"ss" <U+0073, U+0073>, and that must stay stable.
The net result is that any default casing operation will
never uppercase U+00DF to U+1E9E -- only a special,
tailored casing operation that knew what it was doing would
do that. And all the sharp s's have full casefoldings
to "ss".
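As a rough check of the behavior described above, the following Python sketch exercises the language's built-in string casing methods. It assumes a Python 3 interpreter (str.casefold needs 3.3 or later) whose bundled Unicode data is at least version 5.1; the variable names are only illustrative, and Python's methods are one implementation of default (untailored) casing, not the only one.

```python
SMALL_SHARP_S = "\u00df"    # U+00DF LATIN SMALL LETTER SHARP S
CAPITAL_SHARP_S = "\u1e9e"  # U+1E9E LATIN CAPITAL LETTER SHARP S

# Default full uppercasing uses the special mapping to "SS", never U+1E9E.
upper_of_small = SMALL_SHARP_S.upper()

# The capital letter lowercases to U+00DF via its simple lowercase mapping.
lower_of_capital = CAPITAL_SHARP_S.lower()

# Full case folding takes both sharp s characters to "ss".
folded = [ch.casefold() for ch in (SMALL_SHARP_S, CAPITAL_SHARP_S)]

print(upper_of_small)          # SS
print(lower_of_capital == SMALL_SHARP_S)
print(folded)
```

Getting U+1E9E as the uppercase of U+00DF would take an explicit, tailored mapping applied by the caller; none of the default operations shown here produce it.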
--Ken