local mappings

Sat Jan 24 02:47:22 CET 2009

John,

Skipping over the long discussion of the whys and wherefores,
and getting to what I personally consider the meat of the
question at hand:

> So, for example, we can reasonably expect that, for scripts with
> case, users will be astonished if upper case characters don't
> map to lower case ones.

IMO, yes.

> Could we agree to apply a lower-case
> mapping globally 

OMG, yes!

> -- lower-case and not CaseFold, because
> CaseFold is designed for comparison and not well-suited to
> mapping (more or less quoting from TUS on that) and because the
> subtle properties of CaseFold in the cases in which it doesn't
> produce what LowerCase produces are themselves astonishing to
> the unsophisticated)?

Actually, the only "astonishing" instances for CaseFold are
precisely the problematical instances that everybody knows
about, and which this list has been stumbling over for the
protocol: German eszett, Greek final sigma, and Turkish i's.
Since they *are* exceptions to the general rules, it
shouldn't really be a surprise to anybody if the protocol
handles them as special cases.

>    If we do lower-case, but continue to
> ban compatibility characters and the other odd cases that
> surprise those who don't know what is going on, does that help
> us significantly with the compatibility and astonishment
> situations that are really important? 

Yes, absolutely. If the group is serious about this, this
can very much clear up the perceived complexity in the
documents, as well as better matching the expectations of
both protocol implementers and end users, IMO. *And* it
gives you a way to avoid the entire morass of opening up
the protocol to the mercy of undefined "local mappings".

> I don't think there are
> many other situations similar to the lower-case one (which I
> assume is why it keeps coming up in examples), but need advice
> from Mark, Ken, and others as to whether there are any others
> and what they are.

Nor do I. I'll explain what I think the implications are
below.

> And I can only hope that even suggesting
> this doesn't open cans of worms and arguments about which
> mappings are more important than others.

It shouldn't. You can end up with a simpler and more elegant
protocol this way, *and* avoid the security problem everybody
is so rightfully concerned about.

O.k., details.

Currently we have the following specification, using
exemplary letters:

CP   character               Table A derived value

0061 lowercase-a             PVALID
0041 uppercase-A             DISALLOWED
00E1 lowercase-a-acute       PVALID
00C1 uppercase-A-acute       DISALLOWED
00DF lowercase-sharp-s       PVALID (by exception)
1E9E uppercase-sharp-S       DISALLOWED
03B1 lowercase-alpha         PVALID
0391 uppercase-Alpha         DISALLOWED
03C2 lowercase-final-sigma   PVALID (by exception)
03C3 lowercase-sigma         PVALID
03A3 uppercase-Sigma         DISALLOWED
0131 lowercase-dotless-i     PVALID
0130 uppercase-I-with-dot    DISALLOWED

The uppercase letters all end up DISALLOWED as the result
of the complex and hard-to-explain rule Unstable (B):

toNFKC(toCasefold(toNFKC(cp))) != cp

Sharp-s and lowercase-final-sigma would be DISALLOWED by
the rule, so to match apparent user requirements, the
table spec has to stick them in the Exceptions rule (F).
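
To make the current rule concrete, here is a minimal sketch in
Python. It assumes that Python's str.casefold() and
unicodedata.normalize("NFKC", ...) are acceptable stand-ins for
toCasefold and toNFKC; the function name unstable_b and the
sample list are mine, not anything from idnabis-tables, and the
results depend on the Unicode version bundled with the
interpreter.

```python
import unicodedata


def to_nfkc(s: str) -> str:
    """Stand-in for toNFKC (assumption: Python's NFKC normalizer)."""
    return unicodedata.normalize("NFKC", s)


def unstable_b(cp: int) -> bool:
    """Sketch of rule Unstable (B): toNFKC(toCasefold(toNFKC(cp))) != cp."""
    ch = chr(cp)
    return to_nfkc(to_nfkc(ch).casefold()) != ch


# Exemplary letters taken from the table above.
SAMPLES = [0x0061, 0x0041, 0x00DF, 0x1E9E, 0x03C2, 0x03C3, 0x03A3, 0x0131, 0x0130]

for cp in SAMPLES:
    print(f"{cp:04X} {unicodedata.name(chr(cp))}: unstable={unstable_b(cp)}")
```

Under those assumptions the uppercase letters come out unstable,
and so do 00DF and 03C2, which is why they have to be rescued by
the Exceptions rule.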

And then to deal with the one screwy casing issue left,
the Turkish I casing issue, the protocol opens itself
up to local mappings to allow the possibility for
0130 uppercase-I-with-dot and its unusual case pairings
to work for Turkish, despite the fact that the table
gives 0130 a DISALLOWED value.

Furthermore, to get labels to behave the way people expect
them to in browsers, etc., everybody is depending on
an *externally* specified definition of case folding,
even though, to build the derivation for Table A, the
protocol is already depending on toCasefold internally.

O.k., now here is how this will all look in the specification,
*if* you make toLowerCase() a part of the specification
formally.

CP   character               Table A derived value

0061 lowercase-a             PVALID
0041 uppercase-A             PVALID
00E1 lowercase-a-acute       PVALID
00C1 uppercase-A-acute       PVALID
00DF lowercase-sharp-s       PVALID
1E9E uppercase-sharp-S       PVALID
03B1 lowercase-alpha         PVALID
0391 uppercase-Alpha         PVALID
03C2 lowercase-final-sigma   PVALID
03C3 lowercase-sigma         PVALID
03A3 uppercase-Sigma         PVALID
0131 lowercase-dotless-i     PVALID
0130 uppercase-I-with-dot    CONTEXTO (by exception)

In idnabis-tables, you further simplify as follows:

The rule Unstable (B) gets simplified to:

toNFKC(cp) != cp

All it needs to do is eliminate all the *normalization*
unstable characters, and no longer needs to also deal
with case folding at the same time.

00DF lowercase-sharp-s and 03C2 lowercase-final-sigma
no longer need to be in the Exceptions (F) list, because
they aren't removed by a CaseFolding criterion in the
first place.
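
A rough sketch of the simplified rule, again assuming Python's
unicodedata.normalize("NFKC", ...) as a stand-in for toNFKC.
The function name unstable_simplified is hypothetical, and the
two compatibility characters at the end of the sample list
(U+FB01 and U+00B2) are my own additions for contrast.

```python
import unicodedata


def unstable_simplified(cp: int) -> bool:
    """Sketch of the simplified rule Unstable (B): toNFKC(cp) != cp."""
    ch = chr(cp)
    return unicodedata.normalize("NFKC", ch) != ch


# Cased letters from the tables above, plus two compatibility
# characters (added here for contrast only).
SAMPLES = [0x0041, 0x00DF, 0x1E9E, 0x03C2, 0x03A3, 0x0130, 0xFB01, 0x00B2]

for cp in SAMPLES:
    print(f"{cp:04X} {unicodedata.name(chr(cp))}: unstable={unstable_simplified(cp)}")
```

With case folding out of the rule, the cased letters (including
00DF and 03C2) pass untouched, while the compatibility
characters are still caught by normalization alone.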

The idnabis-tables spec can also entirely dispense with
the LDH (E) rule, because 002D is already in the Exceptions
(F) list, there is nothing exceptional now about a-z,
and 0-9 get handled just like other digits, automatically.

If the protocol then requires, both for registration
and lookup, the application of toLowerCase(), you end
up with precisely the correct results for everything,
with 3 exceptions. I don't think you need to guess that
they are:

German eszett casing.

Greek final sigma casing.

Turkish i casing.
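
As an illustration only, here is what a locale-independent
lowercase mapping does to the exemplary letters, using Python's
str.lower() as an assumed stand-in for toLowerCase(). The name
map_label is hypothetical and not part of any draft.

```python
def map_label(label: str) -> str:
    """Assumed stand-in for toLowerCase(): Python's locale-independent lower()."""
    return label.lower()


# The ordinary cases map the way users would expect.
assert map_label("A") == "a"
assert map_label("\u00C1") == "\u00E1"   # A-acute -> a-acute
assert map_label("\u0391") == "\u03B1"   # Alpha -> alpha

# Sharp s and final sigma are already lowercase, so they survive
# unchanged (case folding would turn them into "ss" and U+03C3).
assert map_label("\u00DF") == "\u00DF"
assert map_label("\u03C2") == "\u03C2"

# Turkish i: the default mapping knows nothing about Turkish.
assert map_label("I") == "i"                  # not U+0131 dotless i
assert map_label("\u0130") == "i\u0307"       # i + COMBINING DOT ABOVE
```

The last two assertions are the Turkish problem in miniature:
without knowing the language, the mapping cannot produce the
pairing a Turkish user expects.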

For eszett and final sigma, I don't think you need to do
anything other than what you are already doing. Those characters
end up PVALID (automatically now, rather than by exception
table), but you still need the registries to be aware
of special issues when registering labels using them.

Only Turkish i casing is really hard -- but that is no
different than the situation you were already in for that.
My suggestion for Turkish i, in the spirit of the other
special character requirements for the CONTEXTJ characters
and the other few exceptions the working group wants to
allow in, would be to specify that U+0130 LATIN CAPITAL
LETTER I WITH DOT ABOVE be made CONTEXTO in the exceptions
table, and then figure out how to specify a workable
Turkish casing rule for it in the context rules.
In other words, if you cannot do Turkish casing correctly,
in the right context, then you don't use U+0130, period.
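
To show what "doing Turkish casing correctly" would have to
produce, here is a deliberately simplified sketch. It is not a
proposed context rule: it assumes the caller already knows the
label is Turkish, it only covers the two pairings I/U+0131 and
U+0130/i, and the names TURKISH_PRE_MAP and to_lower_turkish
are hypothetical.

```python
# Turkish pairings applied before the default lowercase mapping
# (assumption: the label is known to be Turkish).
TURKISH_PRE_MAP = str.maketrans({
    "\u0130": "i",       # capital I with dot above -> dotted i
    "I": "\u0131",       # capital I -> dotless i
})


def to_lower_turkish(label: str) -> str:
    """Simplified Turkish lowercasing: fix the i pairs, then lower()."""
    return label.translate(TURKISH_PRE_MAP).lower()


print(to_lower_turkish("\u0130STANBUL"))   # Turkish pairing
print("\u0130STANBUL".lower())             # default pairing, for comparison
```

The hard part, of course, is the assumption in the first
comment: deciding from context that Turkish casing applies at
all.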

And even *if* the specification of a context rule for
Turkish casing turns out to be problematical, you are in
no worse a situation than simply throwing up your
hands, metaphorically, and leaving Turkish i casing as
a player to be designated later, so to speak, while
opening the protocol up to unknown and essentially
unknowable attacks by people attempting mischievous
local mappings.

So here is a hearty endorsement for taking this direction,
which I think is a win-win-win-win, as they say. It
simplifies the table derivation, it makes everything
easier to understand, it better matches user expectations
(avoiding the astonishment problem), and it avoids
the need for the ominous security hole of "local mapping".

--Ken