local mappings

Sat Jan 24 05:03:35 CET 2009

Ken,

Thanks very much for the careful analysis and details.

Now let's see how others feel about it.

   john

--On Friday, January 23, 2009 17:47 -0800 Kenneth Whistler
<kenw at sybase.com> wrote:

> John,
> 
> Skipping over the long discussion of the wheres and whyfores,
> and getting to what I personally consider the meat of the
> question at hand:
> 
>> So, for example, we can reasonably expect that, for scripts
>> with case, users will be astonished if upper case characters
>> don't map to lower case ones.
> 
> IMO, yes.
> 
>> Could we agree to apply a lower-case
>> mapping globally 
> 
> OMG, yes!
> 
>> -- lower-case and not CaseFold, because
>> CaseFold is designed for comparison and not well-suited to
>> mapping (more or less quoting from TUS on that) and because
>> the subtle properties of CaseFold in the cases in which it
>> doesn't produce what LowerCase produces are themselves
>> astonishing to the unsophisticated)?
> 
> Actually, the only "astonishing" instances for CaseFold are
> precisely the problematical instances that everybody knows
> about, and which this list has been stumbling over for the
> protocol: German eszett, Greek final sigma, and Turkish i's.
> Since they *are* exceptions to the general rules, it
> shouldn't really be a surprise to anybody if the protocol
> handles them as special cases.
> 
>>    If we do lower-case, but continue to
>> ban compatibility characters and the other odd cases that
>> surprise those who don't know what is going on, does that help
>> us significantly with the compatibility and astonishment
>> situations that are really important? 
> 
> Yes, absolutely. If the group is serious about this, this
> can very much clear up the perceived complexity in the
> documents, as well as better matching the expectations of
> both protocol implementers and end users, IMO. *And* it
> gives you a way to avoid the entire morass of opening up
> the protocol to the mercy of undefined "local mappings".
> 
>> I don't think there are
>> many other situations similar to the lower-case one (which I
>> assume is why it keeps coming up in examples), but need advice
>> from Mark, Ken, and others as to whether there are any others
>> and what they are.
> 
> Nor do I. I'll explain what I think the implications are
> below.
> 
>> And I can only hope that even suggesting
>> this doesn't open cans of worms and arguments about which
>> mappings are more important than others.
> 
> It shouldn't. You can end up with a simpler and more elegant
> protocol this way, *and* avoid the security problem everybody
> is so rightfully concerned about.
> 
> O.k., details.
> 
> Currently we have the following specification, using
> exemplary letters:
> 
> CP   character               Table A derived value
> 
> 0061 lowercase-a             PVALID
> 0041 uppercase-A             DISALLOWED
> 00E1 lowercase-a-acute       PVALID
> 00C1 uppercase-A-acute       DISALLOWED
> 00DF lowercase-sharp-s       PVALID (by exception)
> 1E9E uppercase-sharp-S       DISALLOWED
> 03B1 lowercase-alpha         PVALID
> 0391 uppercase-Alpha         DISALLOWED
> 03C2 lowercase-final-sigma   PVALID (by exception)
> 03C3 lowercase-sigma         PVALID
> 03A3 uppercase-Sigma         DISALLOWED
> 0131 lowercase-dotless-i     PVALID
> 0130 uppercase-I-with-dot    DISALLOWED
> 
> The uppercase letters all end up DISALLOWED as the result
> of the complex and hard-to-explain rule Unstable (B):
> 
> toNFKC(toCaseFold(toNFKC(cp))) != cp
> 
> Sharp-s and lowercase-final-sigma would be DISALLOWED by
> the rule, so to match apparent user requirements, the
> table spec has to stick them in the Exceptions rule (F).
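
A minimal sketch of the Unstable (B) test quoted above, in Python. It assumes that str.casefold() (Unicode full case folding) and unicodedata.normalize("NFKC", ...) are acceptable stand-ins for toCaseFold and toNFKC, and it uses whatever Unicode version ships with the interpreter rather than the version the drafts were written against. The name unstable_b is hypothetical.

```python
import unicodedata


def unstable_b(ch: str) -> bool:
    """Return True if ch is changed by toNFKC(toCaseFold(toNFKC(ch)))."""
    nfkc = unicodedata.normalize("NFKC", ch)
    folded = nfkc.casefold()  # stand-in for toCaseFold (assumption)
    return unicodedata.normalize("NFKC", folded) != ch


if __name__ == "__main__":
    # Code points taken from the table in the message.
    for cp in (0x0061, 0x0041, 0x00DF, 0x1E9E, 0x03C2, 0x03C3, 0x0131, 0x0130):
        ch = chr(cp)
        print(f"{cp:04X} {unicodedata.name(ch):45} unstable={unstable_b(ch)}")
```

Under these assumptions the rule flags not only the uppercase letters but also 00DF and 03C2, which is why the current table has to rescue them through the Exceptions (F) rule.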
> 
> And then to deal with the one screwy casing issue left,
> the Turkish I casing issue, the protocol opens itself
> up to local mappings to allow the possibility for
> 0130 uppercase-I-with-dot and its unusual case pairings
> to work for Turkish, despite the fact that the table
> gives 0130 a DISALLOWED value.
> 
> Furthermore, to get labels to behave the way people expect
> them to in browsers, etc., everybody is depending on an
> *externally* specified definition of case folding, even
> though the protocol already depends on toCaseFold internally
> to build the derivation for Table A.
> 
> O.k., now here is how this will all look in the specification,
> *if* you make toLowerCase() a part of the specification
> formally.
> 
> CP   character               Table A derived value
> 
> 0061 lowercase-a             PVALID
> 0041 uppercase-A             PVALID
> 00E1 lowercase-a-acute       PVALID
> 00C1 uppercase-A-acute       PVALID
> 00DF lowercase-sharp-s       PVALID
> 1E9E uppercase-sharp-S       PVALID
> 03B1 lowercase-alpha         PVALID
> 0391 uppercase-Alpha         PVALID
> 03C2 lowercase-final-sigma   PVALID
> 03C3 lowercase-sigma         PVALID
> 03A3 uppercase-Sigma         PVALID
> 0131 lowercase-dotless-i     PVALID
> 0130 uppercase-I-with-dot    CONTEXTO (by exception)
> 
> In idnabis-tables, you further simplify as follows:
> 
> The rule Unstable (B) gets simplified to:
> 
> toNFKC(cp) != cp
> 
> All it needs to do is eliminate all the *normalization*
> unstable characters, and no longer needs to also deal
> with case folding at the same time.
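
The simplified rule can be sketched the same way. This is an illustration only: it assumes unicodedata.normalize("NFKC", ...) stands in for toNFKC, and the function name unstable_nfkc_only and the extra sample characters (a ligature and the ohm sign) are chosen here for illustration, not taken from the message.

```python
import unicodedata


def unstable_nfkc_only(ch: str) -> bool:
    """Return True if ch is changed by NFKC normalization alone."""
    return unicodedata.normalize("NFKC", ch) != ch


if __name__ == "__main__":
    samples = {
        0x0041: "uppercase-A",
        0x00DF: "lowercase-sharp-s",
        0x03C2: "lowercase-final-sigma",
        0x0130: "uppercase-I-with-dot",
        0xFB01: "fi ligature (compatibility character)",
        0x2126: "ohm sign (canonical singleton)",
    }
    for cp, label in samples.items():
        print(f"{cp:04X} {label:40} unstable={unstable_nfkc_only(chr(cp))}")
```

With case folding taken out of the rule, uppercase letters, sharp-s, and final sigma all pass, while compatibility characters are still rejected.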
> 
> 00DF lowercase-sharp-s and 03C2 lowercase-final-sigma
> no longer need to be in the Exceptions (F) list, because
> they aren't removed by a CaseFolding criterion in the
> first place.
> 
> The idnabis-tables spec can also entirely dispense with
> the LDH (E) rule, because 002D is already in the Exceptions
> (F) list, there is nothing exceptional now about a-z,
> and 0-9 get handled just like other digits, automatically.
> 
> If the protocol then requires, both for registration
> and lookup, the application of toLowerCase(), you end
> up with precisely the correct results for everything,
> with 3 exceptions. I don't think you need to guess that
> they are:
> 
> German eszett casing.
> 
> Greek final sigma casing.
> 
> Turkish i casing.
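
To make the three exceptions concrete, here is a hedged sketch that uses Python's locale-independent str.lower() as a stand-in for toLowerCase(). That substitution is an assumption: the protocol's eventual definition could differ, for instance in whether it applies the context-sensitive final-sigma rule that Python applies. The name to_lower is hypothetical.

```python
def to_lower(label: str) -> str:
    """Apply a locale-independent lowercase mapping (stand-in for toLowerCase)."""
    return label.lower()


if __name__ == "__main__":
    # Ordinary cased letters behave as users expect.
    print(to_lower("EXAMPLE"), to_lower("\u00C1") == "\u00E1")

    # Eszett: U+1E9E lowercases to U+00DF, but "SS" never becomes U+00DF.
    print(to_lower("\u1E9E") == "\u00DF", to_lower("STRASSE"))

    # Final sigma: Python picks U+03C2 at the end of a word and
    # U+03C3 elsewhere, so the result depends on context.
    print([f"{ord(c):04X}" for c in to_lower("\u039F\u0394\u039F\u03A3")])

    # Turkish i: without locale knowledge, U+0049 maps to U+0069 (not
    # U+0131), and U+0130 maps to U+0069 followed by U+0307.
    print([f"{ord(c):04X}" for c in to_lower("I\u0130")])
```

None of these three cases is resolved by the mapping alone, which is consistent with the message treating them as matters for registry policy or a context rule.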
> 
> For eszett and final sigma, I don't think you need to do
> anything other than what you are already doing. Those characters
> end up PVALID (automatically now, rather than by exception
> table), but you still need the registries to be aware
> of special issues when registering labels using them.
> 
> Only Turkish i casing is really hard -- but that is no
> different than the situation you were already in for that.
> My suggestion for Turkish i, in the spirit of the other
> special character requirements for the CONTEXTJ characters
> and the other few exceptions the working group wants to
> allow in, would be to specify that U+0130 LATIN CAPITAL
> LETTER I WITH DOT ABOVE be made CONTEXTO in the exceptions
> table, and then figure out how to specify a workable
> Turkish casing rule for it in the context rules.
> In other words, if you cannot do Turkish casing correctly,
> in the right context, then you don't use U+0130, period.
> 
> And even *if* the specification of a context rule for
> Turkish casing turns out to be problematical, you are in
> no worse a situation than simply throwing up your
> hands, metaphorically, and leaving Turkish i casing as
> a player to be designated later, so to speak, while
> opening the protocol up to unknown and essentially
> unknowable attacks by people attempting mischievous
> local mappings.
> 
> So here is a hearty endorsement for taking this direction,
> which I think is a win-win-win-win, as they say. It
> simplifies the table derivation, it makes everything
> easier to understand, it better matches user expectations
> (avoiding the astonishment problem), and it avoids
> the need for the ominous security hole of "local mapping".
> 
> --Ken