local mappings

Sat Jan 24 05:46:17 CET 2009

Quick reaction before coffee:

First, as I have said many times, I support specifications of mapping
rules.

But I really do see a need to distinguish between what is mapped to
something and what is possible to use in DNS.

Because of this, my immediate reaction is that I do not like uppercase A
and lowercase a both having the same property, PVALID.

Create a new property, and divide rule Stable in two, and I think I am  
with you.

    Patrik

On 24 Jan 2009, at 02:47, Kenneth Whistler <kenw at sybase.com> wrote:

> John,
>
> Skipping over the long discussion of the wheres and whyfores,
> and getting to what I personally consider the meat of the
> question at hand:
>
>> So, for example, we can reasonably expect that, for scripts with
>> case, users will be astonished if upper case characters don't
>> map to lower case ones.
>
> IMO, yes.
>
>> Could we agree to apply a lower-case
>> mapping globally
>
> OMG, yes!
>
>> -- lower-case and not CaseFold, because
>> CaseFold is designed for comparison and not well-suited to
>> mapping (more or less quoting from TUS on that) and because the
>> subtle properties of CaseFold in the cases in which it doesn't
>> produce what LowerCase produces are themselves astonishing to
>> the unsophisticated)?
>
> Actually, the only "astonishing" instances for CaseFold are
> precisely the problematical instances that everybody knows
> about, and which this list has been stumbling over for the
> protocol: German eszett, Greek final sigma, and Turkish i's.
> Since they *are* exceptions to the general rules, it
> shouldn't really be a surprise to anybody if the protocol
> handles them as special cases.
>
>>   If we do lower-case, but continue to
>> ban compatibility characters and the other odd cases that
>> surprise those who don't know what is going on, does that help
>> us significantly with the compatibility and astonishment
>> situations that are really important?
>
> Yes, absolutely. If the group is serious about this, this
> can very much clear up the perceived complexity in the
> documents, as well as better matching the expectations of
> both protocol implementers and end users, IMO. *And* it
> gives you a way to avoid the entire morass of opening up
> the protocol to the mercy of undefined "local mappings".
>
>> I don't think there are
>> many other situations similar to the lower-case one (which I
>> assume is why it keeps coming up in examples), but need advice
>> from Mark, Ken, and others as to whether there are any others
>> and what they are.
>
> Nor do I. I'll explain what I think the implications are
> below.
>
>> And I can only hope that even suggesting
>> this doesn't open cans of worms and arguments about which
>> mappings are more important than others.
>
> It shouldn't. You can end up with a simpler and more elegant
> protocol this way, *and* avoid the security problem everybody
> is so rightfully concerned about.
>
> O.k., details.
>
> Currently we have the following specification, using
> exemplary letters:
>
> CP   character               Table A derived value
>
> 0061 lowercase-a             PVALID
> 0041 uppercase-A             DISALLOWED
> 00E1 lowercase-a-acute       PVALID
> 00C1 uppercase-A-acute       DISALLOWED
> 00DF lowercase-sharp-s       PVALID (by exception)
> 1E9E uppercase-sharp-S       DISALLOWED
> 03B1 lowercase-alpha         PVALID
> 0391 uppercase-Alpha         DISALLOWED
> 03C2 lowercase-final-sigma   PVALID (by exception)
> 03C3 lowercase-sigma         PVALID
> 03A3 uppercase-Sigma         DISALLOWED
> 0131 lowercase-dotless-i     PVALID
> 0130 uppercase-I-with-dot    DISALLOWED
>
> The uppercase letters all end up DISALLOWED as the result
> of the complex and hard to explain rule Unstable (B):
>
> toNFKC(toCasefold(toNFKC(cp))) != cp
>
> Sharp-s and lowercase-final-sigma would be DISALLOWED by
> the rule, so to match apparent user requirements, the
> table spec has to stick them in the Exceptions rule (F).
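
A minimal Python sketch of rule Unstable (B) as quoted above. It assumes that Python's `str.casefold()` and `unicodedata.normalize("NFKC", ...)` are reasonable stand-ins for toCasefold and toNFKC. The function name `is_unstable_b` is hypothetical, and results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata


def nfkc(s: str) -> str:
    """Assumed stand-in for toNFKC()."""
    return unicodedata.normalize("NFKC", s)


def is_unstable_b(cp: int) -> bool:
    """Return True if toNFKC(toCasefold(toNFKC(cp))) != cp.

    str.casefold() is used as an assumed stand-in for toCasefold().
    """
    ch = chr(cp)
    return nfkc(nfkc(ch).casefold()) != ch


# A few of the exemplary letters from the table above.
for cp in (0x0061, 0x0041, 0x00DF, 0x03C2, 0x03C3, 0x0131, 0x0130):
    print(f"{cp:04X}", "unstable" if is_unstable_b(cp) else "stable")
```

Under these assumptions the uppercase letters, sharp-s, and final sigma all come out unstable, which is why the current table has to re-admit the last two through the Exceptions rule (F).
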
>
> And then to deal with the one screwy casing issue left,
> the Turkish I casing issue, the protocol opens itself
> up to local mappings to allow the possibility for
> 0130 uppercase-I-with-dot and its unusual case pairings
> to work for Turkish, despite the fact that the table
> gives 0130 a DISALLOWED value.
>
> Furthermore, to get labels to behave the way people expect
> them to in browsers, etc., everybody is depending on
> the *external* definition of case folding,
> even though to build the derivation for Table A, the
> protocol is internally depending on toCaseFold, anyway.
>
> O.k., now here is how this will all look in the specification,
> *if* you make toLowerCase() a part of the specification
> formally.
>
> CP   character               Table A derived value
>
> 0061 lowercase-a             PVALID
> 0041 uppercase-A             PVALID
> 00E1 lowercase-a-acute       PVALID
> 00C1 uppercase-A-acute       PVALID
> 00DF lowercase-sharp-s       PVALID
> 1E9E uppercase-sharp-S       PVALID
> 03B1 lowercase-alpha         PVALID
> 0391 uppercase-Alpha         PVALID
> 03C2 lowercase-final-sigma   PVALID
> 03C3 lowercase-sigma         PVALID
> 03A3 uppercase-Sigma         PVALID
> 0131 lowercase-dotless-i     PVALID
> 0130 uppercase-I-with-dot    CONTEXTO (by exception)
>
> In idnabis-tables, you further simplify as follows:
>
> The rule Unstable (B) gets simplified to:
>
> toNFKC(cp) != cp
>
> All it needs to do is eliminate all the *normalization*
> unstable characters, and no longer needs to also deal
> with case folding at the same time.
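
The simplified rule can be sketched the same way. This assumes `unicodedata.normalize("NFKC", ...)` as a stand-in for toNFKC; the name `is_unstable_simplified` is hypothetical.

```python
import unicodedata


def is_unstable_simplified(cp: int) -> bool:
    """Return True if toNFKC(cp) != cp (normalization instability only)."""
    ch = chr(cp)
    return unicodedata.normalize("NFKC", ch) != ch


# Cased letters are no longer caught by the rule ...
print(is_unstable_simplified(0x0041))  # uppercase-A
print(is_unstable_simplified(0x00DF))  # lowercase-sharp-s
# ... while a compatibility character such as U+FB01 (fi ligature) still is.
print(is_unstable_simplified(0xFB01))
```

With case folding taken out of the rule, only characters that NFKC itself changes are flagged, so sharp-s and final sigma pass without needing an exception entry.
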
>
> 00DF lowercase-sharp-s and 03C2 lowercase-final-sigma
> no longer need to be in the Exceptions (F) list, because
> they aren't removed by a CaseFolding criterion in the
> first place.
>
> The idnabis-tables spec can also entirely dispense with
> the LDH (E) rule, because 002D is already in the Exceptions
> (F) list, there is nothing exceptional now about a-z,
> and 0-9 get handled just like other digits, automatically.
>
> If the protocol then requires, both for registration
> and lookup, the application of toLowerCase(), you end
> up with precisely the correct results for everything,
> with 3 exceptions. I don't think you need to guess that
> they are:
>
> German eszett casing.
>
> Greek final sigma casing.
>
> Turkish i casing.
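
A small sketch of applying lowercasing on both the registration and lookup side, and of how the three exceptions show up. It assumes Python's locale-independent `str.lower()` as a stand-in for toLowerCase(); the helper names `to_lower_label` and `same_label` are hypothetical and not taken from any draft.

```python
def to_lower_label(label: str) -> str:
    """Assumed stand-in for toLowerCase(); not locale-aware."""
    return label.lower()


def same_label(registered: str, looked_up: str) -> bool:
    """Compare two labels after lowercasing both sides."""
    return to_lower_label(registered) == to_lower_label(looked_up)


# Ordinary case pairs match after lowercasing.
print(same_label("Example", "EXAMPLE"))

# Sharp-s is already lowercase, so it stays distinct from "ss".
print(same_label("stra\u00dfe", "strasse"))

# Final sigma is already lowercase, so it stays distinct from U+03C3.
print(to_lower_label("\u03c2") == "\u03c3")

# Without Turkish tailoring, "I" lowercases to "i" (not U+0131), and
# U+0130 lowercases to "i" followed by U+0307 COMBINING DOT ABOVE.
print(to_lower_label("I") == "\u0131")
print([f"{ord(c):04X}" for c in to_lower_label("\u0130")])
```

In this sketch sharp-s and final sigma simply pass through unchanged, while the Turkish pairings cannot be produced by a locale-independent lowercase operation at all, which matches the point that only Turkish i needs separate handling.
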
>
> For eszett and final sigma, I don't think you need to do
> anything other than what you are already doing. Those characters
> end up PVALID (automatically now, rather than by exception
> table), but you still need the registries to be aware
> of special issues when registering labels using them.
>
> Only Turkish i casing is really hard -- but that is no
> different than the situation you were already in for that.
> My suggestion for Turkish i, in the spirit of the other
> special character requirements for the CONTEXTJ characters
> and the other few exceptions the working group wants to
> allow in, would be to specify that U+0130 LATIN CAPITAL
> LETTER I WITH DOT ABOVE be made CONTEXTO in the exceptions
> table, and then figure out how to specify a workable
> Turkish casing rule for it in the context rules.
> In other words, if you cannot do Turkish casing correctly,
> in the right context, then you don't use U+0130, period.
>
> And even *if* the specification of a context rule for
> Turkish casing turns out to be problematical, you are in
> a no worse situation than simply throwing up your
> hands, metaphorically, and leaving Turkish i casing as
> a player to be designated later, so to speak, while
> opening the protocol up to unknown and essentially
> unknowable attacks by people attempting mischievous
> local mappings.
>
> So here is a hearty endorsement for taking this direction,
> which I think is a win-win-win-win, as they say. It
> simplifies the table derivation, it makes everything
> easier to understand, it better matches user expectations
> (avoiding the astonishment problem), and it avoids
> the need for the ominous security hole of "local mapping".
>
> --Ken