AW: AW: sharp s (Eszett)

Tue Mar 11 16:05:22 CET 2008

Harald,

Thanks for this analysis.

I'd like to make a case for your Alternative 2 (I do not fully
understand the implications of Alternative 3).  Let me try to
state it as a principle because, as you know, I hate special
cases.  If we have to _implement_ that principle as a special
case, that is a separate, and less significant, matter.

I think that we need to accept the fact that there are some
transformations and rules that work well for Unicode that do not
work well for domain names.   We also have a tool available to
us (or the relevant registries) in the form of the JET "variant"
model and its extensions and extrapolations (see, e.g., RFC 3743
and 4290) that is obviously not applicable to general Unicode
comparison use.

Our general rule should be to avoid information loss.  We
assume, both for historical reasons and because the folding is
symmetric, that no information is lost in transforming ordinary
Latin upper case characters into lower case ones.  But when we
start talking about non-reversible mappings and spelling rules,
we should keep the characters separate and registerable, rather
than taking excursions into the peculiarities of casefolding
(which the Unicode book acknowledges loses information, and
recommends against using precisely the way we are using it if
that can be avoided).

This reasoning suggests that the 
   ß -> ss 
mapping is really not, for our purposes, a case folding issue
but a typographic convenience (acceptable in some places,
distasteful in others and certainly not reversible) that is
really no different from, e.g.,
   ö -> oe
(acceptable in some places, distasteful in others, plain wrong
in still others, and certainly not reversible).
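To make the information loss concrete, here is a minimal Python sketch.  It assumes only the standard `str.casefold()` and `str.lower()` methods; the sample labels are mine, chosen for illustration, and nothing here is IDNA code.

```python
# Full case folding maps U+00DF (sharp s) to "ss"; simple lowercasing does not.
sharp = "stra\u00dfe"      # "straße"
plain = "strasse"

folded_sharp = sharp.casefold()
folded_plain = plain.casefold()

# Both spellings collapse to the same key, so the fold cannot be undone:
# nothing in "strasse" says which spelling the user typed.
same_key = folded_sharp == folded_plain

# Lowercasing alone leaves the sharp s untouched.
lower_keeps_sharp_s = sharp.lower() == sharp

# By contrast, case folding never turns o-umlaut into "oe"; that
# substitution is purely a spelling convention.
o_umlaut_is_stable = "\u00f6".casefold() == "\u00f6"

print(folded_sharp, same_key, lower_keeps_sharp_s, o_umlaut_is_stable)
```

The point of the comparison is that the two substitutions differ only in that one of them happens to be listed in the case folding tables.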

If a particular registry wants to treat these as the same, let
them use variants.   If we discard the information and treat
them the same as a matter of protocol, there is no way for the
registry to "fix" that.  Maybe that distinction, for cases in
which information loss actually occurs, is ultimately the
compelling argument.
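A rough sketch of why that distinction matters, in Python.  The `Registry` class, `variant_key` function and owner names are hypothetical, and this is not the algorithm of the JET guidelines, only an illustration that bundling can be a per-registry choice as long as the protocol keeps the characters distinct.

```python
def variant_key(label: str) -> str:
    """Hypothetical policy: treat sharp s and "ss" as variants of each other."""
    return label.replace("\u00df", "ss")


class Registry:
    """Toy registry that may or may not bundle sharp-s variants."""

    def __init__(self, bundle_variants: bool) -> None:
        self.bundle_variants = bundle_variants
        self.owners: dict[str, str] = {}

    def register(self, label: str, owner: str) -> bool:
        # The registry, not the protocol, decides whether variants collide.
        key = variant_key(label) if self.bundle_variants else label
        if self.owners.get(key, owner) != owner:
            return False          # a variant is already held by someone else
        self.owners[key] = owner
        return True


bundling = Registry(bundle_variants=True)
first = bundling.register("stra\u00dfe", "alice")
blocked = bundling.register("strasse", "bob")       # same bundle, other owner

separate = Registry(bundle_variants=False)
both_ok = (separate.register("stra\u00dfe", "alice")
           and separate.register("strasse", "bob"))
```

If the protocol itself had already mapped the sharp s to "ss", the second registry could not exist: both requests would arrive as the same label.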

Now, if we can agree on that principle, we need to examine the
set of characters that are transformed in non-obvious ways by
case folding.   For those that lose information in this way, we
have to figure out how to implement the principle, which may
require some additions to the exception list.
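As a starting point for that examination, one could scan for characters that are already lower case yet still change under full case folding.  The sketch below is only that: the helper name `lowercase_but_unstable` is mine, and the result depends on the Unicode version bundled with the Python interpreter, not on any IDNA table.

```python
import sys
import unicodedata


def lowercase_but_unstable(start: int = 0, stop: int = sys.maxunicode + 1):
    """Yield lower case letters (category Ll) that casefold() changes."""
    for cp in range(start, stop):
        ch = chr(cp)
        if unicodedata.category(ch) == "Ll" and ch.casefold() != ch:
            yield ch


# Restrict the scan to Latin-1 to keep the output short.
latin1_candidates = list(lowercase_but_unstable(0, 0x100))

for ch in latin1_candidates:
    folded = ch.casefold()
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> "
          + " ".join(f"U+{ord(c):04X}" for c in folded))
```

Each character such a scan turns up would still need a human decision about whether it belongs on the exception list.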

     john

--On Tuesday, 11 March, 2008 10:20 -0400 Harald Tveit Alvestrand
<harald at alvestrand.no> wrote:

> 
> 
> --On Tuesday, March 11, 2008 11:29:30 +0100 Georg Ochsner
> <g.ochsner at revolistic.com> wrote:
> 
>>> -----Original Message-----
>>> From: Kenneth Whistler
>>> Sent: Monday, 10 March 2008 21:55
>> 
>>> Jelte Jansen said:
>> 
>>> > So if IDNAbis would make an exception for the sharp S, and
>>> > allow it as a separate symbol, would there be people
>>> > running to their lawyers because they think it's
>>> > equivalent to ss too ...
>>> 
>>> It *is* equivalent to ss, too. It just depends on what level
>>> and type of equivalence you are talking about. They aren't
>>> equivalent for spelling, obviously -- but they *are*
>>> equivalent for some types of searching and sorting.
>> 
>> Why should they be equivalent in general just because there
>> are equivalences for "some types of searching and sorting"?
> 
> Note: We're not talking about "equivalent in general" here,
> we're talking about "permitted as a lookup key in the DNS".
> 
> Under IDNA2003, we were talking about "if the user gives us
> ß, we will look up ss". Under IDNA200x, we're only talking
> about "can ß be used to look up information in the DNS?",
> since mapping user expectations to lookup keys is considered
> to be outside the protocol.
> 
> We have four alternatives:
> 
> 1 - No, it can't
> 2 - Yes, it can, because we're adding a special case
> 3 - Yes, it can, and because of this example, we'll throw away
> the requirement that a character be stable under full
> casefolding in order to be used as a lookup key
> 4 - Yes, it can, and the Unicode tables should be changed to
> make it stable under casefolding.
> 
> 1 is what's currently proposed, 2 and 3 can be accomplished by
> changes to the idnabis documents (noting that 3 can have
> unforeseen effects on other characters), 4 requires changes to
> Unicode.
> 
> Note that if you're arguing the 4th case, that belongs on the
> unicode at unicode list, not here.
> 
>                       Harald