Re: Re: sharp s (Eszett)
mark.davis at icu-project.org
Tue Mar 11 20:12:14 CET 2008
It is not just a matter of "typographic convenience": the recognized
standard German uppercase of "ß" *is* "SS". Unicode did not invent this
relationship -- it is just following recognized German standards. In German
orthography ß is not just an ordinary letter like any other. If in normal
use ß were caseless, or if ß had a unique uppercase, we wouldn't be having
this discussion. But it is not normal. And the previous behavior in IDNA2003
can't be simply discarded. There are two main issues:
*1. IDNA compatibility.* Right now, all of the following point to the same
website. If we make this exception for ß, then they won't.
This is not just a UI issue, since the URLs above can be in all sorts of
data (email, web pages, etc.). And even if IDNA200x comes out soon, data and
programs exist that will only slowly be updated. So for an extended,
perhaps indefinite, amount of time browsers and search engines (like ours at
Google) will need to handle both IDNA2003 and IDNA200x URLs. When the
results under each system point to different places, that is a significant
problem and possible security issue.
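
A minimal sketch of the compatibility point, assuming Python and its built-in "idna" codec, which implements IDNA2003 (RFC 3490 with Nameprep). The names below are hypothetical examples, not the URLs from the original message, and idna2003_key is a helper name introduced only for this illustration.

```python
# Hypothetical example names; they are not the URLs from the original message.
NAMES = ["straße.de", "STRAßE.de", "strasse.de"]


def idna2003_key(name: str) -> bytes:
    """Return the ASCII lookup key that IDNA2003 processing yields for name."""
    # The codec applies Nameprep to non-ASCII labels, which maps ß to "ss".
    # ASCII-only names pass through unchanged, so lowercase the result to
    # mirror the case-insensitive comparison DNS applies to ASCII.
    return name.encode("idna").lower()


keys = {name: idna2003_key(name) for name in NAMES}
for name, key in keys.items():
    print(f"{name} -> {key!r}")

# All three spellings collapse to one key under IDNA2003.
assert set(keys.values()) == {b"strasse.de"}
```

If ß became a distinct permitted character, the first two names would yield a different key from the third, which is the divergence described above.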
*2. Case insensitivity.* If we make this exception, then uppercasing a
domain name causes it to go to a different place. Even if there were no
compatibility issue, there is still the issue of whether it is more
important to have ß or to have case-insensitivity.
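
The case-insensitivity point can be sketched with Python's standard string methods, which apply the full Unicode case mappings. The label is a hypothetical example; the sketch only shows that uppercasing followed by lowercasing does not return the original string.

```python
name = "straße"              # hypothetical label
upper = name.upper()         # full Unicode case mapping: ß becomes "SS"
round_trip = upper.lower()   # lowercasing cannot restore the ß

print(upper, round_trip)

assert upper == "STRASSE"
assert round_trip == "strasse"
assert round_trip != name                        # the round trip changes the label
assert name.casefold() == round_trip.casefold()  # yet both case-fold to the same string
```

Under a protocol that treated ß as distinct from ss, name and round_trip would be two different lookup keys even though one is reached from the other purely by changing case.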
While it would be possible to have an exception for ß, both of these issues
need to be considered very carefully, and we should not make any decision
lightly. Any proposal for an exception for ß really should get consensus
from a broad set of stakeholders, including DENIC, NIC.AT, and SWITCH, as
well as the standards bodies DIN, ÖN, and SNV.
On Tue, Mar 11, 2008 at 8:05 AM, John C Klensin <klensin at jck.com> wrote:
> Thanks for this analysis.
> I'd like to make a case for your Alternative 2 (I do not fully
> understand the implications of Alternative 3). Let me try to
> state it as a principle because, as you know, I hate special
> cases. If we have to _implement_ that principle as a special
> case, that is a separate, and less significant, matter.
> I think that we need to accept the fact that there are some
> transformations and rules that work well for Unicode that do not
> work well for domain names. We also have a tool available to
> us (or the relevant registries) in the form of the JET "variant"
> model and its extensions and extrapolations (see, e.g., RFC 3743
> and 4290) that is obviously not applicable to general Unicode
> comparison use.
> Our general rule should be to avoid information loss. While we
> assume, both for historical reasons and because their case
> folding is symmetric, that there is no information loss in
> transforming ordinary Latin upper case characters into lower
> case ones, when we start talking about non-reversible mappings
> and spelling rules, we should keep the characters separate
> and registerable, rather than taking excursions into the
> peculiarities of casefolding (which the Unicode book
> acknowledges loses information and recommends against precisely
> the way we are using it if that can be avoided).
> This reasoning suggests that the
> ß -> ss
> mapping is really not, for our purposes, a case folding issue
> but a typographic convenience (acceptable in some places,
> distasteful in others and certainly not reversible) that is
> really no different from, e.g.,
> ö -> oe
> (acceptable in some places, distasteful in others, plain wrong
> in still others, and certainly not reversible).
> If a particular registry wants to treat these as the same, let
> them use variants. If we discard the information and treat
> them the same as a matter of protocol, there is no way for the
> registry to "fix" that. Maybe that distinction, for cases in
> which information loss actually occurs, is ultimately the
> compelling argument.
> Now, if we can agree on that principle, we need to examine the
> set of characters that are transformed in non-obvious ways by
> case folding. For those that are like this, we have to figure
> out how to implement the principle, which may require some
> additions to the exception list.
> --On Tuesday, 11 March, 2008 10:20 -0400 Harald Tveit Alvestrand
> <harald at alvestrand.no> wrote:
> > --On Tuesday, March 11, 2008 11:29:30 +0100 Georg Ochsner
> > <g.ochsner at revolistic.com> wrote:
> >>> -----Original Message-----
> >>> From: Kenneth Whistler
> >>> Sent: Monday, 10 March 2008 21:55
> >>> Jelte Jansen said:
> >>> > So if IDNAbis would make an exception for the sharp S, and
> >>> > allow it as a separate symbol, would there be people
> >>> > running to their lawyers because they think it's
> >>> > equivalent to ss too ...
> >>> It *is* equivalent to ss, too. It just depends on what level
> >>> and type of equivalence you are talking about. They aren't
> >>> equivalent for spelling, obviously -- but they *are*
> >>> equivalent for some types of searching and sorting.
> >> Why should they be equivalent in general just because there
> >> are equivalences for "some types of searching and sorting"?
> > Note: We're not talking about "equivalent in general" here,
> > we're talking about "permitted as a lookup key in the DNS".
> > Under IDNA2003, we were talking about "if the user gives us
> > ß, we will look up ss". Under IDNA200x, we're only talking
> > about "can ß be used to look up information in the DNS?",
> > since mapping user expectations to lookup keys is considered
> > to be outside the protocol.
> > We have four alternatives:
> > 1 - No, it can't
> > 2 - Yes, it can, because we're adding a special case
> > 3 - Yes, it can, and because of this example, we'll throw away
> > the requirement that a character be stable under full
> > casefolding in order to be used for lookup
> > 4 - Yes, it can, and the Unicode tables should be changed to
> > make it stable under casefolding.
> > 1 is what's currently proposed, 2 and 3 can be accomplished by
> > changes to the idnabis documents (noting that 3 can have
> > unforeseen effects on other characters), 4 requires changes to
> > Unicode.
> > Note that if you're arguing the 4th case, that belongs on the
> > unicode at unicode list, not here.
> > Harald
> > _______________________________________________
> > Idna-update mailing list
> > Idna-update at alvestrand.no
> > http://www.alvestrand.no/mailman/listinfo/idna-update