AW: AW: AW: sharp s (Eszett)

John C Klensin klensin at jck.com
Tue Mar 18 12:24:30 CET 2008



--On Monday, 17 March, 2008 18:41 -0700 Paul Hoffman
<phoffman at imc.org> wrote:

> At 7:36 PM -0400 3/17/08, John C Klensin wrote:
>> Yes.  And because of how IDNA2003 works, nothing can talk you
>> to "maßlos.de" because, regardless of what can be typed, that
>> string (or, more specifically, its ACE equivalent) cannot be
>> registered.  Note that, even if one uses Maßlos.DE to get to
>> it, the result is not an IDN (its DNS form is the ASCII
>> "masslos.de", not a punycode-encoded ACE) which is the source
>> of even more confusion.
> 
> The phrase "nothing can talk to" seems wrong. Entering
> maßlos.de into a browser location bar correctly converts to
> masslos.de, and the browser can "talk to" it.
>
> Did you mean something different?

Sorry, typographical error that you resolved differently than I
intended.  I meant not "talk to" but "take you to".  One
cannot get to punycode(maßlos) (although it is perfectly
well-defined) because ToASCII(maßlos) -> masslos.

However, as a more general observation, "correctly" above == "as
IDNA2003 defines 'correct'".  Not as the formal/official
authorities on the German language (in Germany) define
"correct", because they appear to insist that maßlos is
potentially a different word from masslos.    Now, we can
_decide_ that they will be treated as the same or that "ß"
should be disallowed entirely because the benefits are just not
worth it.   In principle --although the advantages would be
fewer and the costs higher-- we could decide that "j" should be
disallowed or somehow treated as the same as "i" or that the
problematic dotless i should be disallowed entirely.  Getting
rid of "j" or dotless i as distinct characters would work in
some languages that use Roman-derived scripts.   Of course,
doing so would lose information --or prohibit the registration
of some significant words (remember, we don't register "words",
we register mnemonic label strings that might happen to be words
of some language)-- but that is the case for Eszett too.

I stress "decide" above because I don't see us as inexorably
bound to the particular mappings of Unicode case folding,
especially since that standard encourages the use of local
variations when needed.  I see "what does case folding do?" as
valuable input and a default that works well in most cases, but
not as a set of rules that have to be followed because they are
there.  And my view of Eszett in IDNA2003 is that we made a
mistake.  The mistake was not that it is mapped, but that we
didn't consider it, and a few other unusual cases (including the
few final-form characters that appear in scripts that also make
case distinctions and that are coded as separate code points)
much more carefully.
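
As a rough illustration of what default Unicode case folding does to
these characters, here is a small sketch using Python's str.casefold(),
which applies the default full case folding.  The sample strings are
chosen for the example, and treating Greek final sigma as one of the
"final-form characters" meant above is an assumption.

```python
# Illustrative sketch: str.casefold() applies Unicode default (full)
# case folding, without any locale-specific tailoring.

samples = {
    "Eszett": "maßlos",          # "ß" folds to "ss"
    "final sigma": "λόγος",      # final "ς" folds to ordinary "σ" (assumed example)
    "dotless i": "ı",            # no default folding; only the Turkic tailoring touches it
}

folded = {name: text.casefold() for name, text in samples.items()}

for name, text in samples.items():
    result = folded[name]
    status = "changed" if result != text else "unchanged"
    print(f"{name}: {text!r} -> {result!r} ({status})")
```

In each changed case the folded form no longer records which
character was typed, which is the information loss at issue.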

As I've tried to say before, I don't have a clue as to how we
resolve those issues at this stage, and I do believe that
backward compatibility and transition need to be considered
carefully, but I don't think repeating ourselves more times is a
plausible path to doing so.

    john





