AW: AW: AW: sharp s (Eszett)
John C Klensin
klensin at jck.com
Tue Mar 18 12:24:30 CET 2008
--On Monday, 17 March, 2008 18:41 -0700 Paul Hoffman
<phoffman at imc.org> wrote:
> At 7:36 PM -0400 3/17/08, John C Klensin wrote:
>> Yes. And because of how IDNA2003 works, nothing can talk you
>> to "maßlos.de" because, regardless of what can be typed, that
>> string (or, more specifically, its ACE equivalent) cannot be
>> registered. Note that, even if one uses Maßlos.DE to get to
>> it, the result is not an IDN (its DNS form is the ASCII
>> "masslos.de", not a punycode-encoded ACE) which is the source
>> of even more confusion.
>
> The phrase "nothing can talk to" seems wrong. Entering
> maßlos.de into a browser location bar correctly converts to
> masslos.de, and the browser can "talk to" it.
> Did you mean something different?
Sorry, typographical error that you resolved differently than I
intended. I intended not "talk to" but "take you to". One
cannot get to punycode(maßlos) (although it is perfectly
well-defined) because ToASCII(maßlos) -> masslos.
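A minimal sketch of that behavior, assuming Python's built-in "idna" codec (which implements IDNA2003, i.e. RFC 3490 with Nameprep) is a fair stand-in for what a browser does. The variable names are illustrative only.

```python
# Assumption: Python's built-in "idna" codec follows IDNA2003 (RFC 3490 + Nameprep).
name = "maßlos.de"

# ToASCII under IDNA2003: Nameprep maps "ß" to "ss" before any encoding,
# so the result is plain ASCII and never becomes an "xn--" ACE label.
ace = name.encode("idna")
print(ace)  # b'masslos.de'

# Raw Punycode of the unmapped label is perfectly well-defined...
raw = "maßlos".encode("punycode")
print(raw)

# ...and it round-trips, but ToASCII never produces it under IDNA2003.
print(raw.decode("punycode"))  # maßlos
```

The point of the comparison is that the Punycode form exists but is unreachable: the mapping step removes the "ß" before the encoding step ever sees it.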
However, as a more general observation, "correctly" above == "as
IDNA2003 defines 'correct'". Not as the formal/official
authorities on the German language (in Germany) define
"correct", because they appear to insist that maßlos is
potentially a different word from masslos. Now, we can
_decide_ that they will be treated as the same or that "ß"
should be disallowed entirely because the benefits are just not
worth it. In principle --although the advantages would be
fewer and the costs higher-- we could decide that "j" should be
disallowed or somehow treated as the same as "i" or that the
problematic dotless i should be disallowed entirely. Getting
rid of "j" or dotless i as distinct characters would work in
some languages that use Roman-derived scripts. Of course,
doing so would lose information --or prohibit the registration
of some significant words (remember, we don't register "words",
we register mnemonic label strings that might happen to be words
of some language)-- but that is the case for Eszett too.
I stress "decide" above because I don't see us as inexorably
bound to the particular mappings of Unicode case folding,
especially since that standard encourages the use of local
variations when needed. I see "what does case folding do?" as
valuable input and a default that works well in most cases, but
not as a set of rules that have to be followed because they are
there. And my view of Eszett in IDNA2003 is that we made a
mistake. The mistake was not that it is mapped, but that we
didn't consider it, and a few other unusual cases (including the
few final-form characters that appear in scripts that also make
case distinctions and that are coded as separate code points)
much more carefully.
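For reference, a rough sketch of what default Unicode case folding does with the characters mentioned here. It assumes Python's str.casefold() (full default case folding from the current Unicode data), which is close to, but not the same table as, the Unicode 3.2-based mapping that Nameprep uses in IDNA2003.

```python
import unicodedata

# Characters discussed above: Eszett, a final-form letter in a cased script
# (Greek final sigma), dotless i, and an ordinary cased letter for comparison.
samples = ["ß", "ς", "ı", "J"]

# Map each character to its default (non-locale-specific) case folding.
folded = {ch: ch.casefold() for ch in samples}

for ch, result in folded.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {result!r}")
```

Under default folding, Eszett becomes "ss" and final sigma becomes the ordinary small sigma, so both distinctions are lost, while dotless i is left alone because its mappings exist only in the Turkic-specific folding.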
As I've tried to say before, I don't have a clue as to how we
resolve those issues at this stage, and I do believe that
backward compatibility and transition need to be considered
carefully, but I don't think repeating ourselves more times is a
plausible path to doing so.
john