Re: idna-bis and 'ß'
John C Klensin
klensin at jck.com
Fri Nov 23 19:36:28 CET 2007
--On Friday, November 23, 2007 7:06 AM -0800 Paul Hoffman
<phoffman at imc.org> wrote:
> At 6:07 PM -0500 7/26/07, Thomas Roessler wrote:
>> In the current IDNA environment, the 'ß' character (latin
>> small letter sharp s) is mapped to the all-Latin string "ss".
>> Therefore, no traces of that character can be found in zone
>> files or registration databases; however, references to
>> domain names (even all-ASCII ones) can be written in that
>> fashion, and presented on business cards, in printed
>> material, and in IRIs.
>>
>> I understand that the current state of idna-bis would lead to
>> treating 'ß' as a "usual" character, causing a different
>> encoding of relevant input strings. Taking that step in
>> idna-bis would cause existing references written with the
>> 'ß' character (and possibly resolving to all-ASCII domain
>> names) to break.
>>
>> As a cure, the current mapping behavior should be preserved
>> as an exception.
>
> To Patrik, John, et. al.: Is there a list of characters from
> IDNA that are mapped to ASCII that are proposed to map to
> non-ASCII in idnabis?
To the best of my knowledge, if your question is read precisely,
there is exactly one such character: Eszett (ß).
There is a large collection of characters, dominated by the
"mathematical" font and style variations, that are mapped to
ordinary ASCII by the compatibility mappings of NFKC and
IDNA2003, but IDNAbis rejects them entirely, leaving any
mappings to a user interface issue.
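A minimal sketch of the distinction drawn above, using only the Python standard library. The choice of Python, the sample character U+1D41A, and the label "straße.example" are assumptions made for illustration; Python's built-in "idna" codec implements IDNA2003, not IDNAbis.

```python
import unicodedata

# Sample "mathematical" style variant (illustrative choice):
# U+1D41A MATHEMATICAL BOLD SMALL A.
math_a = "\U0001D41A"

# The NFKC compatibility mapping turns the style variant into ASCII "a".
math_a_nfkc = unicodedata.normalize("NFKC", math_a)

# Eszett is left alone by NFKC; normalization is not what produces "ss".
eszett_nfkc = unicodedata.normalize("NFKC", "ß")

# The built-in "idna" codec follows IDNA2003 (RFC 3490), so Eszett is
# mapped to "ss" before encoding and the result is an all-ASCII name.
idna2003_name = "straße.example".encode("idna")

print(repr(math_a_nfkc), repr(eszett_nfkc), idna2003_name)
```

The point of the sketch is only that the Eszett mapping comes from the case-mapping step of IDNA2003 rather than from NFKC itself.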
The problems arise in case-mapping rather than ordinary
normalization. Under the IDNAbis rules, case mapping is also a
user interface issue, with upper case characters being
prohibited in the protocol because they cannot be stored in the
DNS without loss of information. However, eliminating the
requirement that ToASCII perform case mapping eliminates the
requirement that characters that exist in lower case only be
mapped to something special. Lower case characters are just
characters.
There are few such characters and IDNA2003 does not handle them
consistently (because Unicode doesn't).
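As a rough illustration of that inconsistency, the sketch below compares Eszett and Kra under Unicode case operations. It uses Python's str.casefold() and str.upper() as stand-ins for the Unicode case data that IDNA2003 builds on; that substitution is an assumption, and the exact IDNA2003 tables are not reproduced here.

```python
# Two characters that exist in lower case only.
eszett = "\u00DF"  # LATIN SMALL LETTER SHARP S
kra = "\u0138"     # LATIN SMALL LETTER KRA

# Unicode full case folding expands Eszett to a two-letter sequence...
eszett_folded = eszett.casefold()
# ...but leaves Kra as it is.
kra_folded = kra.casefold()

# Upper-casing shows the same split: Eszett becomes "SS", Kra is unchanged.
eszett_upper = eszett.upper()
kra_upper = kra.upper()

print(eszett_folded, kra_folded, eszett_upper, kra_upper)
```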
Given the IDNAbis model --without making any real exceptions at
all-- Eszett (and Kra (U+0138) and whatever else is out there)
can be handled in either of two ways:
(i) They can be banned entirely, leaving any mapping to be a
user interface issue.
(ii) They can be treated as ordinary lower-case characters.
I don't have a strong opinion. Indeed, I have almost no opinion
at all. We selected the second because it seemed logical
("logical" in this strange world of charsets, IDNs, and
compatibility doesn't imply "right") and because we were hearing
what seemed like considerable outcry to actually permit Eszett
in labels, i.e., to treat it as a normal character. If
compatibility is more important than having the character, then
it can easily be banned. The German-speaking/using community
just has to somehow figure out what they consider the best
answer and tell us.
The option to which I would, personally, be opposed is to
re-introduce mappings. If mapped to "ss", Eszett would become
the only character that we are mapping into another sequence
that is not covered by NFC and where the original character is
not recoverable from the A-label form. The input we have
gotten, and the confusion we have seen, suggest that
recoverability at the protocol level is fairly important and
that its absence is a source of confusion (independent of
whatever UIs might do).
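The recoverability point can be sketched as follows. This is an illustration under stated assumptions: it applies Python's raw "punycode" codec with a hand-added "xn--" prefix to stand in for an A-label that keeps Eszett, and it skips every validity check that IDNAbis would perform. The "idna" codec is used to show the IDNA2003 mapping for comparison.

```python
label = "straße"

# Treating Eszett as an ordinary character: encode the label as-is.
# (Raw Punycode plus the ACE prefix; no IDNAbis validity checks.)
a_label = "xn--" + label.encode("punycode").decode("ascii")

# The original label can be recovered from the A-label form.
recovered = a_label[len("xn--"):].encode("ascii").decode("punycode")

# Mapping Eszett to "ss" (IDNA2003 behavior of the built-in codec):
# the stored form is plain ASCII and the Eszett cannot be recovered.
mapped = label.encode("idna")
mapped_back = mapped.decode("idna")

print(a_label, recovered == label, mapped, mapped_back == label)
```

In the first case the round trip returns the original label; in the second, decoding yields "strasse" and nothing in the stored form indicates that an Eszett was ever present.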
john