Re: idna-bis and 'ß'

Fri Nov 23 19:36:28 CET 2007

--On Friday, November 23, 2007 7:06 AM -0800 Paul Hoffman 
<phoffman at imc.org> wrote:

> At 6:07 PM -0500 7/26/07, Thomas Roessler wrote:
>> In the current IDNA environment, the 'ß' character (latin
>> small letter sharp s) is mapped to the all-Latin string "ss".
>> Therefore, no traces of that character can be found in zone
>> files or registration databases; however, references to
>> domain names (even all-ASCII ones) can be written in that
>> fashion, and presented on business cards, in printed
>> material, and in IRIs.
>>
>> I understand that the current state of idna-bis would lead to
>> treating 'ß' as a "usual" character, causing a different
>> encoding of relevant input strings.  Taking that step in
>> idna-bis would cause existing references written with the
>> 'ß' character (and possibly resolving to all-ASCII domain
>> names) to break.
>>
>> As a cure, the current mapping behavior should be preserved
>> as an exception.
>
> To Patrik, John, et al.: Is there a list of characters from
> IDNA that are mapped to ASCII that are proposed to map to
> non-ASCII in idnabis?

To the best of my knowledge, if your question is read precisely,
there is exactly one such character: Eszett (ß).
There is a large collection of characters, dominated by the
"mathematical" font and style variations, that are mapped to
ordinary ASCII by the compatibility mappings of NFKC and
IDNA2003, but IDNAbis rejects them entirely, leaving any
mappings as a user interface issue.
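
A minimal Python sketch of the distinction drawn above, using only
the standard library. It assumes that `unicodedata.normalize` and
`str.casefold` are acceptable stand-ins for the NFKC and case-folding
steps; they follow the Unicode version shipped with the interpreter,
not the exact Nameprep tables of IDNA2003. The helper name `nfkc` is
made up for this illustration.

```python
import unicodedata


def nfkc(s: str) -> str:
    """Apply NFKC compatibility normalization (illustrative helper)."""
    return unicodedata.normalize("NFKC", s)


math_bold_a = "\U0001D41A"  # MATHEMATICAL BOLD SMALL A
eszett = "\u00DF"           # LATIN SMALL LETTER SHARP S

# A "mathematical" style variant collapses to plain ASCII under NFKC.
print(nfkc(math_bold_a))

# Eszett is untouched by NFKC; it only becomes "ss" through case folding.
print(nfkc(eszett) == eszett)
print(eszett.casefold())
```

The point of the sketch is only that the two groups reach ASCII by
different routes: compatibility normalization for the style variants,
case folding for Eszett.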

The problems arise in case-mapping rather than ordinary 
normalization.  Under the IDNAbis rules, case mapping is also a 
user interface issue, with upper case characters being 
prohibited in the protocol because they cannot be stored in the 
DNS without loss of information.  However, eliminating the
requirement that ToASCII perform case mapping eliminates the
requirement that characters that exist in lower case only be
mapped to something special.  Lower case characters are just
characters.

There are few such characters and IDNA2003 does not handle them 
consistently (because Unicode doesn't).
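
As a rough illustration of how unevenly lower-case-only characters
behave, the following Python sketch reports the case mappings that
the interpreter's Unicode data gives for Eszett and Kra. The function
name `case_report` is hypothetical, and Python's tables are an
assumption here: they track a newer Unicode version than the one
IDNA2003 is tied to.

```python
import unicodedata


def case_report(ch: str) -> dict:
    """Collect the name and case mappings of one character (illustrative)."""
    return {
        "name": unicodedata.name(ch),
        "upper": ch.upper(),       # full upper-case mapping
        "folded": ch.casefold(),   # full case folding
    }


eszett = "\u00DF"  # LATIN SMALL LETTER SHARP S
kra = "\u0138"     # LATIN SMALL LETTER KRA

for ch in (eszett, kra):
    print(case_report(ch))
```

Eszett expands to a two-letter sequence under both upper-casing and
case folding, while Kra has no upper-case mapping at all and is
returned unchanged by `upper()`.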

Given the IDNAbis model --without making any real exceptions at 
all-- Eszett (and Kra (U+0138) and whatever else is out there) 
can be handled in either of two ways:

    (i) They can be banned entirely, leaving any mapping to be a
    user interface issue.

    (ii) They can be treated as ordinary lower-case characters.

I don't have a strong opinion.  Indeed, I have almost no opinion
at all.  We selected the second because it seemed logical
("logical" in this strange world of charsets, IDNs, and
compatibility doesn't imply "right") and because we were hearing
what seemed like considerable demand to actually permit Eszett
in labels, i.e., to treat it as a normal character.  If
compatibility is more important than having the character, then
it can easily be banned.  The German-speaking/using community
just has to somehow figure out what they consider the best
answer and tell us.

The option to which I would, personally, be opposed is to
re-introduce mappings.  If mapped to "ss", Eszett would become
the only character that we are mapping into another sequence
that is not covered by NFC and where the original character is
not recoverable from the A-label form.  The input we have
gotten, and the confusion we have seen, suggest that
recoverability at the protocol level is fairly important and
that its absence is a source of confusion (independent of
whatever UIs might do).
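
The recoverability argument can be sketched in Python under two
assumptions: that the built-in `idna` codec (which implements the
IDNA2003 ToASCII rules) represents the mapping behavior, and that a
bare Punycode encoding with the `xn--` prefix approximates a label
encoded with no mapping step. The helper names below are hypothetical,
and the sketch skips all validity checks an actual IDNAbis
implementation would need.

```python
def idna2003_label(label: str) -> str:
    """Encode one label with Python's IDNA2003 codec (mapping included)."""
    return label.encode("idna").decode("ascii")


def raw_a_label(label: str) -> str:
    """Encode one non-ASCII label with no mapping: Punycode plus prefix."""
    return "xn--" + label.encode("punycode").decode("ascii")


def from_a_label(a_label: str) -> str:
    """Decode an xn-- label produced by raw_a_label."""
    return a_label[len("xn--"):].encode("ascii").decode("punycode")


original = "stra\u00DFe"  # "straße"

# With mapping, Eszett is gone before encoding and cannot be recovered.
mapped = idna2003_label(original)
print(mapped, mapped == idna2003_label("strasse"))

# Without mapping, the A-label decodes back to the original string.
encoded = raw_a_label(original)
print(encoded, from_a_label(encoded) == original)
```

With the mapping in place, "straße" and "strasse" are indistinguishable
once encoded; without it, the original spelling survives the round trip.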

    john