Re: idna-bis and 'ß'

Tue Nov 27 08:13:13 CET 2007

At 08:06 07/11/27, Thomas Roessler wrote:
>On 2007-11-26 08:03:38 -0500, John C Klensin wrote:
>
>> Put differently, to the extent to which IRIs specify a user
>> interface behavior, it would be perfectly reasonable for the
>> IRI spec to specify that SHARP S should be mapped to some
>> other character or character sequence ("ss" by the
>> orthography rules of some German-speaking countries, "fs" by
>> appearance, "ſs" (U+017F U+0073) by origin, etc.
>> Certainly, if it is to be mapped to anything but itself,
>> that needs to be specified.  
>
>That's, as you wrote in your earlier message, in fact only a
>smaller part of the Grand Plan to get rid of mappings.  While I
>sympathize with that plan, I worry that it might break
>references to domain names in existing documents (read: Web
>pages) -- in a place that doesn't really qualify as a user
>interface.

In existing Web documents, in notes that people have made
of pages they want to visit again, on business cards, on the
side of a bus, and so on. Compared to this, the tweaks to
Unicode Normalization that the Unicode consortium has made
over the years, and that some people closely involved in
the IETF have denounced in various ways, look extremely
benign. 

>While, in theory, it sounds attractive to finally treat the
>Turkish dotless i (and similar peculiarities) reasonably by
>dealing with them in a place where there is superior knowledge,
>it would appear that at least in the Web use case non-ASCII
>domain names will be processed in places where that knowledge
>has already been lost (i.e., the user's browser when it hits
>notepad-generated HTML content).  Even worse, the author's and
>the user's browser might not be interoperating when it comes to
>interpreting IRI references in content.
>
>Effectively, this would seem to imply that (much of) the
>nameprep mapping niceties would have to move from the IDN spec
>to the IRI spec and other specifications layered on top of it.

Yes indeed. There are more or less three ways to proceed:
a) Go back to doing the mapping in IDNA
b) Try to push that mapping into other places
c) Totally abandon mapping

Saying that something is a user-interface issue is easy,
but the implementation consequences are not easy at all.
With the current IDNA architecture, mapping happens at
a single place in the protocol stack. Any IDNA library
does it, or it would not want to call itself an IDNA
library. That leads to consistent and predictable behavior
from a user viewpoint. Declaring mapping a "user-interface
issue" will lead to a confusing hodgepodge, either at the
level of specs using IDNA (such as IRI) or at the implementation
level, or most probably both. Making the Turkish dotless
i work correctly on Turkish systems would be a good thing,
but nothing in the idnabis drafts currently requires that.
We risk ending up with hopelessly confused users because
some applications deal with IDNs in a case-sensitive manner,
while others deal with them in a case-insensitive manner,
and some of the Turkish users might get the 'right' behavior,
but others may not.
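
To make the difference concrete, here is a minimal sketch in Python.
It assumes that Python's built-in "idna" codec (which follows IDNA2003)
can stand in for "an IDNA library that maps", and that the raw
"punycode" codec can stand in for an application that skips the
mapping step. The example label is my own choice, not something from
the drafts.

```python
# Sketch: the same label typed in two cases, with and without mapping.
labels = ["BÜCHER", "bücher"]

# The "idna" codec follows IDNA2003: it applies the nameprep mapping
# (including case folding) before the Punycode step.
with_mapping = [label.encode("idna") for label in labels]

# The raw "punycode" codec does no mapping at all; it stands in for an
# application that treats mapping as somebody else's problem.
without_mapping = [b"xn--" + label.encode("punycode") for label in labels]

same_with_mapping = with_mapping[0] == with_mapping[1]
same_without_mapping = without_mapping[0] == without_mapping[1]

print(with_mapping, same_with_mapping)
print(without_mapping, same_without_mapping)
```

With mapping, both spellings end up as the same A-label; without it,
the two spellings produce different names, which is the kind of
inconsistency users would see across applications.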

One way to get around this would be to add a sentence along
the following lines:
"Applications converting from U-labels to A-labels SHOULD
apply the mappings specified in IDNA2003, unless they know
that the users of their system expect something different."
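
A rough sketch of what such a rule could look like in an application,
again in Python. The function names `u_label_to_a_label` and
`turkish_mapping` are hypothetical, the Turkish rule shown is a
deliberately simplified assumption, and `nameprep` from the standard
library's `encodings.idna` module is used as the IDNA2003 mapping.

```python
from encodings.idna import nameprep  # IDNA2003 mapping (standard library)


def turkish_mapping(label: str) -> str:
    """Hypothetical local rule: I -> dotless i, dotted capital I -> i."""
    return label.replace("I", "\u0131").replace("\u0130", "i").lower()


def u_label_to_a_label(label: str, local_mapping=None) -> str:
    """Map a label, then encode it.

    By default the IDNA2003 mapping is applied; a caller that knows its
    users expect something different passes `local_mapping` instead.
    """
    mapped = local_mapping(label) if local_mapping else nameprep(label)
    if mapped.isascii():
        return mapped  # nothing to encode
    return "xn--" + mapped.encode("punycode").decode("ascii")


default_result = u_label_to_a_label("ISIK")
turkish_result = u_label_to_a_label("ISIK", turkish_mapping)
print(default_result, turkish_result)
```

The two calls produce different names for the same typed input, so the
"unless" clause trades global predictability for local correctness.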

Another way would be to openly admit that IDNs are lower-case
only, and recommend that no mapping be done anymore.

Anyway, the abandonment of mappings in going from IDNA2003 to
idnabis is something that will leak, in the same bad way that
punycode has been leaking.

>> But it should not be an IDNA problem, especially since IRIs
>> might choose to map it differently in different contexts (I
>> don't need to remind either of you that tails are
>> case-sensitive so the IDNA2003 rules don't apply).
>
>More significantly, "interesting" processing of tails happens
>close to places where they were authored, so much of the
>concern goes away for that part of the URI anyway.

Ah, I see, with 'tails' you mean the path/query/fragment
parts of a URI.

First, please note that neither URIs nor IRIs map any of these
parts. Some applications that have to compare URIs or IRIs use
some knowledge about case-sensitivity (or its absence) when
comparing URIs or IRIs; details are discussed in section
6 of RFC 3986 and section 5 of RFC 3987.
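
As an illustration of that kind of comparison, here is a small Python
sketch that lowercases only the scheme and the authority before
comparing, and leaves path, query and fragment untouched. The helper
names `normalize_case` and `equivalent` are hypothetical, and
lowercasing the whole authority is a simplification (it would also
lowercase any userinfo).

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_case(uri: str) -> str:
    """Lowercase the case-insensitive parts; leave the 'tail' as written."""
    parts = urlsplit(uri)
    return urlunsplit((
        parts.scheme.lower(),   # scheme is case-insensitive
        parts.netloc.lower(),   # simplification: lowercases userinfo too
        parts.path,             # path, query and fragment are not mapped
        parts.query,
        parts.fragment,
    ))


def equivalent(a: str, b: str) -> bool:
    """Compare two URIs after case normalization only."""
    return normalize_case(a) == normalize_case(b)


host_case_only = equivalent("HTTP://Example.COM/Path?Q=1",
                            "http://example.com/Path?Q=1")
path_case_differs = equivalent("http://example.com/Path",
                               "http://example.com/path")
print(host_case_only, path_case_differs)
```

Note that the comparison code carries the knowledge about case; the
URIs themselves are never rewritten.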

Second, to have some parts of a URI or IRI be case-sensitive,
while other parts of it are case-insensitive, is perfectly fine.
What creates problems is to suddenly change the sensitivity of
a certain part.

Regards,    Martin.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst at it.aoyama.ac.jp