HTTP and IDN, was RE: Nameprep input vs output

Sat Jan 13 06:37:08 CET 2007

Hello Erik,

I basically agree with Michel. I wouldn't go so far as calling
using raw octets in HTTP request URIs a 'new protocol', because
there is at least some anecdotal evidence that it works currently
in some cases, but my understanding is that the current effort
on the HTTP spec is to move that spec to full IETF Standard,
which excludes late and drastic changes.

At 21:37 07/01/11, Erik van der Poel wrote:
>Hi Michel,
>
>Thanks for the reply!
>
>I personally don't feel that this is seriously off-topic, since many
>IDNA implementors will have questions related to this. The way I see
>it, the current use (in IDNs in HTML) of Unicode characters with
>compatibility decompositions (such as full-width w and fl ligature) is
>there for historical reasons, and it may be too late to get rid of this
>usage.

The more I thought about the full-width w and the fl ligature examples,
the more I came to the conclusion that these are just garbage that
we should ignore. Writing just one of the three 'w' in www. with
a full-width character can only have happened by accident. And it
probably wasn't tested at all, because on IE6 (still the most deployed
browser), it just won't work, and I can't imagine that's what
the creators intended. It might make sense for a real IDN, but
not for something that's otherwise all ASCII. Almost the same
considerations apply to the fl ligature. People have used the
Web widely for more than 10 years now, and have managed without
the fl ligature.
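
For readers who want to see what these compatibility mappings do, here is
a minimal sketch in Python using the standard unicodedata module. The host
names and the helper name nfkc_host are made up for illustration; the
sketch only shows the NFKC step, not the full Nameprep profile.

```python
import unicodedata


def nfkc_host(host: str) -> str:
    """Apply NFKC, the normalization form that Nameprep builds on."""
    return unicodedata.normalize("NFKC", host)


# Hypothetical host with only the middle 'w' typed as U+FF57
# (FULLWIDTH LATIN SMALL LETTER W).
mixed_width = "w\uff57w.example.com"

# Hypothetical host starting with U+FB02 (LATIN SMALL LIGATURE FL).
with_ligature = "\ufb02ower.example"

print(nfkc_host(mixed_width))    # the fullwidth w folds to ASCII 'w'
print(nfkc_host(with_ligature))  # the ligature expands to 'f' + 'l'
```

A user agent that skips this step sees a non-ASCII label and treats it as
a different name, which is consistent with such links failing in browsers
that do no IDN processing.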

>In the IRI RFC, you were forced to acknowledge the existence of legacy
>HTTP servers that only accept paths or queries in legacy encodings
>like Big5 or iso-8859-1. See the bottom of the 3rd paragraph in
>section 6.4 of RFC 3987 (IRI):
>
>http://ietf.org/rfc/rfc3987.txt

The phrase "were forced to acknowledge" isn't appropriate. The whole
IRI spec is built on the foundation that we don't want to restrict
what URIs people can use; we only made some of them a lot more
legible as IRIs.
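
To illustrate the relation between the legible IRI and the URI underneath
it, here is a rough Python sketch. It is a simplification under stated
assumptions and not the full mapping from RFC 3987: the function name
iri_to_uri and the example address are invented, userinfo and port are
dropped, and the host is converted with Python's built-in "idna" codec
(which implements IDNA 2003), while the path and query are percent-encoded
as UTF-8.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri: str) -> str:
    """Map a simple http IRI to a URI (illustrative, not complete)."""
    parts = urlsplit(iri)
    # Host: each non-ASCII label becomes its Punycode ("xn--") form.
    host = parts.hostname.encode("idna").decode("ascii")
    # Path and query: non-ASCII characters become %-escaped UTF-8 octets.
    # "%" is kept safe so existing escapes are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, host, path, query, parts.fragment))


# Hypothetical IRI with a non-ASCII host label and path segment.
print(iri_to_uri("http://b\u00fccher.example/stra\u00dfe"))
```

An address that is already a plain ASCII URI passes through unchanged, so
nothing people can use today is taken away.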

>A new HTTP RFC or HTML spec may likewise be forced to acknowledge the
>existence of user agents that perform some NFKC mappings for IDNs. I
>could be wrong, of course, since there are so few IDNs on the Web at
>the moment. If we all agree that we need to get rid of these NFKC
>mappings and the implementors actually heed our recommendations, we
>may be able to stamp these out.

My personal preference would be that user agents don't touch the
URIs or IRIs in HTML with respect to NFC/KC, except for cases
like windows-1258 (Vietnamese) where the input is guaranteed to
not translate to NFC one-to-one. On the other hand, what they
do with URIs/IRIs input into the address bar I think is their
business. If they can add http:// and www. to the front, and
.com to the end of what somebody typed, I don't see how we
could prohibit them from doing some normalizations if
they think it helps the user.
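
The windows-1258 case can be shown with a small Python sketch, assuming
Python's standard "cp1258" codec as a stand-in for what a user agent does
when it transcodes a page. The sample letter is chosen for illustration:
windows-1258 has no single byte for U+1EBF, so it is stored as a base
letter plus a combining accent, and a plain byte-to-character decode does
not yield NFC.

```python
import unicodedata

# U+1EBF (e with circumflex and acute) as a Vietnamese page in
# windows-1258 has to store it: U+00EA followed by U+0301.
decomposed = "\u00ea\u0301"

raw = decomposed.encode("cp1258")      # two bytes in the legacy encoding
decoded = raw.decode("cp1258")         # one-to-one transcoding result
nfc = unicodedata.normalize("NFC", decoded)

print(len(decoded), len(nfc))          # code point counts before and after NFC
print(decoded == nfc)                  # the transcoded text is not in NFC
```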

Regards,     Martin.

>This is why I changed the Subject header to "Nameprep input vs
>output". I.e. the Web is currently using Nameprep input in HTTP links
>in HTML.
>
>Erik
>
>On 1/11/07, Michel Suignard <michelsu at windows.microsoft.com> wrote:
>> >
>> >As you know, there are already HTML documents on the Web that include
>> >HTTP URIs that use IDNs (and MSIE 7 supports them). I have heard
>> >rumors that an HTTP RFC update activity may have started. Do you know
>> >whether that is true and whether there is anyone there to discuss the
>> >addition of IDNs to the spec for HTTP URIs (or should I say IRIs)?
>>
>> Hi Erik,
>> In HTTP URIs, IDNs should only exist in Punycode notation (but they may also be %-encoded). If an 'HTTP URI' contains an IDN in native form, you are really dealing with IRIs, which can be handled as presentation forms of the underlying and equivalent URIs. The IRI RFC was drafted to make it easy for user agents to process URIs and IRIs that way. There are many more details in the IRI RFC (3987). I encourage you to read the text and raise any issues you may find. Martin is also on this list and will also be interested, I am sure.
>>
>> I have not been following the discussion about an HTTP RFC update activity. If native IDNs and, in general, non-ASCII characters were added to HTTP, that really relates to the discussion of using IRIs as protocol elements for a new scheme (not really HTTP anymore). Keeping the IRI at the presentation layer as of today, while still maintaining http as we know it for the core protocol/scheme, seems prudent to me, but we are getting seriously OT here.
>>
>> On the other hand, the discussion on this list may have some consequences for IRIs, especially concerning bidirectional issues, as the IRI spec uses the stringprep bidi restrictions almost word for word.
>>
>> Michel
>>

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst at it.aoyama.ac.jp