HTTP and IDN, was RE: Nameprep input vs output

Sun Jan 14 08:48:53 CET 2007

At 02:55 07/01/14, Erik van der Poel wrote:
>Hi Martin,
>
>Thanks for the reply.
>
>On 1/12/07, Martin Duerst <duerst at it.aoyama.ac.jp> wrote:
>> I basically agree with Michel. I wouldn't go so far as calling
>> using raw octets in HTTP request URIs a 'new protocol', because
>
>I'm not sure where you're getting this from. Did someone say something
>about raw octets?

I made some leaps. Michel wrote:

>>>>
I have not been following discussion about an HTTP RFC update activity. If native IDN and in general non ASCII characters were added to HTTP, it really relates to the discussion of using IRI as protocol elements for a new scheme (not really HTTP anymore).
>>>>

So I was changing 'new scheme' to 'new protocol' (not intended),
but I was also changing 'non ASCII characters' to 'raw octets',
which was intended, because HTTP basically is working with
octets, not characters, for things such as path and query part.
Of course, an update could say that these have to use %-encoding
when they are not UTF-8, but can use raw octets when they are
UTF-8.
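
To make the octets-versus-characters point concrete, here is a minimal
Python sketch. It is only an illustration, not text from any HTTP spec:
the path "/café" and the helper name is_utf8 are made up for this
example. It shows that the same characters become different octets
(and different %-escapes) depending on the encoding, and how a
recipient could test whether the octets happen to be valid UTF-8.

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical path segment containing one non-ASCII character (U+00E9).
path = "/caf\u00e9"

# The same characters produce different octets in different encodings.
utf8_octets = path.encode("utf-8")          # b'/caf\xc3\xa9'
latin1_octets = path.encode("iso-8859-1")   # b'/caf\xe9'

# %-encoding works on octets, so the escaped forms differ as well.
utf8_escaped = quote(utf8_octets)       # '/caf%C3%A9'
latin1_escaped = quote(latin1_octets)   # '/caf%E9'


def is_utf8(octets: bytes) -> bool:
    """Return True if the octets form a well-formed UTF-8 sequence."""
    try:
        octets.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A recipient only sees octets; it can check for UTF-8 but cannot
# know which legacy encoding non-UTF-8 octets were meant to be in.
print(utf8_escaped, is_utf8(unquote_to_bytes(utf8_escaped)))
print(latin1_escaped, is_utf8(unquote_to_bytes(latin1_escaped)))
```

Under the rule sketched above, only the first form could be sent as
raw octets; the second would have to stay %-encoded.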

>> there is at least some anecdotal evidence that it works currently
>> in some cases, but my understanding is that the current effort
>> on the HTTP spec is to move that spec to full IETF Standard,
>> which includes late and drastic changes.
>
>I heard that they are not planning to make any drastic changes.

Sorry, it should have read 'excludes', not 'includes'.

>> The more I thought about the full-width w and the fl ligature examples,
>> the more I came to the conclusion that these are just garbage that
>> we should ignore. Writing just one of the three 'w' in www. with
>> a full-width character can only have happened by accident. And it
>> probably wasn't tested at all, because on IE6 (still the most deployed
>> browser), it just won't work, and I can't imagine that's what
>> the creators intended. It might make sense for a real IDN, but
>> not for something that's otherwise all ASCII. Almost the same
>> considerations apply to the fl ligature. People have used the
>> Web widely for more than 10 years now, and have managed without
>> the fl ligature.
>
>Yes, these are accidents/garbage. The problem is that MSIE 7, Firefox
>and the Verisign i-Nav plugin for MSIE 6 all accept this garbage. As
>you know from the history of HTML, when user agents are too liberal in
>what they accept, garbage can become entrenched and difficult to
>remove.

Yes. So let's try and move forward with our work.
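
For readers who want to see why liberal user agents end up accepting
these names: Nameprep applies NFKC normalization (plus case folding),
and NFKC maps both the full-width w and the fl ligature to plain
ASCII. The following Python sketch shows only the NFKC step, using
made-up host names under example.com; it is not a full Nameprep
implementation.

```python
import unicodedata

# Hypothetical host names, for illustration only.
host_fullwidth = "w\uff57w.example.com"   # middle 'w' is U+FF57 FULLWIDTH LATIN SMALL LETTER W
host_ligature = "\ufb02ower.example.com"  # starts with U+FB02 LATIN SMALL LIGATURE FL

for host in (host_fullwidth, host_ligature):
    normalized = unicodedata.normalize("NFKC", host)
    # Before normalization the name is not pure ASCII; afterwards it is,
    # so it resolves like an ordinary ASCII host name.
    print(host.isascii(), normalized, normalized.isascii())
```

A user agent that wanted to reject such input rather than fold it
could compare the name before and after normalization and refuse
any otherwise-ASCII name that changes.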

>> >In the IRI RFC, you were forced to acknowledge the existence of legacy
>> >HTTP servers that only accept paths or queries in legacy encodings
>> >like Big5 or iso-8859-1. See the bottom of the 3rd paragraph in
>> >section 6.4 of RFC 3987 (IRI):
>> >
>> >http://ietf.org/rfc/rfc3987.txt
>>
>> The "were forced to acknowledge" isn't appropriate. The whole
>> IRI spec is built on the foundation that we don't want to restrict
>> what URIs people can use; we only made some of them a lot more
>> legible as IRIs.
>
>Yes, it's great work. I read the entire RFC and will be sending
>comments and errata.

Thanks. Please send them to public-iri at w3.org.

>But I stand by my assertion that you felt you had
>to mention the legacy character encodings in URIs (in several parts of
>the IRI RFC), because otherwise implementors might blindly
>percent-decode, convert to UTF-8 and percent-encode again, yielding a
>URI that does not work when sent to the legacy HTTP server.

Blindly percent-decoding, converting to UTF-8 (from what encoding?),
and percent-encoding again would indeed be a bad idea, even if there
were no legacy character encodings used in URIs.
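
A small Python sketch of why this goes wrong, under stated
assumptions: the function name blind_transcode and the path are
invented for the example, and the server is assumed to accept only
Big5-encoded paths. Whatever source encoding the implementation
guesses, the re-encoded URI no longer carries the octets the server
expects.

```python
from urllib.parse import quote, unquote_to_bytes


def blind_transcode(uri_path: str, guessed_encoding: str) -> str:
    """Percent-decode, reinterpret as guessed_encoding, re-encode as UTF-8.

    This is the 'blind' procedure under discussion, shown only to
    demonstrate its effect; it is not a recommended algorithm.
    """
    octets = unquote_to_bytes(uri_path)
    text = octets.decode(guessed_encoding)
    return quote(text.encode("utf-8"))


# Path as a hypothetical Big5-only server expects it.
legacy_path = quote("/\u4e2d\u6587".encode("big5"))

# Wrong guess: the octets decode without error but yield mojibake.
wrong_guess = blind_transcode(legacy_path, "iso-8859-1")

# Right guess: the characters survive, but the octets are now UTF-8,
# which the legacy server does not understand.
right_guess = blind_transcode(legacy_path, "big5")

print(legacy_path)
print(wrong_guess)
print(right_guess)
```

In both cases the result differs from the original path, so the
request fails on the legacy server; leaving existing %-escapes
untouched avoids the problem.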

>> My personal preference would be that user agents don't touch the
>> URIs or IRIs in HTML with respect to NFC/KC, except for cases
>> like windows-1258 (Vietnamese) where the input is guaranteed to
>> not translate to NFC one-to-one. On the other hand, what they
>
>Windows-1258 appears not to be so common on the Web,

Interesting, and glad to hear that. Do you have statistics?

>but I agree that there are normalization issues there.

I'd be extremely surprised if anybody disagreed with us :-).
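
As an illustration of the windows-1258 issue, here is a minimal
Python sketch using Python's cp1258 codec. The sample letter and the
helper name fits_cp1258 are chosen for this example only.
Windows-1258 writes many Vietnamese letters as a base letter followed
by a combining tone mark, so decoded text is typically not in NFC,
and the NFC form may not be representable in windows-1258 at all.

```python
import unicodedata

# e-circumflex (U+00EA) followed by COMBINING DOT BELOW (U+0323):
# the way this Vietnamese letter is written in windows-1258.
decomposed = "\u00ea\u0323"
octets = decomposed.encode("cp1258")          # two octets, one per code point
assert octets.decode("cp1258") == decomposed  # round-trips unchanged

# NFC composes the pair into a single precomposed code point.
composed = unicodedata.normalize("NFC", decomposed)


def fits_cp1258(text: str) -> bool:
    """Return True if text can be encoded in windows-1258 as-is."""
    try:
        text.encode("cp1258")
    except UnicodeEncodeError:
        return False
    return True


print(len(decomposed), len(composed))
print(fits_cp1258(decomposed), fits_cp1258(composed))
```

So text decoded from windows-1258 changes under NFC, and a simple
decode/normalize/encode round trip cannot be assumed to be lossless
or even possible.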

Regards,     Martin.

>Anyway, my point is that, if the implementors don't change their
>implementations to reject full-width w and the like in URIs in HTML
>soon, we may eventually find that we feel that we have to describe
>this by-then-legacy behavior in some descriptive spec (as opposed to a
>prescriptive spec).
>
>> do with URIs/IRIs input into the address bar I think is their
>> business. If they can add http:// and www. to the front, and
>> .com to the end of what somebody typed, I don't see why we
>> would be able to prohibit them doing some normalizations if
>> they think it helps the user.
>
>I agree.
>
>Erik

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst at it.aoyama.ac.jp