HTTP and IDN, was RE: Nameprep input vs output

Sat Jan 13 18:55:25 CET 2007

Hi Martin,

Thanks for the reply.

On 1/12/07, Martin Duerst <duerst at it.aoyama.ac.jp> wrote:
> I basically agree with Michel. I wouldn't go so far as calling
> using raw octets in HTTP request URIs a 'new protocol', because

I'm not sure where you're getting this from. Did someone say something
about raw octets?

> there is at least some anecdotal evidence that it works currently
> in some cases, but my understanding is that the current effort
> on the HTTP spec is to move that spec to full IETF Standard,
> which includes late and drastic changes.

I heard that they are not planning to make any drastic changes.

> The more I thought about the full-width w and the fl ligature examples,
> the more I came to the conclusion that these are just garbage that
> we should ignore. Writing just one of the three 'w' in www. with
> a full-width character can only have happened by accident. And it
> probably wasn't tested at all, because on IE6 (still the most deployed
> browser), it just won't work, and I can't imagine that's what
> the creators intended. It might make sense for a real IDN, but
> not for something that's otherwise all ASCII. Almost the same
> considerations apply to the fl ligature. People have used the
> Web widely for more than 10 years now, and have managed without
> the fl ligature.

Yes, these are accidents/garbage. The problem is that MSIE 7, Firefox
and the Verisign i-Nav plugin for MSIE 6 all accept this garbage. As
you know from the history of HTML, when user agents are too liberal in
what they accept, garbage can become entrenched and difficult to
remove.
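
To make the "garbage" concrete: Nameprep's mapping step includes NFKC
normalization, which folds both characters into plain ASCII. A minimal
sketch in Python, assuming only the standard unicodedata module; the
function name compat_fold and the example.com host names are mine, for
illustration only, and this is not the full Nameprep profile:

```python
import unicodedata

def compat_fold(host: str) -> str:
    """Apply NFKC only (hypothetical helper; not full Nameprep)."""
    return unicodedata.normalize("NFKC", host)

# One of the three w's is U+FF57 FULLWIDTH LATIN SMALL LETTER W.
fullwidth = "w\uff57w.example.com"
# The label starts with U+FB02 LATIN SMALL LIGATURE FL.
ligature = "\ufb02ower.example.com"

print(compat_fold(fullwidth))  # the fullwidth w becomes an ASCII w
print(compat_fold(ligature))   # the ligature becomes the two letters f, l
```

A user agent that applies this folding before resolving the name will
happily follow such a link; one that does not will fail on it.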

> >In the IRI RFC, you were forced to acknowledge the existence of legacy
> >HTTP servers that only accept paths or queries in legacy encodings
> >like Big5 or iso-8859-1. See the bottom of the 3rd paragraph in
> >section 6.4 of RFC 3987 (IRI):
> >
> >http://ietf.org/rfc/rfc3987.txt
>
> The "were forced to acknowledge" isn't appropriate. The whole
> IRI spec is built on the foundation that we don't want to restrict
> what URIs people can use; we only made some of them a lot more
> legible as IRIs.

Yes, it's great work. I read the entire RFC and will be sending
comments and errata. But I stand by my assertion that you felt you had
to mention the legacy character encodings in URIs (in several parts of
the IRI RFC), because otherwise implementors might blindly
percent-decode, convert to UTF-8 and percent-encode again, yielding a
URI that does not work when sent to the legacy HTTP server.
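
A minimal sketch of that failure in Python, assuming a legacy server
that stores a path in iso-8859-1; the path /caf%E9 and the function
name blind_reencode are made up for illustration:

```python
from urllib.parse import quote, unquote_to_bytes

def blind_reencode(path: str, assumed: str = "iso-8859-1") -> str:
    """Percent-decode, convert to UTF-8, percent-encode again.

    Hypothetical helper showing what a naive implementation might do.
    """
    raw = unquote_to_bytes(path)        # b'/caf\xe9'
    text = raw.decode(assumed)          # interpret octets in the assumed charset
    return quote(text.encode("utf-8"))  # re-escape the UTF-8 octets

original = "/caf%E9"                    # what the legacy server expects
rewritten = blind_reencode(original)

print(original, "->", rewritten)
print("same octets on the wire:", original == rewritten)
```

The rewritten URI carries two escaped octets where the original had
one, so a server that compares raw octets against an iso-8859-1 path
will not find the resource.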

> My personal preference would be that user agents don't touch the
> URIs or IRIs in HTML with respect to NFC/KC, except for cases
> like windows-1258 (Vietnamese) where the input is guaranteed to
> not translate to NFC one-to-one. On the other hand, what they

Windows-1258 appears not to be so common on the Web, but I agree that
there are normalization issues there.
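
A small sketch of the issue in Python, assuming the standard cp1258
codec (Python's name for windows-1258); the sample character and the
helper name fits_cp1258 are my own choices for illustration:

```python
import unicodedata

def fits_cp1258(s: str) -> bool:
    """Return True if s can be encoded in windows-1258 (hypothetical helper)."""
    try:
        s.encode("cp1258")
        return True
    except UnicodeEncodeError:
        return False

# windows-1258 writes many Vietnamese letters as a base letter plus a
# combining tone mark: here e-circumflex followed by U+0301.
decomposed = "\u00ea\u0301"
raw = decomposed.encode("cp1258")        # two octets in the legacy charset

transcoded = raw.decode("cp1258")        # straight transcoding to Unicode
composed = unicodedata.normalize("NFC", transcoded)

print("transcoded is NFC:", transcoded == composed)
print("NFC form fits windows-1258:", fits_cp1258(composed))
```

Straight transcoding gives two code points, while NFC gives one
precomposed code point that windows-1258 cannot represent, so the two
forms are not interchangeable without an explicit normalization step.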

Anyway, my point is that if implementors don't soon change their
implementations to reject full-width w and the like in URIs in HTML,
we may eventually feel obliged to document this by-then-legacy
behavior in a descriptive spec (as opposed to a prescriptive one).

> do with URIs/IRIs input into the address bar I think is their
> business. If they can add http:// and www. to the front, and
> .com to the end of what somebody typed, I don't see why we
> would be able to prohibit them doing some normalizations if
> they think it helps the user.

I agree.

Erik