Erik van der Poel
erikv at google.com
Fri Apr 3 17:22:48 CEST 2009
That was certainly an interesting email about LRIs.
On Thu, Apr 2, 2009 at 7:49 PM, John C Klensin <klensin at jck.com> wrote:
> "Don't work" isn't just a browser-like application matter. I
> would appreciate some verification, but I'm assuming that a
> creature like
> isn't showing up in the various statistics that Erik and Mark
> have shared with us because it just isn't recognized as a valid
That's right. We only recognize the ASCII protocol identifiers (such
as "http:"). If anyone knows of browsers that support non-ASCII
protocol identifiers, that would be interesting to know.
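To illustrate what "only ASCII protocol identifiers" can mean in practice, here is a minimal sketch that accepts a leading scheme only if it fits the RFC 3986 scheme grammar. The names ASCII_SCHEME and ascii_scheme are made up for this example, and nothing here is meant to describe how Google's crawler or any browser actually does it.

```python
import re

# RFC 3986 scheme grammar: ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
# Hypothetical helper names; an illustration, not any crawler's real code.
ASCII_SCHEME = re.compile(r"^([A-Za-z][A-Za-z0-9+.\-]*):")


def ascii_scheme(candidate):
    """Return the lowercased ASCII scheme at the start of candidate, or None."""
    match = ASCII_SCHEME.match(candidate)
    return match.group(1).lower() if match else None
```

With this check, "http://www.example.com/" yields "http", while a string starting with "хттп:" yields None and would simply not be treated as a link.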
> If that hypothesis is correct --for my invented Cyrillic
> protocol identifier example or for a similar arrangement in any
> other script-- it would suggest that Google doesn't feel
> committed to indexing and tracking the links from any possible
> nonsense that might appear in an HTML file and that, even there,
> there are stopping rules.
Yes, there are stopping rules (or at least, we have discussions about
where to stop). I won't bore the list with details, but there are many
parsing, canonicalization, escaping and encoding issues.
> On the other hand, in the interest of search engine abuse as
> well as browser abuse, I dropped хттп://www.microsoft.com/.
> As far as I can tell, it matched the string (not an IRI, but as
> a string) and gave me links to several Russian pages that
> contain Russian text in Cyrillic and a lot of LRIs -- strings
> that start in хттп://. On the pages I actually looked at,
> all of the domain names are LDH-conforming (not either U-labels
> or A-labels) but some of the tails are ASCII and others are
> Cyrillic or mixed. So LRIs are alive and well, they leak
> without causing massive interoperability problems, and my guess
> that "хттп:" might be used as a substitute for "http:"
> wasn't quite as wild as I assumed.
I've added a TODO to write a mapreduce that looks for non-ASCII
protocol identifiers. I suspect there aren't very many in URI/IRI
slots like href.
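A small-scale sketch of such a count might look like the following. It is not a real mapreduce job: the "map" step extracts non-ASCII scheme-like prefixes from href attributes of one page, and the "reduce" step sums the per-page counts. The class and function names are hypothetical, and treating everything before "://" as the protocol identifier is an assumption made for this example.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumption: a "protocol identifier" is whatever precedes "://" at the start
# of the attribute value, provided it contains no whitespace, "/", "?", "#", ":".
SCHEME_LIKE = re.compile(r"([^\s/?#:]+)://")


class HrefSchemeCounter(HTMLParser):
    """Counts non-ASCII scheme-like prefixes found in href attributes."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name != "href" or not value:
                continue
            match = SCHEME_LIKE.match(value.strip())
            if match and not match.group(1).isascii():
                self.counts[match.group(1)] += 1


def map_page(html):
    """'Map' step: one page in, a Counter of non-ASCII schemes out."""
    parser = HrefSchemeCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def count_non_ascii_schemes(pages):
    """'Reduce' step: sum the per-page Counters."""
    return sum((map_page(page) for page in pages), Counter())
```

Restricting the scan to href values, rather than to all page text, is what separates "appears in a URI/IRI slot" from "appears as a string somewhere on the page".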