LRIs

Sun Apr 5 18:26:27 CEST 2009

On Fri, Apr 3, 2009 at 12:45 PM, John C Klensin <klensin at jck.com> wrote:
> I hope we can keep this topic from becoming just another
> opportunity to have the WG go off in the weeds.  However, some
> of it brings some of the comments on the list that
> strictly-local mapping --mapping that doesn't escape from local
> machines into the wider environment and how much of a problem it
> causes-- into question, so perhaps it is worth a little more
> exploration.

I do see how it is relevant.

> --On Friday, April 03, 2009 08:22 -0700 Erik van der Poel
> <erikv at google.com> wrote:
>> That was certainly an interesting email about LRIs.
>>
>> On Thu, Apr 2, 2009 at 7:49 PM, John C Klensin
>> <klensin at jck.com> wrote:
>>> "Don't work" isn't just a browser-like application matter.  I
>>> would appreciate some verification, but I'm assuming that a
>>> creature like
>>>    хттп://www.google.com/
>>> isn't showing up in the various statistics that Erik and Mark
>>> have shared with us because it just isn't recognized as a
>>> valid link.
>>
>> That's right. We only recognize the ASCII protocol identifiers
>> (such as "http:"). If anyone knows of browsers that support
>> non-ASCII protocol identifiers, that would be interesting to
>> know.
>
> I don't know whether it is done with patches, odd localizations,
> or plugins, but at least one of the pages I found that used
> хттп:// strongly suggests to me that its authors expected
> lines to be copied out of it and pasted into a browser (even
> though they decided to not try to make them links, for whatever
> reason).  FWIW, the page is
>  http://videodom.net/index.php?showtopic=33
> if you want to look at it.

Yes, I looked at that page after your previous email. I also ran the
introductory text through Google's Russian translator:

"Often there is no need to use a paid software at work, you can do and
free. And some samples of this software even surpass a similar
commercial product.

Here is a small collection, gathered by me in инете may be useful to-string ...
(http will need to change to http)"

That last parenthetical instruction sounds like the author is
expecting the reader to manually change the Cyrillic хттп to http, but
I could be wrong.

> Of course, a browser add-on that would remap strings that
> appeared in the URI input line (and hence work with pasting of
> those strings into that line) would presumably be lots easier to
> arrange than one that reached down into and rewrote href
> arguments.

That seems likely.
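
To make the input-line case concrete, here is a minimal sketch of such a remapping step. It assumes a hypothetical lookup table of localized scheme spellings. Only the Cyrillic "хттп" comes from this thread; the table, the function name and the behavior for unknown schemes are illustrative assumptions, not a description of any existing add-on.

```python
# Hypothetical table of localized spellings of protocol identifiers.
# Only "хттп" appears in this thread; the mapping itself is an assumption.
LOCAL_SCHEMES = {
    "хттп": "http",
}


def remap_scheme(text):
    """Rewrite a localized scheme at the start of an address-bar string."""
    scheme, sep, rest = text.partition("://")
    if not sep:
        return text  # no "://" present, leave the input untouched
    ascii_scheme = LOCAL_SCHEMES.get(scheme.strip().lower())
    if ascii_scheme is None:
        return text  # unknown (or already ASCII) scheme, leave as typed
    return ascii_scheme + sep + rest
```

Because the rewrite only touches a single pasted or typed string, it never has to parse HTML, which is why it looks simpler than rewriting href arguments.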

> I'm fairly sure that I've also seen users in Korea typing a
> sequence of Hangul characters instead of "http:" and, at least
> while 3721 was still alive and separate from Yahoo!, users in
> China typing "LRIs" with all of the letters in Han characters.
> I don't know what is actually done in the various Arabic-script
> countries, but I'd assume, the pain and unpredictability of LtoR
> and RtoL switching being what it is, that someone has at least
> thought about Arabic-equivalent forms for protocol identifiers
> and full URIs, not just the IRI approach (those LRIs would
> certainly not be valid IRIs or convertible to URIs using the IRI
> rules).
>
> There are people following the list who could provide better
> answers to these questions than what I think I've observed
> watching people type.  In addition, you have offices in at least
> a representative sample of the relevant countries and could try
> asking them.

I will ask around.

>>>  If that hypothesis is correct --for my invented Cyrillic
>>> protocol identifier example or for a similar arrangement in
>>> any other script-- it would suggest that Google doesn't feel
>>> committed to indexing and tracking the links from any possible
>>> nonsense that might appear in an HTML file and that, even
>>> there, there are stopping rules.
>
>> Yes, there are stopping rules (or at least, we have
>> discussions about where to stop). I won't bore the list with
>> details, but there are many parsing, canonicalization,
>> escaping and encoding issues.
>
> I don't think we need to know.  The concern that caused me to
> raise it was that I interpreted some of your remarks and Mark's
> to mean that you took anything that could possibly be construed
> as a URI, IRI, or domain name as a link that you were obligated
> to index.

We may be talking about two different things here: indexing and
auto-linking. As far as I know, Google does not index URLs found in
the plain text portions of HTML. We only index URLs in standard
href-like locations, such as <a href="...">, <img src="..."> and so
on.
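
As a rough illustration of the distinction (and not a description of Google's actual pipeline), the sketch below collects URLs only from href-like attributes and ignores anything that merely appears in running text. The set of attribute names and the helper names are assumptions made for the example.

```python
from html.parser import HTMLParser

# Attribute names treated as "href-like" here are an assumption.
LINK_ATTRS = {"href", "src"}


class LinkCollector(HTMLParser):
    """Collect attribute values from href-like locations only."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        for name, value in attrs:
            if name in LINK_ATTRS and value:
                self.links.append(value)


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

A URL that appears only as text between tags never reaches `handle_starttag`, so it is not collected.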

There have also been discussions on this list about "auto-linking"
(for lack of a better term), where apps such as email user agents
automatically detect URIs, IRIs and domain names in running (plain)
text, and then turn them into clickable links.

Of course, both of these typically take advantage of a stock IDNA
library, which has Nameprep and Punycode rolled into one package, so
you get the mapping whether you want it or not.
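
One way to see the "whether you want it or not" effect is with Python's built-in "idna" codec, which implements IDNA2003 (Nameprep followed by Punycode). This is only a sketch using one stock library; other libraries may behave differently, and the wrapper name is made up for the example.

```python
def to_ascii(domain):
    """Convert a domain name with Python's stock IDNA2003 codec.

    Nameprep mapping (case folding, NFKC) is applied to non-ASCII labels
    before Punycode encoding; the caller cannot switch it off.
    """
    return domain.encode("idna")


# The capital "B" is folded to lowercase before encoding.
print(to_ascii("Bücher.example"))
# Fullwidth letters are mapped to their ASCII equivalents.
print(to_ascii("ｅｘａｍｐｌｅ.com"))
```

An application that calls such a library gets the mapped form back and has no direct way of knowing what the user originally typed.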

> That discussion occurred in the context of things
> that I believed were silly enough that indexing them was
> encouraging bad behavior

I can see why you think that auto-linkers encourage bad
behavior, but I also feel that ordinary users are unlikely to type
perfect URIs into plain text, and that ordinary application
programmers are quite likely to perform auto-linking as a "favor" to
the user.

Likewise, browsers and search engines are fairly lenient about URIs
and IRIs, and once you start down that path, it seems difficult to get
the programmers to make their apps stricter.

> not for the case of strings that were
> plausibly LRIs.  But those LRIs that use transliterated protocol
> identifiers in what would otherwise be IRI syntax are pretty
> easy to spot (c..c://c..c, where "c" is a letter of some sort,
> would make a decent heuristic), so the fact that you haven't
> concluded that you need to index them is fairly significant...
> presumably about the prevalence with which the localization
> environments manage to translate/remap them appropriately
> without significant leakage.

We haven't yet determined that those LRIs are automatically remapped
by any software. I'm not ruling it out, though.
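
For anyone who wants to look for such strings in running text, here is a minimal sketch of the c..c://c..c heuristic. It makes several assumptions that are not in the original suggestion: tokens are split on whitespace, trailing punctuation such as "/" is ignored, and a candidate is only reported when its protocol identifier contains non-ASCII characters.

```python
def looks_like_lri(token):
    """Apply the c..c://c..c test, where "c" is any Unicode letter."""
    scheme, sep, rest = token.partition("://")
    rest = rest.rstrip("/.,;:)")  # assumption: ignore trailing punctuation
    if not sep or not scheme or not rest:
        return False
    ends = (scheme[0], scheme[-1], rest[0], rest[-1])
    if not all(ch.isalpha() for ch in ends):
        return False
    # Plain ASCII schemes such as "http" are ordinary URIs/IRIs, not LRIs.
    return not scheme.isascii()


def find_lri_candidates(text):
    return [tok for tok in text.split() if looks_like_lri(tok)]
```

Running this over a crawl sample would give a rough count of how often transliterated protocol identifiers leak into pages, though it says nothing about whether any software remaps them.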

Erik