LRIs

Fri Apr 3 21:45:38 CEST 2009

I hope we can keep this topic from becoming just another
opportunity to have the WG go off in the weeds.  However, some
of it calls into question some of the comments on the list about
strictly-local mapping --mapping that doesn't escape from local
machines into the wider environment-- and how much of a problem
it causes, so perhaps it is worth a little more exploration.

More inline below.

--On Friday, April 03, 2009 08:22 -0700 Erik van der Poel
<erikv at google.com> wrote:

> Hi John,
> 
> That was certainly an interesting email about LRIs.
> 
> On Thu, Apr 2, 2009 at 7:49 PM, John C Klensin
> <klensin at jck.com> wrote:
>> "Don't work" isn't just a browser-like application matter.  I
>> would appreciate some verification, but I'm assuming that a
>> creature like
>>    хттп://www.google.com/
>> isn't showing up in the various statistics that Erik and Mark
>> have shared with us because it just isn't recognized as a
>> valid link.
> 
> That's right. We only recognize the ASCII protocol identifiers
> (such as "http:"). If anyone knows of browsers that support
> non-ASCII protocol identifiers, that would be interesting to
> know.

I don't know whether it is done with patches, odd localizations,
or plugins, but at least one of the pages I found that used
хттп:// strongly suggests to me that its authors expected
lines to be copied out of it and pasted into a browser (even
though they decided to not try to make them links, for whatever
reason).  FWIW, the page is
  http://videodom.net/index.php?showtopic=33
if you want to look at it.

Of course, a browser add-on that would remap strings that
appeared in the URI input line (and hence work with pasting of
those strings into that line) would presumably be lots easier to
arrange than one that reached down into and rewrote href
arguments.
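
To make the input-line remapping concrete, here is a minimal
sketch in Python.  It is not a description of any existing
add-on: the table, its single entry (the invented Cyrillic
example from this thread), and the function name are all
assumptions made for illustration.

```python
# Illustrative table only: the single entry is the invented Cyrillic
# example from this thread.  A real add-on would carry its own
# locale-specific list of protocol-identifier spellings.
LOCAL_SCHEMES = {"хттп": "http"}


def remap_typed_uri(typed):
    """Rewrite a locally spelled protocol identifier to its ASCII form.

    Strings without "://" or with an unknown identifier are returned
    unchanged, so ordinary URIs pass through untouched.
    """
    scheme, sep, rest = typed.strip().partition("://")
    if not sep:
        return typed
    ascii_scheme = LOCAL_SCHEMES.get(scheme.lower())
    if ascii_scheme is None:
        return typed
    return ascii_scheme + sep + rest


print(remap_typed_uri("хттп://www.google.com/"))
```

Because the rewrite happens only on what the user types or
pastes, nothing in the page itself (href arguments included) is
touched, which is what keeps the mapping strictly local.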

I'm fairly sure that I've also seen users in Korea typing a
sequence of Hangul characters instead of "http:" and, at least
while 3721 was still alive and separate from Yahoo!, users in
China typing "LRIs" with all of the letters in Han characters.
I don't know what is actually done in the various Arabic-script
countries, but I'd assume, the pain and unpredictability of LtoR
and RtoL switching being what it is, that someone has at least
thought about Arabic-equivalent forms for protocol identifiers
and full URIs, not just the IRI approach (those LRIs would
certainly not be valid IRIs or convertible to URIs using the IRI
rules).

There are people following the list who could provide better
answers to these questions than what I think I've observed
watching people type.  In addition, you have offices in at least
a representative sample of the relevant countries and could try
asking them.

>>  If that hypothesis is correct --for my invented Cyrillic
>> protocol identifier example or for a similar arrangement in
>> any other script-- it would suggest that Google doesn't feel
>> committed to indexing and tracking the links from any possible
>> nonsense that might appear in an HTML file and that, even
>> there, there are stopping rules.

> Yes, there are stopping rules (or at least, we have
> discussions about where to stop). I won't bore the list with
> details, but there are many parsing, canonicalization,
> escaping and encoding issues.

I don't think we need to know.  The concern that caused me to
raise it was that I interpreted some of your remarks and Mark's
to mean that you took anything that could possibly be construed
as a URI, IRI, or domain name as a link that you were obligated
to index.  That discussion occurred in the context of things
that I believed were silly enough that indexing them was
encouraging bad behavior, not for the case of strings that were
plausibly LRIs.  But those LRIs that use transliterated protocol
identifiers in what would otherwise be IRI syntax are pretty
easy to spot (c..c://c..c, where "c" is a letter of some sort,
would make a decent heuristic), so the fact that you haven't
concluded that you need to index them is fairly significant.
Presumably it says something about how reliably the
localization environments manage to translate/remap them
appropriately without significant leakage.
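
As a rough illustration of the c..c://c..c heuristic, the
following Python sketch picks out protocol-identifier-like
prefixes that contain non-ASCII letters.  The pattern and the
function name are assumptions of mine, not anything Google is
known to run, and the pattern is deliberately loose (it does not
try to validate the rest of the string).

```python
import re

# Hypothetical pattern: a run of letters, then "://", then a letter.
# In Python's re module, [^\W\d_] matches any Unicode letter.
LRI_CANDIDATE = re.compile(r"([^\W\d_]+)://(?=[^\W\d_])")


def non_ascii_schemes(text):
    """Return scheme-like prefixes containing a non-ASCII letter."""
    return [
        m.group(1)
        for m in LRI_CANDIDATE.finditer(text)
        if not m.group(1).isascii()
    ]


sample = "see хттп://www.google.com/ and http://example.com/"
print(non_ascii_schemes(sample))
```

Run over running text as well as over href values, something of
this sort would give a first count of how often such strings
leak out of their local environments.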

>> On the other hand, in the interest of search engine abuse as
>> well as browser abuse, I dropped
>...
> I've written a TODO to write a mapreduce to look for non-ASCII
> protocol identifiers. I suspect there aren't very many in
> URI/IRI slots like href.

I believe that.  But, given that they appear in running text and
in at least some user input, few appearances in explicit URI/IRI
slots would indicate that either the page creation tools or the
page authors are a lot smarter about keeping local conventions
local than some of the comments on this list have given them
credit for, wouldn't it?

      john