Touchstones for "Mapping"

Fri Apr 3 04:49:31 CEST 2009

--On Thursday, April 02, 2009 21:26 +0000 "Shawn Steele (???)"
<Shawn.Steele at microsoft.com> wrote:

> There's a REALLY big reason.  Some tools are no fancier, but
> the tool user's don't know the first thing about punycode.
> Some authors using those tools don't even process Latin
> letters very well themselves.

And the user of MSWord (or OpenOffice, if one prefers) doesn't
know the first thing about how formatting information is stored
internally, doesn't want to know, and doesn't think about it.
Most tools in that category actually make it fairly hard for the
user to find out how the files are structured --not a decision
that is to my taste, but I'm not a typical user, and the
decision certainly has not hurt the products in the marketplace.
You appear to be arguing that the internal ACE format of IDNs is
somehow different from that word processing analogy.  I don't
agree.   I think you are describing poor-quality tools and I'm
more than willing to let the marketplace sort those out rather
than believing that standards should cater to them.

When one gets to the authors who "don't even process Latin
letters very well", one also gets to the point that has caused
me to be more optimistic about local mapping actually working
well -- much better than some of the fears of massive
non-interoperability have predicted.  I've observed two models
for letting such authors work (occasionally used together).  In
one, the author selects a function that permits constructing the
IRI or URI -- but that does so in a way in which mapping domain
name fields to A-labels is straightforward (and often invisible).
In the other, something that we can perhaps describe as an LRI
--a localized resource identifier-- is used.  The LRI differs
from an IRI because the protocol identifier and ASCII delimiters
are expressed in a form that is convenient for use with the
local script, character set, and keyboard, using translation,
transliteration, or just some locally-selected convention.
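
A minimal sketch of that kind of local mapping, in Python.  The
scheme table and the function name map_lri are hypothetical, made
up for illustration only.  Python's built-in "idna" codec
implements IDNA2003 (RFC 3490), not whatever this WG ends up
specifying, so treat the A-label step as an approximation.

```python
# Hypothetical local convention: localized spellings of a scheme
# mapped to the real, ASCII protocol identifier.
LOCAL_SCHEMES = {"хттп": "http"}


def map_lri(lri: str) -> str:
    """Map a localized identifier to scheme://A-label-host/tail.

    Sketch only: ports, userinfo, and non-ASCII tails are not
    handled; the tail after the host is passed through untouched.
    """
    scheme, sep, rest = lri.partition("://")
    if not sep:
        raise ValueError("no '://' delimiter found")
    # Replace a localized scheme if we know it, else keep it as typed.
    scheme = LOCAL_SCHEMES.get(scheme, scheme)
    host, slash, tail = rest.partition("/")
    # The built-in codec applies IDNA2003 ToASCII to each label.
    a_host = host.encode("idna").decode("ascii")
    return f"{scheme}://{a_host}{slash}{tail}"


print(map_lri("хттп://bücher.example/"))
```

The point of the sketch is only that both steps (scheme and host)
are table lookups or codec calls that the user never has to see.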

Despite the observation that LRIs are heavily used in some
environments and used with local conventions, we don't see
interoperability problems with them on the global/public
Internet... because we don't see them at all.  Either local
tools translate them into valid URIs or IRIs, or they just don't
work.  Out of curiosity, I just tried entering
   хттп://www.microsoft.com/
(preserving the delimiters and crudely transliterating "http"
into Cyrillic) into IE7.  I don't suppose it will come as a
surprise to you or anyone else that the browser gets confused:
the string isn't recognized as a URI and the browser generates a
URI that looks like

	http://MySearchEngine/search?q=%D1%85%D1%82%D1%82%D0%BF:
	//www.microsoft.com/&rls=com.microsoft:en-us&ie=UTF-8&oe
	=UTF-8
	&startIndex=&startPage=1

In the interest of equal-opportunity browser abuse, I tried the
same thing with FireFox and got 

	http://MySearchEngine/search?ie=UTF-8&oe=UTF-8&sourceid=
	navclient&gfns=1&q=%D1%85%D1%82%D1%82%D0%BF%3A%2F%2Fwww.
	microsoft.com%2F

Both browsers are arguably behaving correctly because neither
URIs nor IRIs are permitted to have anything non-ASCII as a
protocol identifier, so it must be some string I want to search
for.  Right.  Curiously, there is only one significant
difference between the two and that is that, when the search URL
is shown in the URL bar, IE shows it as above (URI form) and
FireFox shows it with the Cyrillic (i.e., in IRI form).   But
when I copied the links and pasted them into this message, I get
the %-escaped UTF-8 for both.
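
The escaped forms in the two query strings above can be
reproduced with Python's standard urllib.parse.  This is only an
illustration of %-escaped UTF-8; it makes no claim about how
either browser builds its search URL internally.

```python
from urllib.parse import quote, unquote

typed = "хттп://www.microsoft.com/"

# Leave ':' and '/' literal, as in the first query string above.
partly_escaped = quote(typed, safe=":/")

# Escape ':' and '/' as well, as in the second query string above.
fully_escaped = quote(typed, safe="")

print(partly_escaped)  # %D1%85%D1%82%D1%82%D0%BF://www.microsoft.com/
print(fully_escaped)   # %D1%85%D1%82%D1%82%D0%BF%3A%2F%2Fwww.microsoft.com%2F

# Both forms decode back to what was typed.
assert unquote(partly_escaped) == typed
assert unquote(fully_escaped) == typed
```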

Again, I'm not suggesting that either of you is doing anything
wrong and, if you were, it would be far outside the scope of
this WG.   But the point is that those users who are typing IRIs
(or extended URIs) in non-Latin environments are already being
subjected to a lot of reinterpretation and transformations of
what they think they are typing; I just don't see the logic of
"we can remap everything else, including the protocol
identifiers, but the U-label form IDNs (and maybe even the
M-label forms) are sacrosanct".

"Don't work" isn't just a browser-like application matter.  I
would appreciate some verification, but I'm assuming that a
creature like 
    хттп://www.google.com/
isn't showing up in the various statistics that Erik and Mark
have shared with us because it just isn't recognized as a valid
link.  If that hypothesis is correct --for my invented Cyrillic
protocol identifier example or for a similar arrangement in any
other script-- it would suggest that Google doesn't feel
committed to indexing and tracking the links from any possible
nonsense that might appear in an HTML file and that, even there,
there are stopping rules.
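
One plausible stopping rule, sketched in Python, is simply the
scheme syntax from RFC 3986, section 3.1.  Whether any real
crawler uses this test is an assumption; the function name
looks_like_link is hypothetical.

```python
import re

# RFC 3986, section 3.1:
#   scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
# The explicit ranges below match ASCII letters and digits only.
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*\Z")


def looks_like_link(candidate: str) -> bool:
    """Return True if the text before the first ':' is a valid scheme."""
    scheme, sep, _ = candidate.partition(":")
    return bool(sep) and SCHEME_RE.match(scheme) is not None


print(looks_like_link("http://www.google.com/"))
print(looks_like_link("хттп://www.google.com/"))
```

Under this rule the ASCII form is accepted and the Cyrillic form
is rejected before any domain name processing happens at all.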

On the other hand, in the interest of search engine abuse as
well as browser abuse, I dropped хттп://www.microsoft.com/
into a search engine.  As far as I can tell, it matched the
string (not as an IRI, but as a string) and gave me links to
several Russian pages that
contain Russian text in Cyrillic and a lot of LRIs -- strings
that start with хттп://.  On the pages I actually looked at,
all of the domain names are LDH-conforming (neither U-labels
nor A-labels) but some of the tails are ASCII and others are
Cyrillic or mixed.  So LRIs are alive and well, they leak
without causing massive interoperability problems, and my guess
that "хттп:" might be used as a substitute for "http:"
wasn't quite as wild as I assumed.
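
For anyone wanting to repeat that inspection mechanically, here
is a rough Python sketch that sorts the labels of a domain name
into the three categories mentioned.  The function classify_label
is hypothetical, and the results say "candidate" deliberately:
deciding whether a label is really a valid A-label or U-label
needs the full IDNA checks, which this does not attempt.

```python
import re

# Letters, digits, hyphen; no leading or trailing hyphen.
LDH_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\Z")


def classify_label(label: str) -> str:
    """Coarse, syntax-only classification of one domain name label."""
    if label.lower().startswith("xn--"):
        return "A-label candidate"
    if LDH_RE.match(label):
        return "LDH"
    if not label.isascii():
        return "U-label candidate"
    return "other"


for host in ("www.microsoft.com", "xn--bcher-kva.example", "bücher.example"):
    print(host, [classify_label(part) for part in host.split(".")])
```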

> If some user sends a native script text message to a friend,
> or puts it in an unaware email app, a simple blog or wiki
> tool, or even a plain text file, then it doesn't help to
> enforce A-labels.  It can even be worse.
> 
> Assuming A-labels were preferred in these environments,
> U-labels would still leak in.

And so would a great deal of trash that doesn't conform to
either URI or IRI rules, as illustrated above.  

> FWIW: were I building html/xml from a text editor, it is still
> likely that I'd choose U-labels so I could debug.  But that's
> me, you may prefer your way.

I think it would depend on what you were trying to debug.   My
limited experience tells me that I'm better off using A-labels
in the actual links and including UTF-8 comments that remind me
what I was trying to do (for the amusement of those who deal
with such things, the XML source for this WG's documents and for RFC
5198 contain just such comments -- for the latter, the RFC
Editor staff was a little surprised, but the generated text file
is plain ASCII, so no rules were broken).
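
A small Python sketch of that habit: the link carries the
A-label and a comment carries the native-script form.  The
function link_with_comment and the <a href> markup are
illustrative assumptions, not what the XML sources mentioned
above actually contain, and the built-in "idna" codec used here
implements IDNA2003.

```python
from xml.sax.saxutils import escape, quoteattr


def link_with_comment(u_host: str, text: str, path: str = "/") -> str:
    """Emit an A-label link preceded by a comment with the native form."""
    if "--" in u_host:
        # XML comments may not contain "--".
        raise ValueError("host cannot be placed in an XML comment")
    a_host = u_host.encode("idna").decode("ascii")
    href = quoteattr(f"http://{a_host}{path}")
    return f"<!-- {u_host} -->\n<a href={href}>{escape(text)}</a>"


print(link_with_comment("bücher.example", "Books"))
```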

best,
    john