Touchstones for "Mapping"

Thu Apr 2 23:26:53 CEST 2009

There's a REALLY big reason.  Some tools are no fancier, but the tool users don't know the first thing about punycode.  Some authors using those tools don't even process Latin letters very well themselves.

If some user sends a native script text message to a friend, or puts it in an unaware email app, a simple blog or wiki tool, or even a plain text file, then it doesn't help to enforce A-labels.  It can even be worse.

Assuming A-labels were preferred in these environments, U-labels would still leak in.

FWIW: were I building HTML/XML from a text editor, it is still likely that I'd choose U-labels so I could debug.  But that's me; you may prefer your way.

-----Original Message-----
From: John C Klensin <klensin at jck.com>
Sent: Thursday, April 02, 2009 4:04 PM
To: Martin J. Dürst <duerst at it.aoyama.ac.jp>; Harald Tveit Alvestrand <harald at alvestrand.no>
Cc: Andrew Sullivan <ajs at shinkuro.com>; idna-update at alvestrand.no <idna-update at alvestrand.no>
Subject: Re: Touchstones for "Mapping"

--On Tuesday, March 31, 2009 16:57 +0900 "Martin J. Dürst"
<duerst at it.aoyama.ac.jp> wrote:

> I very much agree with Harald. We are working on IDNs because
> we want humans to be able to easily read domain names in
> their script. Storing them as A-Labels when there is a
> reasonable chance that humans will have a look at them (e.g.
> in HTML or XML source, email source,...) is against the very
> intent of IDNs. Authors are humans, too, even if they work on
> plain text :-!

Martin,

Let me try a different take on this from some of the other
comments that followed your note...

I don't know if I'm a "user" or not -- it is another term that
we throw around without a clear definition and thereby end up in
a state of confusion about what different ones of us are talking
about -- and one that, to your credit, you didn't use, but
others in this thread have.  I'm certainly an occasional author
of content that uses domain names (including IDNs) and would
claim to be human.

I am, however, apparently in a small minority of those who
compose a lot of XML or HTML pages: I do my editing with an
emacs clone, enhanced by XML/HTML extensions that keep track of
matching elements and facilitate pretty-printing, but very
little else.  I'm frankly too cheap to spring for a
well-supported, easy-to-use, and well-documented HTML or XML
editor and too lazy to learn and adapt to one of the others (I
maintain zone files with emacs too).  But the norm is that
humans use special applications to cope with these files,
applications that have buttons for inserting URIs and greater or
lesser degrees of validation and completion support for those
functions.  The existence and wide use of those tools turns the
issue you are addressing back into a presentation problem --
there is no plausible reason why the tools should not be capable
of putting strings with A-labels into the relevant files while
letting the author type in native-character Unicode strings and
to see those strings.  There is also no plausible reason why
those tools cannot validate links (whichever form they are in)
and, if I recall, some of them do exactly that.

Having those tools turns the issue you are raising back into a
presentation one -- just as other flavors of WYSIWYG editors
permit the user to enter text without having to enter or see the
markup that controls formatting, there should be no reason why I
can't type (and see) native-character strings even while
A-labels go into the data files.   As an author, I want the
entries in my files and on my web pages to be as exact as
possible.  Especially considering the few differences in
interpretation between IDNA2008 and IDNA2003 and the potential
for application implementation differences in mapping support,
that implies that I want any mapping to occur where I have
control over it -- in the typing and presentation interfaces
between me and the file -- and to have A-labels in the file where
I do not have control over how something else might interpret
the strings.
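
To make the storage-versus-presentation split concrete, here is a minimal
sketch in Python.  It is only an illustration, not anything the WG has
specified: the function names are made up for this example, and Python's
built-in "idna" codec implements the IDNA2003 rules (including its
mapping step), not IDNA2008, so results for some labels may differ from
what an IDNA2008 implementation would produce.

```python
# Sketch only: to_a_labels and to_display are hypothetical helper names.
# Assumption: Python's built-in "idna" codec (IDNA2003 rules) stands in
# for whatever conversion an editing tool would really use.

def to_a_labels(host: str) -> str:
    """Storage form: convert each label of a host name to its ASCII form."""
    return host.encode("idna").decode("ascii")


def to_display(host: str) -> str:
    """Presentation form: decode any xn-- labels back to Unicode."""
    return host.encode("ascii").decode("idna")


typed = "bücher.example"       # what the author types and sees
stored = to_a_labels(typed)    # what goes into the file
shown = to_display(stored)     # what the tool shows on reopening the file
print(stored, shown)
```

The point of the sketch is that the author only ever deals with "typed"
and "shown", while the file holds "stored".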

Michel's note is, I think, consistent with this point of view.

I suggest that is why the URI spec effectively requires A-labels
for IDNs and why it should not be changed.

The question then becomes one of how far we should take the
standard toward required support for Unicode strings in various
places, not to accommodate authors who are using sophisticated
editing tools, but to accommodate dinosaurs like myself.  My
answer would be "not very far".  I'm willing to type Unicode
strings into files and then pass those files through a converter
(and validator) before passing them to someone else and to
consider the annoyance of doing that to be the price I pay for
refusal to use better tools.  And, if I get tired of that, I
know how to extend my editor so that it supports an IRI -> URI
conversion function that does whatever mapping I consider
necessary (with or without WG specification of that mapping).
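
A converter of the kind described above could look roughly like the
following Python sketch.  Everything here is an assumption made for
illustration: the function name is invented, only the host part is
converted (non-ASCII characters in the path or query are left alone, and
any userinfo is dropped), and the built-in "idna" codec applies the
IDNA2003 mapping rather than anything IDNA2008-specific.

```python
from urllib.parse import urlsplit, urlunsplit


def iri_host_to_a_labels(iri: str) -> str:
    """Rewrite only the host of an IRI so that it uses A-labels.

    Hypothetical helper for illustration.  Path, query, and fragment are
    passed through unchanged; userinfo is not preserved.
    """
    parts = urlsplit(iri)
    host = parts.hostname or ""
    # The "idna" codec encodes label by label; a label it cannot convert
    # raises UnicodeError, which doubles as a crude validation step.
    netloc = host.encode("idna").decode("ascii")
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


print(iri_host_to_a_labels("http://bücher.example:8080/a?b=1"))
```

Running a file's links through something like this before handing the
file to someone else is the "converter (and validator)" step; a real
tool would also need to decide what to do with the rest of the IRI.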

You wrote later...

> There are two sides here, the protocol correctness and
> the content correctness. By content correctness, I mean
> whether the link e.g. goes to the intended page.
> Completely impossible to check with punycode, of course.

I disagree.  Validation of whether the link goes to the intended
page is a separate function I have to invoke, implicitly or
explicitly, to go check the link.  Such a validator should work
at least as well with A-labels ("punycode") as it does with IRIs
or other native-character strings.   The thing that A-labels
interfere with is making a superficial visual check of whether
the URI (or IRI) is plausible.  That can be very important, but
is back to being a presentation issue.  If you really want to
know whether something is going to the intended page, I can
think of no alternative to actually following the link (or
having some agent act for you in doing so).

     john

p.s. note that nothing above addresses the question of what to
do about those who have put IRIs into URI contexts or treated
URIs as if they are IRIs.  As Gerv suggests, one option is to be
very permissive about dealing with those files and then
incrementally tighten up as problems are exposed.   Another, not
necessarily different, is to try to guess what the author meant,
with the understanding that a standards-supported guess that
involves mapping differs from local ingenuity more by degree
than in kind.   One or the other
is likely to be necessary until and unless authors of
applications on the lookup side are willing to say "you violated
the standard and hence get second-rate treatment".   Getting
from a U-label to an A-label is a lot more reliable than getting
from something that is not a U-label to a U-label, but still not
as reliable as having an A-label in the file.
