IDNAbis compatibility

John C Klensin klensin at jck.com
Mon Apr 2 21:16:24 CEST 2007


Mark,

We are still not understanding each other.  Let me try again.

--On Sunday, 01 April, 2007 17:09 -0700 Mark Davis
<mark.davis at icu-project.org> wrote:

> I don't see this as a UI issue. Many programs process web
> pages, and depend
> on a correct interpretation of the HTML attribute
> href="<someURL>". These
> include not only browsers, but many other processes (like our
> search engine
> at Google), where no human is involved. And even for a
> browser, what URL
> gets used when you click on a link in a page should be
> predictable.

> Leaving the mapping from the URL to what is sent to the DNS
> up to the whim of the program doesn't seem to be a good
> thing, at least to me.
>...

Do keep a few things in mind.  First, there are no links whose
definitions _change_ with the current IDNA200x proposal relative
to IDNA2003.  Some become invalid, but there are no instances in
which there is any ambiguity in "what URL gets used when [one]
clicks on a link", nor about program whims about what to send to
the DNS given that _something_ is going to be sent.  If a URL,
as written, is not conformant to IDNA200x, a program that is
considering processing it may either try to interpret it,
applying IDNA2003 rules, or may reject it.  Of course, in theory
such a program could apply some other set of rules entirely,
mapping characters that we would consider unrelated together.
For example, one might map Thai characters onto Latin ones. That
would be dumb but, more important, there is no way to ban its
being done even under IDNA2003.

Second, while mapping width-dependent characters to a single
form is certainly safe, and mapping case to "lower" is almost as
safe, there is great merit in reversibility of the ToASCII and
ToUnicode algorithms or their successors and in requiring that
URIs that contain domain names do so in minimal form (rather
than in "whatever form IDNA2003 can handle") going forward.  We
know that mapping everything to lower case loses information:
not just what the original case was, but there are situations in
which more than one upper-case character maps to a single
lower-case one... and some of those situations are
language-dependent as to whether the mapping occurs. One can try
to ignore those language dependencies, as IDNA2003 does, but
that is one of the sources of real user confusion about what
happens with the IDNA transformations and what characters are
permitted in the DNS.
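As a concrete illustration of that information loss, the sketch below uses only Python's built-in Unicode tables (it is not part of any IDNA implementation) to show a width mapping via NFKC, two distinct upper-case characters that lower-case to the same letter, and the locale-blind fold that a language-independent mapping like Nameprep's is forced to choose:

```python
import unicodedata

# Width mapping: NFKC folds FULLWIDTH LATIN CAPITAL LETTER A
# (U+FF21) onto plain ASCII 'A'.
assert unicodedata.normalize("NFKC", "\uff21") == "A"

# Case mapping loses information: both LATIN CAPITAL LETTER K
# (U+004B) and KELVIN SIGN (U+212A) lower-case to 'k', so the
# original character cannot be recovered from the result.
assert "\u004b".lower() == "k"
assert "\u212a".lower() == "k"

# Likewise U+00C5 and ANGSTROM SIGN (U+212B) both become 'å'.
assert "\u00c5".lower() == "\u00e5"
assert "\u212b".lower() == "\u00e5"

# Language dependence: Turkish lower-cases 'I' to dotless 'ı',
# but a locale-blind fold always answers 'i'.
assert "I".lower() == "i"
```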

We also know that some application developers have decided to
display the lower-case forms to users even when links are
written with the upper-case forms, because they are convinced
that the lower-case forms are less subject to phishing and that
rendering something different from what has been written in the
text or href is the lesser of the two evils.  All of this gives
IDNs an unfortunate air of unpredictability.  The question is
what to do about it.

If the argument that you seem to be making were taken to its
logical conclusion, we could make no changes at all: almost
certainly, someone, somewhere, has violated a guideline,
appropriated an unassigned character, or taken advantage of a
strange mapping.  Since some of that may have occurred in domain
subtrees that cannot be searched, we would have no way to find
out about them until they caused problems.

Looked at from a different perspective, a sensible
implementation that needed to deal with lots of legacy URIs in a
global context would presumably want to apply the IDNA2003
mappings for an indefinite period, secure in the understanding
that those mappings would not produce different results after
IDNA200x was deployed than they did earlier.  If the relevant
organization were concerned about the Internet, it would
probably also try to campaign to get those URLs updated and to
be sure that any URLs it generated, rather than used, were in
the newer, reduced, form.  That would be true whether there were
a million URLs written in terms of forms that IDNA2003 maps to
others or only a handful of them.

Alternately, we could keep these two sets of mappings in the
standard.  The disadvantage of doing so is that we would have
more cases that lose information and don't reverse-map back to
the originals and that we might then need a separate rule --
arguably harder to police and understand -- prohibiting domain
names in IRIs that could not be obtained by applying ToUnicode
to a Punycode-encoded domain name.   The advantage is that every
URI-reading interface would treat case and width more or less
the same way... while not applying that consistency to other
mappings required by IDNA2003.  I've obviously got a strong
preference in favor of what I consider consistency and good
behavior over the long term, but this is also not, IMO, the most
critical decision we face.
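The round-trip test behind that rule can be sketched with Python's standard-library "idna" codec, which implements the IDNA2003 ToASCII and ToUnicode operations: a name is in the minimal (reduced) form exactly when ToUnicode(ToASCII(name)) returns it unchanged.  The helper name below is my own, for illustration only:

```python
def is_minimal_form(name: str) -> bool:
    """Return True if `name` survives an IDNA2003 ToASCII /
    ToUnicode round trip unchanged, i.e. it is already in the
    reduced form that re-decodes to itself."""
    return name == name.encode("idna").decode("idna")

# Already lower-case and half-width, so it round-trips intact.
assert is_minimal_form("bücher.example")

# Nameprep lower-cases 'B' on the way in, so ToUnicode of the
# ACE form yields 'bücher', not the original 'Bücher'.
assert not is_minimal_form("Bücher.example")
```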

    john


