Re: idna folding (was Re: idna-bis and '゜')

John C Klensin klensin at jck.com
Tue Dec 18 05:20:04 CET 2007


Erik,

Let me fill in a few gaps and then try to respond to the
important part of your comments...

--On Monday, 17 December, 2007 18:38 -0800 Erik van der Poel
<erikv at google.com> wrote:

> Hi Vint,
> 
> Yes, I've been rather focussed on the web, sorry about that. I
> haven't been involved in the email i18n discussions, but let's
> consider it. It's my understanding that when an SMTP server
> supports the UTF-8 option, the client may send the recipient's
> email address in UTF-8 in the envelope.

Yes: local-part, domain-part, or both.

> Since SMTP servers
> typically do not allow any user to edit these addresses upon
> receipt, they may try to verify them automatically. 

Unless one invokes the "protect one's system from attack"
exception, they are quite constrained about even that
verification.   If the SMTP server involved is not the "final
delivery server", it basically cannot touch the local-part.  On
the other hand, if it gets a forward-pointing domain-part in
UTF-8, it has to convert it to ACE form (punycode) in order to do
the DNS lookup and figure out how to route the mail.  Actions
wrt reverse-pointing envelope addresses are a little more
controversial, but it is fairly common practice today to at
least verify the domain name as a hint about whether the mail is
legitimate and/or any delivery or non-delivery notification
messages will be deliverable.  Again, looking up the domain name
requires conversion of the UTF-8 address to ACE form.
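
As an illustration of that conversion step, here is a minimal
sketch in Python. It assumes Python's built-in "idna" codec, which
follows IDNA2003 (RFC 3490); the helper name domain_to_ace and the
sample address are hypothetical, and a real SMTP server would do
considerably more validation than this.

```python
def domain_to_ace(address: str) -> str:
    """Return the ACE form of the domain-part of a UTF-8 address.

    Hypothetical helper for illustration. The local-part is split
    off and left untouched, since an intermediate server should not
    interpret it.
    """
    # Everything after the last "@" is treated as the domain-part.
    _local_part, _, domain = address.rpartition("@")
    # The "idna" codec applies nameprep and Punycode label by label.
    return domain.encode("idna").decode("ascii")


ace = domain_to_ace("user@bücher.example")
print(ace)  # xn--bcher-kva.example
```

The resulting ASCII name is what would be handed to the DNS
resolver for the routing lookup.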

> Perhaps
> the server would want to convert the UTF-8 host name to
> Punycode, to verify it. This may be a situation where it would
> first apply case mappings and NFKC a la IDNA2003 (or some
> newer spec). Or maybe the SMTP UTF-8 spec explicitly forbids
> that or specifies something else, I don't know.

It doesn't say much about interpretation of domain names other
than, essentially, "follow the IDNA spec".

> In any case, it would be great if all such "automatic"
> conversions (i.e. without end-user editing) were done
> according to a single mapping spec, whether the app is a web
> browser, email, or whatever. Does this seem like a good idea?

While the answer is "yes" at first glance, on reflection it
actually doesn't.

Let's put the issues with domain names in the middle of XML or
HTML files and how we get from "there" to "here" aside for the
moment and look at our experience with protocol design.   That
experience says that the fewer variations we have in how things
can be expressed on the wire, the better off we end up being
from an interoperability standpoint.   That principle, of
course, applies far more broadly than to IDNs.  We force email
into a single line-ending convention and have troubles when it
is violated.  Many, if not most, recommendations about
transmission of Unicode files suggest normalization before
transmission (or storage), rather than having unnormalized forms
floating around with the hope that the system that tries to use
the files will straighten them out.  SMTP insists that message
bodies that are not in an Internet-standard format be converted
to one as soon as they enter the network, not after they transit
it and reach a target machine.

Even IDNA2003 requires that every domain name label be converted
to a standard form -- the ACE one -- before being transmitted to
the DNS or compared to another label.  These mappings are
many-to-one onto the standard form, not variations that maintain
a separate integrity over operations that involve the network.
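
To make the many-to-one point concrete, the following sketch
feeds several spellings of one name through the IDNA2003 mapping
and collects the results. It assumes Python's built-in "idna"
codec as a stand-in for ToASCII; the sample names are invented
for illustration.

```python
# Several spellings that IDNA2003 treats as the same domain name.
variants = [
    "bücher.example",        # already in final form
    "BÜCHER.example",        # upper case
    "bu\u0308cher.example",  # "u" followed by a combining diaeresis
    "\uff42ücher.example",   # fullwidth "b"
    "bücher\u3002example",   # ideographic full stop as label separator
]

# Case mapping and NFKC collapse every variant onto one ACE string.
ace_forms = {name.encode("idna") for name in variants}
print(ace_forms)  # {b'xn--bcher-kva.example'}
```

Five inputs, one output: once the ACE form is produced, there is
no way to tell which of the variants it came from.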

So, I think there is no question that it would be good if all
possible forms of a given domain name were converted to a single
standard form before being transmitted over the network.  I
think it would be good if all of those conversions were done in
exactly the same way (even though I'm skeptical about how
achievable that is in practice) -- see my comments, responding to
yours, about an extra document.

Where we may disagree is about the contexts in which the variant
forms (those that cannot be regenerated after ToASCII
conversion) should be permitted to be transferred over the
network.  With XML and HTML files, there is a gray area because
the question of what is transferred is a little ambiguous.  But
with email, all of the precedents and all of the experience
suggest that one should be transmitting only strings that are in
reduced, final form as understood by the destination server.
And that, in turn, requires or at least strongly suggests that
domain names be either in ACE form or in a form that can be
obtained by processing the ACE form back through ToUnicode (or
its IDNAbis equivalent).
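
One way to picture that test is a round-trip check, sketched
below. It assumes Python's built-in "idna" codec (IDNA2003) in
place of ToASCII and ToUnicode; the function name is_final_form
is hypothetical, and an IDNAbis implementation might draw the
line differently.

```python
def is_final_form(domain: str) -> bool:
    """True if `domain` is in ACE form or is what ToUnicode returns.

    Hypothetical check for illustration, not a complete validator.
    """
    try:
        ace = domain.encode("idna")  # ToASCII on each label
    except UnicodeError:
        return False                 # not convertible at all
    # Accept the ACE string itself, or the Unicode string that
    # ToUnicode regenerates from it.
    return domain == ace.decode("ascii") or domain == ace.decode("idna")


print(is_final_form("bücher.example"))         # True
print(is_final_form("xn--bcher-kva.example"))  # True
print(is_final_form("BÜCHER.example"))         # False
```

The upper-case variant fails because the round trip yields the
lower-case name, which is exactly the sense in which it "cannot
be regenerated after ToASCII conversion".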

One can certainly reach a different conclusion, but I suggest
that our operational experience implies that one needs a much
stronger reason than "it would be great" or "someone would like
to do it" to justify sending the non-final forms over the
network.

      john

More information about the Idna-update mailing list