how did the idna theory start?

John C Klensin klensin at jck.com
Mon Jul 2 14:17:18 CEST 2012



--On Monday, July 02, 2012 05:50 +0000 "Abdulrahman I. ALGhadir"
<aghadir at citc.gov.sa> wrote:

> Oh, this argument raised a question in my mind:
> 
> Why did EAI stop using IDNA as the solution for downgrading
> and go to fully Unicode instead? Don't the same concerns apply
> to it? What I mean is that updating the email protocols will
> require updating the existing legacy machines; wouldn't it be
> wiser to use IDNA and avoid the updating part?

No.

First of all, there are major differences between the DNS
protocols and the email ones.  In the DNS case, one party (a
user or reference) supplies a name, it is looked up by a second
party (a DNS client) using a third-party server (a DNS server,
which may be a non-authoritative cache) that returns a result,
and that result is then used to carry on some action, usually
involving an entirely different protocol.  The result may or may
not be an address record and the action may be any of a variety
of things depending on the record types looked up and returned
and the protocols involved.  If everything in that process
doesn't use the same matching rules and assumptions, very bad
things happen.
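
(To make that chain concrete, here is a minimal sketch in
Python, standard library only.  The name is made up, and the
sketch only illustrates the steps, not any particular
implementation.  Note that the application converts the name to
its "xn--" ACE form first, because the server compares octets,
not Unicode characters.  Python's built-in codec implements the
older IDNA2003 rules, which is close enough for illustration.)

    import socket

    unicode_name = "bücher.example"          # what the user supplies
    ace_name = unicode_name.encode("idna")   # b'xn--bcher-kva.example'
    print(ace_name)

    # The stub resolver (second party) asks a DNS server (third
    # party); the server matches the ACE octets and returns a
    # result for some other protocol to act on.
    try:
        print(socket.getaddrinfo(ace_name.decode("ascii"), None))
    except socket.gaierror as exc:
        print("lookup failed:", exc)   # expected for a made-up name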

In theory, for the IDNA case, we could have used a prefix (to
identify the permitted characters and matching rules -- see
Patrik's note) followed by a UTF-8 string, rather than a prefix
and Punycode encoding.  That would have no real advantage over
IDNA and at least two disadvantages: some increased potential
for BIDI confusion, given the mixture of ASCII and other
characters, and worse encoding efficiency (one of the concerns
discussed during IDNA development was the disadvantage in
maximum label length imposed by UTF-8, especially for East
Asian characters).  The UTF-8 part of that string would also be
completely rejected by most legacy applications or
implementations -- without ever asking the DNS, because many of
those applications make syntax checks.
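
(The length tradeoff is easy to measure.  A hedged sketch in
Python: for each sample label, compare the octet count of raw
UTF-8 against the "xn--" ACE form that IDNA actually uses.  The
labels are arbitrary examples; remember that the DNS limits a
label to 63 octets on the wire, whichever encoding fills it.)

    LABELS = ["bücher", "例え", "пример"]

    for label in LABELS:
        utf8 = label.encode("utf-8")              # raw UTF-8 octets
        ace = b"xn--" + label.encode("punycode")  # IDNA's ACE form
        print(f"{label!r}: UTF-8 {len(utf8)} octets, "
              f"ACE {len(ace)} octets")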

We also were fairly confident that DNS labels involving a
prefix consisting of "xn--" were either not in use at all
(prior to IDNA) or were in use only in a tiny number of cases.
A good deal of work was done to verify that.  By contrast, we
know that all sorts of odd things have been encoded in email
local-parts, using all sorts of odd syntax, and that there was
no way to be at all certain that any selected syntax was not
already in use.

Email is different because there is no third party involved for
local-parts.  The protocol not only treats local-parts as
completely opaque except for interpretation by the final
delivery server, but explicitly prohibits any prior relay or
system from interpreting them.  Relays can make octet-by-octet
comparisons, but the possible results are either "match, same
address" or "doesn't match, no information".  See the discussion
on the EAI list about the mailing list draft and the use of "%"
for one small example of the restrictions this causes.
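
(In code terms, the most a relay may do with a local-part is
roughly the following -- a sketch in Python with made-up
addresses, not a statement about any real MTA:)

    def relay_compare(a: bytes, b: bytes) -> str:
        # Exact octet-by-octet equality is the only safe test;
        # anything short of equality tells the relay nothing,
        # because only the final delivery server knows its own
        # matching rules.
        if a == b:
            return "match, same address"
        return "doesn't match, no information"

    print(relay_compare(b"John.Doe", b"John.Doe"))  # same address
    print(relay_compare(b"John.Doe", b"john.doe"))  # no information:
    # the delivery server alone decides whether case matters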

More important, while the server lookup side of the DNS
protocols is just "string comes in, see if it matches, if so,
return result", the ability to handle an incoming email message
isn't limited to string matching.  A system must be able to
handle non-ASCII header field content (explicitly prohibited by
RFC 5322 and the portions of RFC 5321 that affect header
fields) and to have SMTPUTF8-compliant back-end processing
capability (see the last six months of discussion on the EAI
list about POP and IMAP "downgrading").  So it was necessary to
set the email situation up so that the server explicitly
offered SMTPUTF8 capability and the client then explicitly
turned it on.  That negotiation is pairwise, without the
arbitrary choices of servers (some under client rather than
registrant control) that the DNS imposes.  If that negotiation
fails, then non-ASCII header or address material is impossible
(other than the restricted uses of encoded words in some header
fields), regardless of how it is encoded.  If it is successful,
there is little reason to bother with an encoding or a special
prefix -- it wouldn't add much value and would increase
complications with that "no interpretation prior to final
delivery" rule of SMTP.

> Also, it is a bit confusing when multiple solutions are used
> to solve this issue. Who knows, another protocol might come up
> with a new one. Isn't it better to settle on one method, like
> using IDNA, but instead of using the prefix XN-- use a
> different prefix for different mappings based on each
> protocol's needs?

Sure.  If Unicode (or some other "universal" character coding
system) had been in wide use when the host table name syntax
was designed, or probably even when the DNS was designed, we
possibly would have done things differently.  IMO, we would
have gotten it wrong because, as Patrik and I have pointed out,
we don't fully understand the comparison issues in a general
way even today.  Other CCS models would have made some problems
easier and, almost certainly, others harder.  It may even be
worth noting that we lost two functions in going from the Host
Table to the DNS -- functions that were not considered
important at the time but that haunt us today.  The first was
very strong aliases that permitted asking, not just "what is the
canonical name associated with this domain?" but "what are all
of the aliases associated with this name?".  The second is that,
by retrieving a host table and looking names up in it locally,
local decisions about matching were possible -- something that
is not possible in the DNS because comparison is done on the
server.

It would be very unfortunate if we ended up with per-protocol
encodings and prefixes.  It could happen, because some of the
rules in IDNA -- the characters permitted, the NFC requirement,
and some of the encoding decisions -- are specific to the needs
of the DNS and might not apply to other protocols (the PRECIS
WG is dealing with just those issues).

Patrik points out the possibility of switching prefixes if we
discover that we got things seriously wrong.  While doing so
would be better than the alternative, it would be very painful
for all concerned (some of the issues are discussed in the
section on prefix changes in, IIRC, RFC 5894).  Of course, we
already have some of that problem because URIs use an encoding
("%") which is octet-based rather than Unicode-character-based
(like IDNA or \u[N[N]]NNNN; see RFC 5137).  Unless someone can
roll the clock back, I don't know how to solve that problem,
even though I believe, for example, that the use of an octet
coding overlaid on UTF-8 was a serious mistake (see the IRI WG
discussions, among other things).
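
(The distinction shows up in two lines of Python:
percent-encoding escapes the UTF-8 octets of a character, while
a \u-style escape of the kind RFC 5137 discusses names the
Unicode character itself.)

    from urllib.parse import quote

    ch = "例"                    # U+4F8B
    print(quote(ch))            # '%E4%BE%8B' -- three escaped octets
    print(f"\\u{ord(ch):04X}")  # '\u4F8B' -- one escaped character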

But, at least as to the choice of using UTF-8 directly versus
using the DNS-specific IDNA, the tradeoffs seem fairly obvious.

    john

p.s. While I (and presumably Patrik) have been willing to
respond to your questions in order to draw some of these
explanations together, especially in the light of other work
going on in the IETF, I don't have the time to continue the
discussion.  Most of the material above, especially that related
to encoding decisions in EAI, is in the mailing list archives;
some of it is in assorted semi-tutorial documents such as the
RFCs cited above and RFC 5198.  If you want to dig out the
history, I suggest that you are going to need to go back and do
the reading.





