Distributed configuration of "private" IDNA (Re: IDNA and getnameinfo() and getaddrinfo())

Nicolas Williams Nicolas.Williams at oracle.com
Thu Jun 17 19:55:06 CEST 2010


On Wed, Jun 16, 2010 at 09:28:34PM -0400, John C Klensin wrote:
> --On Wednesday, June 16, 2010 16:54 -0500 Nicolas Williams
> <Nicolas.Williams at oracle.com> wrote:
> > So, to resolve tést.{foó, foóbar, óther}.example. the
> > _resolver_ would first have to split the input string into
> > labels using whatever fullstops are legal in the current
locale, then look up each of those domains' IDNA rules in the
> > example. TLD zone, do whatever codeset conversions and
> > pre-processing may be required to meet the rules found, then
> > do the next query.  And so on.
> 
> Well, remember that, if fullstops are not global, one needs to
> be very careful to keep local ones from leaking.  If they do

Since I was concerning myself with the DNS protocol in particular, there
is no such concern: full stops don't appear in the DNS protocol itself,
where names are carried as sequences of length-prefixed labels.
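
To make that concrete, here is (as a C byte string of my own
construction, not a capture) how a name is laid out in a DNS message;
each label is preceded by its length, and the dots the user typed never
appear on the wire:

    /* "www.example.com" as carried in a DNS message: length-prefixed
     * labels, no dots; the literal's implicit trailing '\0' doubles as
     * the zero-length root label that terminates the name.  (Adjacent
     * literals keep the hex escapes from swallowing the label text.) */
    static const unsigned char wire_name[] =
        "\x03" "www" "\x07" "example" "\x03" "com";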

> leak, a parser that tries to separate an FQDN into labels will
> end up with a high error rate.  That would make the bad guys,
> who have lots of fun with URLs that trick users into believing
> that third- or fourth-level names are really second-level ones,
> very happy.  I trust their happiness is not our goal.

Very good point.  Full stops need to be globally defined for all
locales.

Of course, my proposal was a strawman, intended primarily to show that
we cannot be expected to support private DNS clouds with non-standard
IDN rules.

> > Sounds good, BUT there are issues w.r.t. stub resolvers and
> > caching: stub resolvers suddenly have to get pretty fancy,
> > even if they are using caching servers, because suddenly
> > recursive caching servers are not useful for looking up IDNs!
> 
> Right.  And, if you start thinking about DNAME and other things
> that prevent you from knowing definitively which tree someone
> thinks that a name/label is in, the difficulties with caching
> servers start looking easy.   Remember that there is not even an
> inherent DNS restriction that would prevent having a label in a
> private namespace for a DNAME RR whose Data points into the
> public DNS. 

I've not thought about that enough, but I suspect that one could set
up the IDN meta-rules so that this is not a problem: each label in any
(and all) FQDNs needs to be handled according to the IDN rules
advertised for the zone that contains it, and every FQDN that appears
in any one zone must be encoded according to that zone's advertised
IDN rules.

> Could the mess have been avoided if the implications of the
> native UTF-8 (and other native encodings, such as direct use of
> 8859-1) had been known and analyzed when the IDNA work was being
> done?  Well, perhaps, but actually I have serious doubts.  The
> public-DNS TLDs that were selling 8859-1 names prior to IDNA2003
> really didn't care -- they were in the name-selling business
> and, if some of those names weren't able to be used in
> applications... well, buyer beware.  The decision to wrap IDNA
> around an ACE was made fairly consciously and with a moderately
> good understanding of what we were getting into.  If we had
> understood that better, or made different tradeoffs, the answers
> might have come out a little different but I don't think very
> much.  And, while the Punycode algorithm and encoding takes the
> heat in the current draft, it is difficult to understand how any
> other ACE encoding would have been much better.

I think we could have insisted on using UTF-8 on the wire in DNS.
Yes, that would have taken time to deploy, as plenty of legacy
implementations would have needed upgrading.  However, it's taken a
very long time for IDNA to be adopted as well (particularly outside
web browsers), and DNS security vulnerabilities have meant that many,
if not most, legacy DNS deployments got updated anyway.  We could then
have avoided ACE and Punycode altogether.

However, I'm not proposing that we cry over spilled milk.  If you've
read my posts on this list in the past week then you know that I'm
trying to make IDNA easier on applications by promoting better APIs (see
my comments regarding getaddrinfo() and getnameinfo()).

> Now, this particular mess could have been avoided almost
> entirely had the IDN WG decided to use UTF-8 in the DNS instead
> of going through Nameprep and an ACE.  The WG decided to not do
> that, partially because it, perhaps unlike some of the private
> implementations that are now using UTF-8 directly, understood
> that user expectations and matching issues required
> normalization and careful attention to matching procedures and
> that getting the DNS to do that and applications to accept it
> would result in a _very_ long implementation and deployment
> curve.  And the WG decided that deployment time was important
> and that a long time before general availability was
> intolerable.  Real tradeoff there.

_That_ understanding, that the names themselves have to be normalized,
was sorely _mistaken_.

Don't get me wrong: of course Unicode normalization matters.

But what we all missed for so long was that normalization-insensitive
matching is possible, just as case-insensitive matching is.  I know this
because that's exactly what we implemented in ZFS in OpenSolaris for
filename lookups.

The traditional DNS was case-insensitive/case-preserving.  Making a
Unicode-aware DNS normalization-insensitive/normalization-preserving
would, in _retrospect_, have been equally feasible.
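
To illustrate what I mean (a minimal sketch using ICU, which is not
what ZFS uses internally; the details and error handling are mine):
canonically equivalent strings can be matched without normalizing,
copying, or modifying either input.

    /* Compare precomposed U+00E9 against its decomposed form
     * U+0065 U+0301.  unorm_compare() matches under canonical
     * equivalence: neither input has to be pre-normalized, and both
     * are left untouched, i.e., insensitive *and* preserving. */
    #include <unicode/unorm.h>
    #include <stdio.h>

    int main(void)
    {
        static const UChar nfc[] = { 0x00E9, 0 };
        static const UChar nfd[] = { 0x0065, 0x0301, 0 };
        UErrorCode err = U_ZERO_ERROR;

        if (unorm_compare(nfc, -1, nfd, -1, 0, &err) == 0 &&
            U_SUCCESS(err))
            puts("canonically equivalent");
        return 0;
    }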

However, we're now _stuck_ with ACE, which means we're stuck with
_clients_ (not servers) having to casefold and normalize IDNs, which
results in different semantics from those of the traditional DNS.  I
think we can live with this, if nothing else because now we must :/

This mistake was really the result of the Unicode Consortium focusing
on the process of normalization and not on how it would be used by
_developers_.  A straightforward implementation of normalization as
described by the UC requires allocating memory in many cases, and when
it doesn't it may still destructively modify the input string; both
are extremely undesirable side effects when all one wants to do is
compare two strings.
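
For contrast with the comparison above, here is the "straightforward"
pattern that framing pushes developers toward, again sketched with ICU
(the helper's name and error handling are mine): preflight, allocate,
normalize, and only then compare.  A caller comparing two strings pays
for two of these copies, which is exactly the overhead a direct
normalization-insensitive compare avoids.

    /* Naive normalize-then-compare support routine: produce a freshly
     * malloc()ed NFC copy of s.  Comparing two strings means calling
     * this twice and then u_strcmp()ing (and freeing) the results. */
    #include <unicode/unorm.h>
    #include <stdlib.h>

    UChar *normalize_dup(const UChar *s)
    {
        UErrorCode err = U_ZERO_ERROR;
        UChar *buf;
        int32_t n;

        /* Preflight to learn the required buffer size. */
        n = unorm_normalize(s, -1, UNORM_NFC, 0, NULL, 0, &err);
        if ((buf = malloc((n + 1) * sizeof (UChar))) == NULL)
            return NULL;

        /* Now normalize for real, into the fresh allocation. */
        err = U_ZERO_ERROR;
        unorm_normalize(s, -1, UNORM_NFC, 0, buf, n + 1, &err);
        if (U_FAILURE(err)) {
            free(buf);
            return NULL;
        }
        return buf;
    }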

I keep coming back to this: we need to consider APIs, we need to
consider the real world as it looks to _developers_, the people who
write the bloody code.  By "we" I mean: standards-setting organizations.

If the UC had described normalization-insensitive string comparison a
decade ago, then more lightbulbs might have gone off a decade ago.

In retrospect, to me, normalization-insensitive string comparison is a
blindingly obvious idea.  Of course, having been in the thick of it when
we decided to go that way in ZFS, I know it wasn't really that obvious.
But I believe it likely would have been if the UC had considered things
from developers' points of view.

> Note that one of the advantages private namespaces have over
> public ones is that they are typically fairly homogeneous wrt
> software, management, or both.   [...]

Are they really homogeneous?  I doubt it.  Or at least I doubt that
they'll stay that way indefinitely.  Deploying private namespaces with
alternative IDN rules seems like a terrible idea to me, something we
should discourage.

> But, if we had a situation in which the public namespaces were
> using IDNA2003 UTF-8 strings, and the private ones were using
> unmodified/ unmapped UTF-8 strings, we would still have a
> problem because we could get false matches in both environments
> depending on the assumptions made.  [...]

Not at all!  We'd have invented normalization-insensitivity sooner to
deal with that.

(Other differences involving various mappings and codepoint prohibitions
would have been few and far between, and also best handled by having
clients send _raw_ UTF-8, with servers implementing whatever
mappings/prohibitions might be needed.)

> One more recent set of decisions is reminiscent of the IDNA
> ACE/Punycode one.  If there were no IDN TLDs and, preferably, a
> very small and infrequently-changing number of TLDs total, then
> it would be fairly easy to devise ways to distinguish between
> UTF-8-using private namespaces and A-label-using public ones.

If we ever consider that, then my proposal in this sub-thread should get
serious consideration.  However, I hope we don't.

> ICANN has not seemed to be very interested in that issue and the
> tradeoffs it implies.

Good.

> In this context, Shawn wrote:
> 
> > The good thing about Punycode/IDN is that it enabled DNS.  The
> > bad thing is that suddenly any network app needs to become a
> > DNS expert.

Again, and again, I'll keep coming back to this: better APIs can help
avoid this.  The whole reason I subscribed and started posting to this
list last week was that we need improved APIs.  Your I-D describes the
problem and hints at solutions, but it targets Informational status.
Instead I propose that we pursue some Standards-Track APIs, as in
Simon's IDNA-extensions-for-getaddrinfo()/getnameinfo() I-D.
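
As one concrete, already-deployed example of the kind of API I mean
(this is glibc's non-standard extension, which I believe grew out of
Simon's libidn work; see the I-D for the proposed standard semantics):
the application hands getaddrinfo() the name in its locale's encoding,
and the IDNA conversion happens inside the resolver library.

    /* Sketch: glibc's AI_IDN flag asks getaddrinfo() itself to
     * IDNA-encode the node name.  Assumes a UTF-8 locale. */
    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netdb.h>
    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        struct addrinfo hints, *res;
        int rc;

        setlocale(LC_ALL, "");          /* pick up the user's locale */
        memset(&hints, 0, sizeof hints);
        hints.ai_family = AF_UNSPEC;
        hints.ai_socktype = SOCK_STREAM;
        hints.ai_flags = AI_IDN;        /* library converts to ACE */

        rc = getaddrinfo("t\xc3\xa9st.example", "http", &hints, &res);
        if (rc != 0) {
            fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
            return 1;
        }
        freeaddrinfo(res);
        return 0;
    }

The application never sees Punycode.  Whether those particular flags
are the right design is precisely what a Standards-Track effort should
settle.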

> Borrowing a theme from another discussion that has been going on
> in parallel, the good thing about getnameinfo and getaddrinfo
> is that they enable IPv6.  The bad thing is that suddenly any
> network app needs to become a routing preferences expert.   As

Really?  I don't see the analogy.

> Ned Freed pointed out in that context, if you really want this
> to be transparent to the application, the relevant interface is
> some flavor of "SetupConnectionByName" with which the
> application starts with an opaque name and then, subject to some
> parameters or function-name variations, ends up with a
> connection.  Sadly, taking away the need for expert knowledge of
> the DNS alone really doesn't help a lot.

Exactly!  Ned's "SetupConnectionByName" is an example of "better APIs".
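
To make "better APIs" concrete, here's a minimal sketch of what such a
call could look like if layered over today's primitives (the name and
shape are hypothetical, borrowed from Ned's description; a production
version would add timeouts, address-selection policy, and richer error
reporting):

    /* Hypothetical connect-by-name helper: the caller passes an
     * opaque name and service and gets back a connected socket, with
     * resolution and the connect loop hidden inside. */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netdb.h>
    #include <string.h>
    #include <unistd.h>

    int setup_connection_by_name(const char *name, const char *service)
    {
        struct addrinfo hints, *res, *ai;
        int fd = -1;

        memset(&hints, 0, sizeof hints);
        hints.ai_family = AF_UNSPEC;    /* IPv4 and IPv6 alike */
        hints.ai_socktype = SOCK_STREAM;

        if (getaddrinfo(name, service, &hints, &res) != 0)
            return -1;

        for (ai = res; ai != NULL; ai = ai->ai_next) {
            fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
            if (fd == -1)
                continue;
            if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
                break;                  /* connected */
            close(fd);
            fd = -1;
        }
        freeaddrinfo(res);
        return fd;                      /* connected socket, or -1 */
    }

Resolution, address-family selection, and the connect loop all
disappear behind the opaque name, which is exactly where IDNA
processing belongs too.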

Nico
-- 

