Definitions limit on label length in UTF-8

Tue Sep 15 02:07:40 CEST 2009

Martin J. Dürst writes:
> Hello John,
>
> [Dave, this is Cc'ed to you because of some discussion relating to
> draft-iab-idn-encoding-00.txt.]
>
> [I'm also cc'ing public-iri at w3.org because of the IRI-related issue at
> the end.]
>
> [Everybody, please remove the Cc fields when they are unnecessary.]
>
>
> Overall, I'm afraid that on this issue, more convoluted explanations
> won't convince me or anybody else, but I'll nevertheless try to answer
> your discussion below point-by-point.
>
> What I (and I guess others on this list) really would like to know is
> whether you have any CONCRETE reports or evidence regarding problems
> with IDN labels that are longer than 63 octets when expressed in UTF-8.
>
> Otherwise, Michel has put it much better than me: "given the lack of
> issues with IDNA2003 on that specific topic there are no reasons to
> introduce an incompatible change".
>
>
> On 2009/09/12 0:47, John C Klensin wrote:
> >
> > --On Friday, September 11, 2009 17:37 +0900 "Martin J. Dürst"
> > <duerst at it.aoyama.ac.jp> wrote:
> >
> >>> (John claimed that the email context required such a
> >>> rule, but I did not bother to confirm that.)
> >> Given dinosaur implementations such as sendmail, I can
> >> understand the concern that some SMTP implementations may not
> >> easily be upgradable to use domain names with more than 255
> >> octets or labels with more than 63 octets. In that case, I
> >> would have expected at least a security warning at
> >> http://tools.ietf.org/html/rfc4952#section-9 (EAI is currently
> >> written in terms of IDNA2003, and so there are no length
> >> restrictions on U-labels).
> >
> > I obviously have not been explaining this very well.  The
> > problem is not "dinosaur implementations"
>
> Okay, good.
>
> > but a combination of
> > two things (which interact):
> >
> > (1) Late resolution of strings, possibly through APIs that
> > resolve names in places that may not be the public DNS.
> > Systems using those APIs may keep strings in UTF-8 until very
> > late in the process, even passing the UTF-8 strings into the
> > interface or converting them to ACE form just before calling the
> > interface.  Either way, because other systems have come to rely
> > on the 63 octet limit, strings longer than 63 characters pose a
> > risk of unexpected problems.  The issues with this are better
> > explained in draft-iab-idn-encoding-00.txt, which I would
> > strongly encourage people in this WG to go read.

Actually, systems using those APIs, which are the "standard" (with a
lowercase s) APIs, may keep strings in UTF-8 (or even UTF-16 for common
but non-"standard" variants) until very late.  For some protocols that
are defined to use UTF-8, e.g. mDNS, they may never convert the strings
at all.

> I have indeed read draft-iab-idn-encoding-00.txt (I sent comments to the
> author and the IAB and copied this list). That document mentions the
> length restrictions, as essentially the only restrictions in DNS itself,
> rather than in things on top of it. That document also (well, mainly)
> discusses the issue of names being handed down into APIs in various
> forms (UTF-8, UTF-16, punycode, legacy encodings,...), and being
> resolved by various mechanisms (DNS, NetBIOS, mDNS, hosts file,...), and
> the problem that these mechanisms may use and expect different encodings
> for non-ASCII characters.
>
> However, I haven't found any mention, nor even a hint, in that document,
> of a need to restrict punycode labels to less than 63 octets when
> expressed in UTF-8.

I agree with the above characterization.
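
To make the distinction concrete, here is a minimal Python sketch of a
label that fits in 63 octets as an A-label but not in UTF-8.  It is only
an illustration under stated assumptions: the sample label is made up,
the helper names are hypothetical, and the standard library's raw
punycode codec stands in for the full U-label to A-label conversion (it
does no nameprep and no IDNA validity checking).

```python
# Hypothetical sample label: one CJK character repeated 30 times.
u_label = "\u65e5" * 30


def utf8_octets(label: str) -> int:
    """Length of the label in octets when expressed in UTF-8."""
    return len(label.encode("utf-8"))


def a_label(label: str) -> str:
    """ACE form via raw Punycode; no nameprep or IDNA validation."""
    if label.isascii():
        return label  # pure ASCII labels are passed through unchanged
    return "xn--" + label.encode("punycode").decode("ascii")


ace = a_label(u_label)
print("characters:     ", len(u_label))
print("UTF-8 octets:   ", utf8_octets(u_label))
print("A-label octets: ", len(ace))
print("fits DNS limit: ", len(ace) <= 63)
```

Each of these characters takes three octets in UTF-8, so the UTF-8 form
is well over 63 octets, while the A-label, which is the only form IDNA
puts on the wire, stays under the limit.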

> The document mentions (as something that might happen, but shouldn't)
> that an application may pass a UTF-8 string to something like
> getaddrinfo, and that string may be passed directly to the DNS. First,
> if this happens, IDNA has already lost.

I don't agree with the "shouldn't".  It certainly was not the intent of
draft-iab-idn-encoding-00.txt to state whether this "shouldn't" happen,
only that it "can" happen (and perhaps "does").  There's also a potential
argument in the doc that this is not harmful (see the 2nd paragraph of
section 4, for instance, and extrapolate from there).

> Second, whether the string is
> UTF-8 or pure ASCII, if the API isn't prepared to handle labels longer
> than 63 octets and overall names longer than 255 octets defensively
> (i.e. return something like 'not found'), then the programmer should be
> fired. Anyway, in that case, the problem isn't with UTF-8.
>
> What draft-iab-idn-encoding-00.txt essentially points out is that
> different name resolution services use different encodings for
> non-ASCII characters, and that currently different users (meaning
> applications) of a name resolution API may assume different encodings
> for non-ASCII characters, which creates all kinds of chances for errors.
> Some heuristics may help in some cases, but the right solution (as with
> all cases where characters, and in particular non-ASCII ones, are
> involved) is to clearly say where which encoding is used. A very simple
> example for this is GetAddrInfoW, which assumes UTF-16.
>
> The only potential problem that I see from the discussion in
> draft-iab-idn-encoding-00.txt is the following: Some labels containing
> non-ASCII characters that fit into 63 octets in punycode and therefore
> can be resolved with the DNS may not be resolvable with some other
> resolution service because that service may use a different encoding
> (and may or may not have different length limits).
>
> I have absolutely nothing against some text in a Security Considerations
> section or in Rationale pointing out that if you want to set up some
> name or label for resolution via multiple different resolution services,
> you have to take care that you choose your names and labels so that they
> meet the length restrictions for all those services. But that doesn't
> imply at all that we have to artificially restrict the length of
> punycode labels by counting octets in UTF-8.

Completely agree with all of the above.  I think a brief discussion of
this issue may make sense in the next version of draft-iab-idn-encoding,
if we can get IAB consensus on text.

> > (2) The "conversion of DNS name formats" issue that has been
> > extensively discussed as part of the question of alternate label
> > separators (sometimes described in our discussions as
> > "dot-oids").  Applications that use domain names, including
> > domain names that are not going to be resolved (or even looked
> > up), must be able to freely and accurately convert between
> > DNS-external (dot-separated labels) and DNS-internal
> > (length-string pairs) formats _without_ knowing whether they are
> > IDNs or not.
>
> I'm not exactly sure what you mean here. If you want to say "without
> checking whether they contain xn-- prefixes and punycode or not", then I
> can agree, but that cannot motivate a UTF-8 based length restriction.

Right.  I'm not sure why most "applications" would care about
DNS-internal (length-string pairs) formats at all, rather than only about
NUL-terminated strings (containing dot-separated labels) that get passed
to getaddrinfo-like functions.  Most applications are (and should be)
oblivious to the fact that DNS or some other protocol is used for
resolving names.
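
For reference, the external-to-internal conversion under discussion can
be sketched in a few lines of Python.  This is a rough sketch, not any
particular resolver's code: the function name to_wire is hypothetical,
and it assumes the name has already been converted to ASCII (i.e. any
IDN labels are already A-labels).  The two limits are the ones from
RFC 1035.

```python
MAX_LABEL_OCTETS = 63   # RFC 1035: length octet has its top two bits clear
MAX_NAME_OCTETS = 255   # RFC 1035: whole name, counting the length octets


def to_wire(name: str) -> bytes:
    """Convert a dot-separated ASCII name to length-prefixed labels."""
    if name.endswith("."):
        name = name[:-1]          # drop the single trailing root dot
    labels = name.split(".") if name else []
    out = bytearray()
    for label in labels:
        raw = label.encode("ascii")   # non-ASCII input is rejected here
        if not 1 <= len(raw) <= MAX_LABEL_OCTETS:
            raise ValueError(f"bad label length {len(raw)}: {label!r}")
        out.append(len(raw))      # one length octet, then the label itself
        out += raw
    out.append(0)                 # zero-length root label ends the name
    if len(out) > MAX_NAME_OCTETS:
        raise ValueError(f"name too long: {len(out)} octets")
    return bytes(out)


print(to_wire("www.example.com"))
```

The length checks operate on whatever octets end up in the labels.  For
IDNA those are A-label octets, so a defensive check at this layer never
needs to know how long the UTF-8 form was.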

> If you say that applications, rather than first converting U-label ->
> A-label and then converting from dot-separated to length-string
> notation, have to be able to first convert to length-string notation and
> then convert U-labels to A-labels, then I contend that nobody in their
> right mind would do it that way, and even less if "dot-oids" are
> involved. For starters, U-labels don't have a fixed encoding.
>
> > As discussed earlier, one of several reasons for
> > that requirement is that, in non-IDNA-aware contexts, labels in
> > non-IDNA-aware applications or contexts may be perfectly valid
> > as far as the DNS is concerned, because the only restriction the
> > DNS (and the normal label type) imposes is "octets".
>
> If and where somebody has binary labels, of course these binary labels
> must not be longer than 63 octets. But IDNA doesn't use binary labels,
> and doesn't stuff UTF-8 into DNS protocol slots, so for IDNA, any length
> restrictions on UTF-8 are irrelevant.
>
> > That
> > length-string format has a hard limit of 63 characters that can
> > be exceeded only if one can figure out how to get a larger
> > number into six bits (see RFC1035, first paragraph of Section
> > 3.1, and elsewhere).
>
> I very well know that the 63 octets (not characters) limit is a hard
> one. In the long run, one might imagine an extension to DNS that uses
> another label format, without this limitation, but there is no need at
> all to go there for this discussion.
>
> > If we permit longer U-label strings on the
> > theory that the only important restriction is on A-labels, we
> > introduce new error states into the format conversion process.
>
> For IDNA, only A-labels get sent through the DNS protocol, so only
> there are the length restrictions for labels relevant. If somebody gets
> this wrong in the format conversion process (we currently don't have any
> reports on that), then that's their problem (and we can point it out in
> a Security section or so).
>
> > If this needs more explanation somewhere (possibly in
> > Rationale), I'm happy to try to do that.  But I think
> > eliminating the restriction would cause far more problems than
> > it is worth.
>
> It hasn't caused ANY problems in IDNA2003. There is nothing new in
> IDNA2008 that would motivate a change. *Running code*, one of the
> guidelines of the IETF, shows that the restriction is unnecessary.
>
>
> > I note that, while I haven't had time to respond, some of the
> > discussion on the IRI list has included an argument that domain
> > names in URIs cannot be restricted to A-label forms but must
> > include %-escaped UTF-8 simply because those strings might not
> > be public-DNS domain names but references to some other database
> > or DNS environment.
>
> It's not 'simply because'. It's first and foremost because of the
> syntactic uniformity of URIs, and the fact that it's impossible to
> identify all domain names in a URI (the usual slot after the '//' is
> easy, scheme-specific processing (which is not what URIs and IRIs are
> about) may be able to deal with some of 'mailto', but what do you do
> about domain names in query parts?). Also, this syntax is part of RFC
> 3986, STD 66, a full IETF Standard.
>
> Overall, it's just a question of what escaping convention should be
> used. URIs have their specific escaping convention (%-encoding), and DNS
> has its specific escaping convention (punycode).
>
> Also please note that the IRI spec doesn't prohibit the use of punycode
> when converting to URIs.
>
> In addition, please note that at least my personal implementation
> experience (adding IDN support to Amaya) shows that the overhead of
> supporting %-encoding in domain names in URIs is minimal, and helps
> streamline the implementation.
>
> > It seems to me that one cannot have it
> > both ways -- either the application knows whether a string is a
> > public DNS reference that must conform _only_ to IDNA
> > requirements (but then can be restricted to A-labels) or the
> > application does not know and therefore must conform to DNS
> > requirements for label lengths.
>
> There is absolutely no need to restrict *all* references just because
> *some of them* may use other resolver systems with other length
> restrictions (which may be "63 octets per label when measured in UTF-8"
> or something completely different). It would be very similar to saying
> "Some compilers/linkers can only deal with identifiers 6 characters or
> shorter, so all longer identifiers are prohibited."

I agree with that.

> > For our purposes, the only
> > sensible way, at least IMO, to deal with this is to require
> > conformance to both sets of rules, i.e., 63 character maximum
> > for A-labels and 63 character maximum for U-labels.
>
> As far as I understand punycode, it's impossible to encode a Unicode
> character in less than one octet. This means that a maximum of 63
> *characters* for U-labels is automatically guaranteed by a maximum of 63
> characters/octets for A-labels.
>
> However, Defs clearly says "length in octets of the UTF-8 form", so I
> guess this was just a slip of your fingers.
>
> Regards,    Martin.
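
Martin's counting argument is easy to check mechanically.  The sketch
below is an illustration only: the sample labels are arbitrary, and the
raw punycode codec from the Python standard library is used without
nameprep or IDNA validation.  It compares the number of characters in
each label with the length of its punycode form.

```python
# Arbitrary sample labels covering a few scripts and a mixed case.
samples = ["b\u00fccher", "\u65e5\u672c\u8a9e", "\u043f\u0440\u0438\u043c\u0435\u0440", "a" * 10 + "\u00e9"]

rows = []
for u_label in samples:
    encoded = u_label.encode("punycode").decode("ascii")
    rows.append({
        "chars": len(u_label),                   # Unicode code points
        "utf8": len(u_label.encode("utf-8")),    # octets in UTF-8
        "ace": len("xn--" + encoded),            # octets as an A-label
        "puny": len(encoded),                    # octets without the prefix
    })

for u_label, row in zip(samples, rows):
    print(u_label, row)

# Punycode emits at least one output character per input code point,
# so the character count can never exceed the encoded length.
print(all(row["chars"] <= row["puny"] for row in rows))
```

So a 63-octet cap on A-labels already caps U-labels at 63 characters;
only the UTF-8 octet count is a genuinely separate restriction.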

-Dave