Definitions limit on label length in UTF-8
"Martin J. Dürst"
duerst at it.aoyama.ac.jp
Sat Sep 12 05:14:12 CEST 2009
[Dave, this is Cc'ed to you because of some discussion relating to
[I'm also cc'ing public-iri at w3.org because of the IRI-related issue at
[Everybody, please remove the Cc fields when they are unnecessary.]
Overall, I'm afraid that on this issue, more convoluted explanations
won't convince me or anybody else, but I'll nevertheless try to answer
your discussion below point-by-point.
What I (and I guess others on this list) really would like to know is
whether you have any CONCRETE reports or evidence regarding problems
with IDN labels that are longer than 63 octets when expressed in UTF-8.
Otherwise, Michel has put it much better than me: "given the lack of
issues with IDNA2003 on that specific topic there are no reasons to
introduce an incompatible change".
On 2009/09/12 0:47, John C Klensin wrote:
> --On Friday, September 11, 2009 17:37 +0900 "\"Martin J.
> Dürst\""<duerst at it.aoyama.ac.jp> wrote:
>>> (John claimed that the email context required such a
>>> rule, but I did not bother to confirm that.)
>> Given dinosaur implementations such as sendmail, I can
>> understand the concern that some SMTP implementations may not
>> easily be upgradable to use domain names with more than 255
>> octets or labels with more than 63 octets. In that case, I
>> would have expected at least a security warning at
>> http://tools.ietf.org/html/rfc4952#section-9 (EAI is currently
>> written in terms of IDNA2003, and so there are no length
>> restrictions on U-labels).
> I obviously have not been explaining this very well. The
> problem is not "dinosaur implementations"
> but a combination of
> two things (which interact):
> (1) Late resolution of strings, possibly through APIs that
> resolve names in places that may not be the public DNS.
> Systems using those APIs may keep strings in UTF-8 until very
> late in the process, even passing the UTF-8 strings into the
> interface or converting them to ACE form just before calling the
> interface. Either way, because other systems have come to rely
> on the 63 octet limit, strings longer than 63 characters pose a
> risk of unexpected problems. The issues with this are better
> explained in draft-iab-idn-encoding-00.txt, which I would
> strongly encourage people in this WG to go read.
I have indeed read draft-iab-idn-encoding-00.txt (I sent comments to the
author and the IAB and copied this list). That document mentions the
length restrictions, as essentially the only restrictions in DNS itself,
rather than in things on top of it. That document also (well, mainly)
discusses the issue of names being handed down into APIs in various
forms (UTF-8, UTF-16, punycode, legacy encodings,...), and being
resolved by various mechanisms (DNS, NetBIOS, mDNS, hosts file,...), and
the problem that these mechanisms may use and expect different encodings
for non-ASCII characters.
However, I haven't found any mention, nor even a hint, in that document,
of a need to restrict punycode labels to less than 63 octets when
expressed in UTF-8.
The document mentions (as something that might happen, but shouldn't)
that an application may pass a UTF-8 string to something like
getaddrinfo, and that string may be passed directly to the DNS. First,
if this happens, IDNA has already lost. Second, whether the string is
UTF-8 or pure ASCII, if the API isn't prepared to handle labels longer
than 63 octets and overall names longer than 255 octets defensively
(i.e. return something like 'not found'), then the programmer should be
fired. Anyway, in that case, the problem isn't with UTF-8.
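The kind of defensive handling meant above can be sketched roughly as follows. This is only an illustration in Python, not code from any real resolver: the names fits_dns_limits and resolve_defensively, and the lookup callable, are hypothetical. The check works purely on octets, so it behaves the same whether the bytes happen to be ASCII or UTF-8.

```python
MAX_LABEL_OCTETS = 63   # RFC 1035 limit on a single label
MAX_NAME_OCTETS = 255   # RFC 1035 limit on a whole name in wire format


def fits_dns_limits(name: bytes) -> bool:
    """Return True if a dot-separated name fits the RFC 1035 length limits."""
    if name.endswith(b"."):
        name = name[:-1]                    # ignore one trailing root dot
    labels = name.split(b".")
    # Empty labels are rejected, as are labels longer than 63 octets.
    if any(not 1 <= len(label) <= MAX_LABEL_OCTETS for label in labels):
        return False
    # Wire format: one length octet per label plus the final zero octet.
    wire_length = sum(len(label) + 1 for label in labels) + 1
    return wire_length <= MAX_NAME_OCTETS


def resolve_defensively(name: bytes, lookup):
    """Call the (hypothetical) lookup only for names that fit the limits."""
    if not fits_dns_limits(name):
        return None                         # behave like 'not found'
    return lookup(name)
```

An over-long label then simply yields a 'not found'-style result instead of being handed to the DNS.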
What draft-iab-idn-encoding-00.txt essentially points out is that
different name resolution services use different encodings for non-ASCII
characters, and that currently different users (meaning applications) of
a name resolution API may assume different encodings for non-ASCII
characters, which creates all kinds of chances for errors. Some
heuristics may help in some cases, but the right solution (as with all
cases where characters, and in particular non-ASCII ones, are involved)
is to clearly say where which encoding is used. A very simple example
for this is GetAddrInfoW, which assumes UTF-16.
The only potential problem that I see from the discussion in
draft-iab-idn-encoding-00.txt is the following: Some labels containing
non-ASCII characters that fit into 63 octets in punycode and therefore
can be resolved with the DNS may not be resolvable with some other
resolution service because that service may use a different encoding
(and may or may not have different length limits).
I have absolutely nothing against some text in a Security Considerations
section or in Rationale pointing out that if you want to set up some
name or label for resolution via multiple different resolution services,
you have to take care that you choose your names and labels so that they
meet the length restrictions for all those services. But that doesn't
imply at all that we have to artificially restrict the length of
punycode labels by counting octets in UTF-8.
> (2) The "conversion of DNS name formats" issue that has been
> extensively discussed as part of the question of alternate label
> separators (sometimes described in our discussions as
> "dot-oids"). Applications that use domain names, including
> domain names that are not going to be resolved (or even looked
> up), must be able to freely and accurately convert between
> DNS-external (dot-separated labels) and DNS-internal
> (length-string pairs) formats _without_ knowing whether they are
> IDNs or not.
I'm not exactly sure what you mean here. If you want to say "without
checking whether they contain xn-- prefixes and punycode or not", then I
can agree, but that cannot motivate a UTF-8 based length restriction.
If you say that applications, rather than first converting U-label ->
A-label and then converting from dot-separated to length-string
notation, have to be able to first convert to length-string notation and
then convert U-labels to A-labels, then I contend that nobody in their
right mind would do it that way, and even less if "dot-oids" are
involved. For starters, U-labels don't have a fixed encoding.
> As discussed earlier, one of several reasons for
> that requirement is that, in non-IDNA-aware contexts, labels in
> non-IDNA-aware applications or contexts may be perfectly valid
> as far as the DNS is concerned, because the only restriction the
> DNS (and the normal label type) imposes is "octets".
If and where somebody has binary labels, of course these binary labels
must not be longer than 63 octets. But IDNA doesn't use binary labels,
and doesn't stuff UTF-8 into DNS protocol slots, so for IDNA, any length
restrictions on UTF-8 are irrelevant.
> length-string format has a hard limit of 63 characters that can
> be exceeded only if one can figure out how to get a larger
> number into six bits (see RFC1035, first paragraph of Section
> 3.1, and elsewhere).
I very well know that the 63 octets (not characters) limit is a hard
one. In the long run, one might imagine an extension to DNS that uses
another label format, without this limitation, but there is no need at
all to go there for this discussion.
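As a rough illustration of why the limit is hard, here is a minimal Python sketch of the conversion from dot-separated to length-prefixed form. The function name to_wire is made up for this example, and the input is assumed to be ASCII already (that is, A-labels in the IDNA case). The sketch checks only the per-label limit and leaves out the 255-octet limit on the whole name.

```python
def to_wire(name: str) -> bytes:
    """Convert an ASCII dot-separated name to length-prefixed labels."""
    if name.endswith("."):
        name = name[:-1]                    # ignore one trailing root dot
    out = bytearray()
    for label in name.split("."):
        raw = label.encode("ascii")         # non-ASCII input raises here
        # Only the low six bits of the length octet carry a label length,
        # so the largest value that can be written is 0x3F, i.e. 63.
        if not 1 <= len(raw) <= 0x3F:
            raise ValueError("label length must be 1..63 octets")
        out.append(len(raw))
        out += raw
    out.append(0)                           # zero-length root label ends the name
    return bytes(out)


wire = to_wire("www.example.com")
```

A 64-octet label has no representation at all in this format, which is why the check has to fail rather than truncate.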
> If we permit longer U-label strings on the
> theory that the only important restriction is on A-labels, we
> introduce new error states into the format conversion process.
For IDNA, only A-labels get sent through the DNS protocol, so the length
restrictions for labels are relevant only there. If somebody gets
this wrong in the format conversion process (we currently don't have any
reports on that), then that's their problem (and we can point it out in
a Security section or so).
> If this needs more explanation somewhere (possibly in
> Rationale), I'm happy to try to do that. But I think
> eliminating the restriction would cause far more problems than
> it is worth.
It hasn't caused ANY problems in IDNA2003. There is nothing new in
IDNA2008 that would motivate a change. *Running code*, one of the
guidelines of the IETF, shows that the restriction is unnecessary.
> I note that, while I haven't had time to respond, some of the
> discussion on the IRI list has included an argument that domain
> names in URIs cannot be restricted to A-label forms but must
> include %-escaped UTF-8 simply because those strings might not
> be public-DNS domain names but references to some other database
> or DNS environment.
It's not 'simply because'. It's first and foremost because of the
syntactic uniformity of URIs, and the fact that it's impossible to
identify all domain names in a URI (the usual slot after the '//' is
easy, scheme-specific processing (which is not what URIs and IRIs are
about) may be able to deal with some of 'mailto', but what do you do
about domain names in query parts?). Also, this syntax is part of RFC
3986, STD 66, a full IETF Standard.
Overall, it's just a question of what escaping convention should be
used. URIs have their specific escaping convention (%-encoding), and DNS
has its specific escaping convention (punycode).
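To make the two escaping conventions concrete, here is a small Python sketch applying both to the same label. It uses only the standard library's urllib.parse.quote and the built-in "punycode" codec; prepending "xn--" by hand is a simplification that skips the mapping and validation steps of a full IDNA conversion.

```python
from urllib.parse import quote

label = "bücher"

# URI convention: %-encode the octets of the UTF-8 form.
percent_encoded = quote(label)              # "b%C3%BCcher"

# DNS convention: punycode with the ACE prefix (simplified, no nameprep).
a_label = "xn--" + label.encode("punycode").decode("ascii")  # "xn--bcher-kva"
```

Both results are plain ASCII renderings of the same label; they differ only in which escaping convention was applied.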
Also please note that the IRI spec doesn't prohibit using punycode when
converting to URIs.
In addition, please note that at least my personal implementation
experience (adding IDN support to Amaya) shows that the overhead of
supporting %-encoding in domain names in URIs is minimal, and helps
streamline the implementation.
> It seems to me that one cannot have it
> both ways -- either the application knows whether a string is a
> public DNS reference that must conform _only_ to IDNA
> requirements (but then can be restricted to A-labels) or the
> application does not know and therefore must conform to DNS
> requirements for label lengths.
There is absolutely no need to restrict *all* references just because
*some of them* may use other resolver systems with other length
restrictions (which may be "63 octets per label when measured in UTF-8"
or something completely different). It would be very similar to saying
"Some compilers/linkers can only deal with identifiers 6 characters or
shorter, so all longer identifiers are prohibited."
> For our purposes, the only
> sensible way, at least IMO, to deal with this is to require
> conformance to both sets of rules, i.e., 63 character maximum
> for A-labels and 63 character maximum for U-labels.
As far as I understand punycode, it's impossible to encode a Unicode
character in less than one octet. This means that a maximum of 63
*characters* for U-labels is automatically guaranteed by a maximum of 63
characters/octets for A-labels.
However, Defs clearly says "length in octets of the UTF-8 form", so I
guess this was just a slip of your fingers.
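The difference between the three ways of counting can be seen with a small Python sketch. The helper label_lengths is hypothetical, and the A-label is built with the built-in "punycode" codec plus a hand-added "xn--" prefix, which is a simplification of real IDNA processing. The sample label, forty copies of "ü", is chosen only because it compresses well in punycode.

```python
def label_lengths(label: str):
    """Return (characters, UTF-8 octets, A-label octets) for one label."""
    a_label = "xn--" + label.encode("punycode").decode("ascii")
    return len(label), len(label.encode("utf-8")), len(a_label)


# Each "ü" takes two octets in UTF-8, so the UTF-8 form is 80 octets long,
# while the A-label stays within the 63-octet DNS limit.
chars, utf8_octets, a_label_octets = label_lengths("ü" * 40)
```

Such a label is fine for the DNS as an A-label, has fewer characters than its A-label has octets, and would nevertheless be rejected by a rule that counts octets of the UTF-8 form.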
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst at it.aoyama.ac.jp