Definitions limit on label length in UTF-8

Fri Sep 11 06:20:08 CEST 2009

Hello John, others,

Many thanks for your explanations. It seems that this issue hasn't 
really been thought through in enough detail. I think there are quite a 
few misunderstandings, see below.

On 2009/09/11 1:08, John C Klensin wrote:
>
> --On Thursday, September 10, 2009 11:31 -0400 Andrew Sullivan
> <ajs at shinkuro.com>  wrote:
>
>> On Thu, Sep 10, 2009 at 07:25:25PM +0900, "Martin J. Dürst"
>> wrote:
>>> There is at least one big issue in there, namely the issue of
>>> limiting  the length of labels by measuring their length in
>>> UTF-8. I very much  hope this issue can be fixed asap.
>> I think I understand your objection, but I'm surprised that
>> you think it is totally new.  I just looked, and the same
>> basic text is in the -00 draft of definitions, which appeared
>> in October of 2008.
>>
>> What length restriction would you prefer instead?  I suspect
>> the reason for the restriction is that a "domain name label
>> slot" in most applications is 63 octets long.

I would prefer no length restrictions. If we want, we can check what the 
maximum length of a U-Label is in certain encodings, and give that 
information as a help to implementers. My guess is that a rough bound is 
63*4=252 bytes for all of UTF-8, UTF-16, and UTF-32. For all these 
encodings, the maximum length of a single character is 4 bytes, 
although the likelihood that a character needs 4 bytes differs 
(100% for UTF-32, much less for the others). It is possible that there's 
a corner case that needs a few bytes more, or that there is some 
reasoning that allows this limit to be reduced by a few bytes, but 
for now, those are details.
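
As a rough illustration of these per-character sizes, here is a minimal 
Python sketch. The function name and the sample characters are 
illustrative assumptions, not taken from any draft. The "-le" codec 
variants are used so that no byte order mark is counted.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes needed for text in each encoding."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # no BOM
        "utf-32": len(text.encode("utf-32-le")),  # no BOM
    }


# Illustrative samples: ASCII, Greek, Hiragana, and a Plane 1 character.
for ch in ("a", "\u03b1", "\u3042", "\U00010300"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

Only UTF-32 spends 4 bytes on every character; UTF-8 and UTF-16 reach 
4 bytes only for characters outside the BMP.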

> FWIW, that was exactly the concern that motivated the text
> (which, I believe, was actually in Rationale before the text was
> pulled into Definitions-00).

That may well be. But was it ever in IDNA2003? I don't think so. And if 
it wasn't in IDNA2003, why did it suddenly get included in IDNA2008? 
"Concern" is a bad motivation if there's no clear justification for it. 
In particular, it's a bad motivation if all the current major browsers 
handle longer labels without problems.

> We are expecting applications to
> be able to switch freely back and forth between U-labels and
> A-labels in the same "slots" (or buffers, or whatever word one
> wants to use).

Where did you get that idea from? And why do you think these buffers 
will use UTF-8 only, and that all the other encodings are irrelevant? 
For your information, most browsers use UTF-16 internally, not UTF-8. 
While browsers aren't the only kind of software that is doing IDN 
lookups, they are certainly a good example of implementations. Also, the 
IDNA implementations that I know (idnkit in particular) use UTF-32 
internally because that's more straightforward for normalization and 
punycode calculations.

Also, whatever limit we set in a spec doesn't at all guarantee that the 
input doesn't contain longer labels. So having a fixed-size buffer with 
63 characters for a label (or 255 for a domain name) is a bad idea in 
the first place. In general, everybody working with fixed-size buffers 
(except maybe at the very lowest level, such as raw DNS record 
components, but even there, only in certain cases) is very prone to 
buffer overflow attacks, and hopefully has abandoned such bad practice 
by now.

In addition, needing a variable number (number of labels) of fixed 
length (max length of label) buffers isn't too much of a simplification. 
Also, the assumption that U-labels are converted to A-labels in place in 
the same slot will be wrong in most if not all cases. These days there 
is rarely an API where conversions overwrite the input. And from an 
application point of view, sooner or later you need both A-label and 
U-label, so it's better to keep both around anyway.
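
A minimal Python sketch of that approach, under stated assumptions: 
`Label` and `make_label` are hypothetical names, the conversion uses 
Python's built-in "punycode" codec directly, and all IDNA mapping and 
validation steps are skipped. The input is never overwritten, and the 
only length check is on the A-label.

```python
from dataclasses import dataclass

MAX_A_LABEL_OCTETS = 63  # DNS limit on a label


@dataclass(frozen=True)
class Label:
    """Keeps both forms of a label side by side."""
    u_label: str
    a_label: str


def make_label(u_label: str) -> Label:
    """Build both forms; the input string is left untouched.

    Assumption: no IDNA mapping or validation is done here.
    """
    if u_label.isascii():
        a_label = u_label
    else:
        a_label = "xn--" + u_label.encode("punycode").decode("ascii")
    if len(a_label) > MAX_A_LABEL_OCTETS:
        raise ValueError(f"A-label is {len(a_label)} octets, limit is 63")
    return Label(u_label=u_label, a_label=a_label)


label = make_label("\u3042")  # a single Hiragana letter
print(label.u_label, label.a_label)
```

Strings of any length are accepted as input and rejected with an error 
if the A-label is too long, so no fixed-size buffer is involved.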

> I haven't done the arithmetic, but I strongly
> suspect that, if one ended up with a label consisting of code
> points from plane 1 or above that were close together (those
> code points occupy four octets each in UTF-8), the compactness
> of Punycode encoding could result in a UTF-8 string that was
> longer than the ACE.

[Short summary: It's very easy to create UTF-8 strings that are longer 
than punycode, for everything except US-ASCII. Remember, punycode was 
*designed* to be efficient, in particular for domain name labels.]

Have you actually read my last call comment? I showed an example in 
Hiragana that was 58+4=62 octets in punycode but 123 octets in UTF-8 (82 
in UTF-16, 164 in UTF-32). Hiragana is in the BMP. So are Greek, 
Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana, NKo (all of them 
using 2 bytes per character in UTF-8, and quite a bit more compact than 
Hiragana, in particular if you look at the base alphabet in lower case), 
and Devanagari, ... (30 or more scripts, all of them taking 3 bytes per 
character in UTF-8, and likewise very compact).
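
The comparison can be reproduced with a short Python sketch. It assumes 
Python's built-in "punycode" codec and skips all IDNA mapping and 
validation; the helper names and the sample label are hypothetical and 
are not the example from the Last Call comment.

```python
ACE_PREFIX = "xn--"


def a_label(u_label: str) -> str:
    """ACE form of a label: prefix plus Punycode (no IDNA validation)."""
    if u_label.isascii():
        return u_label
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def compare(u_label: str) -> dict:
    """Length in characters, and in octets as A-label and as UTF-8."""
    return {
        "characters": len(u_label),
        "a_label_octets": len(a_label(u_label)),
        "utf8_octets": len(u_label.encode("utf-8")),
    }


# Hypothetical all-Hiragana label, 3 octets per character in UTF-8.
sample = "ひらがなだけでかかれたとてもながいらべるのれいです"
print(compare(sample))
```

Running `compare` on labels in other scripts shows how quickly the UTF-8 
length overtakes the A-label length as labels grow.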

So it's not some historic stuff in Plane 1 and up, it's virtually every 
living and widely used script. The 4-byte overhead of the prefix and the 
initialization of compression in punycode make it less likely that an 
A-label is shorter than a U-label for short labels, but as soon as 
labels get a bit longer, A-labels will be shorter than U-labels. That's 
certainly the case around 22 characters, where the restriction starts to 
kick in for characters that take 3 bytes in UTF-8, and also around 32 
characters for scripts that take 2 bytes in UTF-8.
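
The 22 and 32 character figures follow from dividing the 63-octet limit 
by the bytes per character. A small Python sketch of that arithmetic 
(the function name is an illustrative assumption, and it assumes every 
character in the label has the same UTF-8 width):

```python
LIMIT_OCTETS = 63


def first_count_over_limit(bytes_per_char: int, limit: int = LIMIT_OCTETS) -> int:
    """Smallest number of characters whose UTF-8 form exceeds the limit."""
    return limit // bytes_per_char + 1


for width in (1, 2, 3, 4):
    count = first_count_over_limit(width)
    print(f"{width} byte(s)/char: {count} chars = {count * width} octets")
```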

Even for Latin, it's not impossible to construct examples where UTF-8 
U-Labels will be longer than A-labels if you throw in enough non-ASCII 
characters. The main scripts where the chance of A-labels being shorter 
than U-labels in UTF-8 is smaller are Chinese (Hanzi), Korean (Hangul), 
Japanese (Kanji only or combined Kanji/Hiragana/Katakana), and other 
scripts that have a lot of characters.

> And that, in turn, could lead to all sorts
> of practical problems if we remove the dual test for 63 octets.

Why do you think so? Do you have any reports about such problems for 
IDNA2003? How many?

This is an important issue. I have earlier expressed my concern that 
with IDNA2008, we are trying to fix perceived problems in IDNA2003, but 
we have no idea how people will react to the 'new ways' of IDNA2008, 
simply because they only complain when they actually see and feel an 
issue; they have no experience reading a spec and imagining issues.

In the case here at hand, we seem to do worse than fix a known problem 
and potentially create new, unknown ones. We 'fix' something that isn't 
known to be a problem, and create something that will be perceived as a 
problem. I at least have no reports of problems on this point for 
IDNA2003 (and we have tests (http://www.w3.org/2001/08/iri-test/) that 
show that domain names with labels that measure to more than 63 octets 
when expressed in UTF-8 work fine in all major browsers; see my Last 
Call mail; the tests also pass on Google Chrome (Windows, version 2)). 
On the other hand, it's fairly clear that some people will be extremely 
annoyed if they suddenly get told that their domain name was okay in 
IDNA2003 but now isn't allowed anymore in IDNA2008 because it is too long.

People not familiar with the history of the development of IDNA2003 
should be aware of the fact that a lot of energy went into the 
development of compression algorithms for domain names, and that 
punycode won against several other contenders because it was clearly 
more efficient in particular for the kinds of strings typically expected 
in domain name labels. With this, punycode made sure that there was not 
too much of an imbalance for the maximum length of labels in terms of 
characters between US-ASCII and everything else. Although not 
necessarily expressed explicitly, there was very clearly the assumption 
that what counted was the length of (what we now call) A-labels, and 
that there were no byte-wise restrictions on the length of U-labels. The 
"max 63 octets in UTF-8" provision, unless removed, negates all this effort.

> I'm agnostic as to whether this needs to be explained better (in
> either Definitions or Rationale), but (speaking personally, not
> as editor) I would be extremely hesitant to change the
> restriction.

I would be extremely hesitant to INTRODUCE such a restriction that 
hasn't been in IDNA2003, was based on wrong or incomplete assumptions, 
is contradicted by implementation experience, and, as the above 
arguments show, doesn't seem to have been discussed in detail (or I 
guess at least John would remember).

It may be argued that it is too late to remove this restriction. But 
isn't the purpose of Last Call exactly to catch stuff that has been 
overlooked until now, and isn't this exactly such a case?

Regards,    Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst at it.aoyama.ac.jp