Definitions limit on label length in UTF-8

Fri Sep 11 10:37:19 CEST 2009

Hello Erik, others,

On 2009/09/11 14:08, Erik van der Poel wrote:
> John and I did in fact discuss this issue on Jan 2, 2008 on this
> mailing list.

Digging through the archives, these must be the three messages:
http://www.alvestrand.no/pipermail/idna-update/2008-January/000808.html
http://www.alvestrand.no/pipermail/idna-update/2008-January/000809.html
http://www.alvestrand.no/pipermail/idna-update/2008-January/000810.html

> Unfortunately, I did not pursue it after that, though I
> did point out that the 63/255/UTF-8 restrictions are simply incorrect
> in certain apps.

Yes indeed. At least presently, and there is no reason to change that.

> (John claimed that the email context required such a
> rule, but I did not bother to confirm that.)

Given dinosaur implementations such as sendmail, I can understand the 
concern that some SMTP implementations may not easily be upgradable to 
use domain names with more than 255 octets or labels with more than 63 
octets. In that case, I would have expected at least a security warning 
at http://tools.ietf.org/html/rfc4952#section-9 (EAI is currently 
written in terms of IDNA2003, and so there are no length restrictions on 
U-labels).

> Most HTML implementers would realize that the rule does not apply, and
> would simply ignore it.

It would be a really bad idea to specify something that we know will be 
ignored. It would be much better to point out that some protocols or 
implementations may have difficulties with such labels, but leave it to 
individual protocols to set further restrictions if that's really 
needed, or to warn about potential implementation problems.

> The 63/255 restriction applies to DNS (i.e.
> A-labels), not U-labels,

Yes indeed.
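
To make the distinction concrete, here is a minimal Python sketch that measures one label both as an A-label and as a U-label in several encodings. It assumes Python's built-in "punycode" codec and skips all IDNA mapping and validity checks; the helper names to_a_label and label_sizes, and the two sample labels, are made up for this illustration and come from no IDNA implementation.

```python
ACE_PREFIX = "xn--"  # the 4-octet prefix counted as "+4" in this thread


def to_a_label(u_label: str) -> str:
    """ACE prefix + Punycode for a non-ASCII label.

    Assumption: no IDNA mapping, normalization or validation is done here.
    """
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def label_sizes(u_label: str) -> dict:
    """Octet counts of one label in its A-label form and in three encodings."""
    return {
        "code points": len(u_label),
        "A-label": len(to_a_label(u_label)),
        "UTF-8": len(u_label.encode("utf-8")),
        "UTF-16": len(u_label.encode("utf-16-le")),  # -le: no byte order mark
        "UTF-32": len(u_label.encode("utf-32-le")),  # -le: no byte order mark
    }


# One short and one longer Hiragana label (arbitrary samples).
for label in ("みんな", "いろはにほへとちりぬるをわかよたれそつねならむ"):
    print(label, label_sizes(label))
```

The DNS limit of 63 octets would be checked against the "A-label" figure; the other figures only describe how much memory a given internal representation happens to need.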

> and certainly not
> IRIs/LEIRIs/HREFs/whatever-we-call-them-this-year.

They have been called IRIs for quite a while. LEIRI and HREF are terms 
for slight variants of IRIs that sometimes turn up in practice (in some 
sense similar to the -obsolete productions in some specs, if you will).

Regards,    Martin.

> Erik
>
> On Thu, Sep 10, 2009 at 9:20 PM, "Martin J. Dürst"
> <duerst at it.aoyama.ac.jp>  wrote:
>> Hello John, others,
>>
>> Many thanks for your explanations. It seems that this issue hasn't
>> really been thought through in enough detail. I think there are quite a
>> few misunderstandings, see below.
>>
>> On 2009/09/11 1:08, John C Klensin wrote:
>>> --On Thursday, September 10, 2009 11:31 -0400 Andrew Sullivan
>>> <ajs at shinkuro.com>    wrote:
>>>
>>>> On Thu, Sep 10, 2009 at 07:25:25PM +0900, "Martin J. Dürst"
>>>> wrote:
>>>>> There is at least one big issue in there, namely the issue of
>>>>> limiting the length of labels by measuring their length in
>>>>> UTF-8. I very much hope this issue can be fixed asap.
>>>> I think I understand your objection, but I'm surprised that
>>>> you think it is totally new.  I just looked, and the same
>>>> basic text is in the -00 draft of definitions, which appeared
>>>> in October of 2008.
>>>>
>>>> What length restriction would you prefer instead?  I suspect
>>>> the reason for the restriction is that a "domain name label
>>>> slot" in most applications is 63 octets long.
>> I would prefer no length restrictions. If we want, we can check what the
>> maximum length of a U-Label is in certain encodings, and give that
>> information as a help to implementers. My guess is that a rough bound is
>> 63*4=252 bytes for all of UTF-8, UTF-16, and UTF-32. For all these
>> encodings, the maximum length of a single character in bytes is 4 bytes,
>> although the probability of a character reaching 4 bytes is different
>> (100% for UTF-32, much less for the others). It is possible that there's
>> a corner case that needs a few bytes more, or that there is some
>> reasoning that allows this limit to be reduced by a few bytes, but
>> for now, those are details.
>>
>>
>>> FWIW, that was exactly the concern that motivated the text
>>> (which, I believe, was actually in Rationale before the text was
>>> pulled into Definitions-00).
>> That may well be. But was it ever in IDNA2003? I don't think so. And if
>> it wasn't in IDNA2003, why did it suddenly get included in IDNA2008?
>> "Concern" is a bad motivation if there's no clear justification for it.
>> In particular, it's a bad motivation if all the current major browsers
>> handle longer labels without problems.
>>
>>
>>> We are expecting applications to
>>> be able to switch freely back and forth between U-labels and
>>> A-labels in the same "slots" (or buffers, or whatever word one
>>> wants to use).
>> Where did you get that idea from? And why do you think these buffers
>> will use UTF-8 only, and that all the other encodings are irrelevant?
>> For your information, most browsers use UTF-16 internally, not UTF-8.
>> While browsers aren't the only kind of software that is doing IDN
>> lookups, they are certainly a good example of implementations. Also, the
>> IDNA implementations that I know (idnkit in particular) use UTF-32
>> internally because that's more straightforward for normalization and
>> punycode calculations.
>>
>> Also, whatever limit we set in a spec doesn't at all guarantee that the
>> input doesn't contain longer labels. So having a fixed-size buffer with
>> 63 characters for a label (or 255 for a domain name) is a bad idea in
>> the first place. In general, everybody working with fixed-size buffers
>> (except maybe at the very lowest level, such as raw DNS record
>> components, but even there, only in certain cases) is very prone to
>> buffer overflow attacks, and hopefully has abandoned such bad practice
>> by now.
>>
>> In addition, needing a variable number (number of labels) of fixed
>> length (max length of label) buffers isn't too much of a simplification.
>> Also, the assumption that U-labels are converted to A-labels in place in
>> the same slot will be wrong in most if not all cases. These days, APIs
>> where conversions overwrite the input are rare. And from an
>> application point of view, sooner or later you need both A-label and
>> U-label, so it's better to keep both around anyway.
>>
>>
>>> I haven't done the arithmetic, but I strongly
>>> suspect that, if one ended up with a label consisting of code
>>> points from plane 1 or above that were close together (those
>>> code points occupy four octets each in UTF-8), the compactness
>>> of Punycode encoding could result in a UTF-8 string that was
>>> longer than the ACE.
>> [Short summary: It's very easy to create UTF-8 strings that are longer
>> than punycode, for everything except US-ASCII. Remember, punycode was
>> *designed* to be efficient, in particular for domain name labels.]
>>
>> Have you actually read my last call comment? I showed an example in
>> Hiragana that was 58+4=62 octets in punycode but 123 octets in UTF-8 (82
>> in UTF-16, 164 in UTF-32). Hiragana is in the BMP. So are Greek,
>> Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana, NKo (all of them
>> using 2 bytes per character in UTF-8, and quite a bit more compact than
>> Hiragana, in particular if you look at the base alphabet in lower case),
>> and Devanagari, ... (30 or more scripts, all of them taking 3 bytes per
>> character in UTF-8, and likewise very compact).
>>
>> So it's not some historic stuff in Plane 1 and up, it's virtually every
>> living and widely used script. The 4-byte overhead of the prefix and the
>> initialization of compression in punycode make it less likely that an
>> A-label is shorter than a U-label for short labels, but as soon as
>> labels get a bit longer, A-labels will be shorter than U-labels. That's
>> certainly the case around 22 characters, where the restriction starts to
>> kick in for characters that take 3 bytes in UTF-8, and also around 32
>> characters for scripts that take 2 bytes in UTF-8.
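
The figures of about 22 and about 32 characters follow from dividing the 63-octet limit by the UTF-8 size of one character. A small Python sketch of that arithmetic, using only the standard library; the function names and the three sample characters are chosen for illustration only:

```python
LABEL_LIMIT_OCTETS = 63  # the DNS label limit, applied here to the UTF-8 form


def utf8_octets_per_char(ch: str) -> int:
    """Number of octets a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def first_length_over_limit(ch: str, limit: int = LABEL_LIMIT_OCTETS) -> int:
    """Smallest number of repetitions of `ch` whose UTF-8 form exceeds `limit`."""
    return limit // utf8_octets_per_char(ch) + 1


# Sample characters (assumptions for this sketch), one per UTF-8 size class.
SAMPLES = {
    "Greek alpha, 2 octets": "\u03b1",
    "Hiragana a, 3 octets": "\u3042",
    "Gothic ahsa (plane 1), 4 octets": "\U00010330",
}

for name, ch in SAMPLES.items():
    print(f"{name}: over the limit from {first_length_over_limit(ch)} characters")
```

This gives 32 characters for the 2-octet case, 22 for the 3-octet case, and 16 for the 4-octet case. It says nothing about how long the corresponding A-label is; that depends on the actual characters in the label.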
>>
>> Even for Latin, it's not impossible to construct examples where UTF-8
>> U-Labels will be longer than A-labels if you throw in enough non-ASCII
>> characters. The main scripts where the chance of A-labels being shorter
>> than U-labels in UTF-8 is smaller are Chinese (Hanzi), Korean (Hangul),
>> Japanese (Kanji only or combined Kanji/Hiragana/Katakana), and other
>> scripts that have a lot of characters.
>>
>>
>>> And that, in turn, could lead to all sorts
>>> of practical problems if we remove the dual test for 63 octets.
>> Why do you think so? Do you have any reports about such problems for
>> IDNA2003? How many?
>>
>> This is an important issue. I have earlier expressed my concern that
>> with IDNA2008, we are trying to fix perceived problems in IDNA2003, but
>> we have no idea how people will react to the 'new ways' of IDNA2008,
>> simply because they only complain when they actually see and feel an
>> issue; they have no experience reading a spec and imagining issues.
>>
>> In the case here at hand, we seem to do worse than fix a known problem
>> and potentially create new, unknown ones. We 'fix' something that isn't
>> known to be a problem, and create something that will be perceived as a
>> problem. I at least have no reports of problems on this point for
>> IDNA2003 (and we have tests (http://www.w3.org/2001/08/iri-test/) that
>> show that domain names with labels that measure to more than 63 octets
>> when expressed in UTF-8 work fine in all major browsers; see my Last
>> Call mail; the tests also pass on Google Chrome (Windows, version 2)).
>> On the other hand, it's fairly clear that some people will be extremely
>> annoyed if they suddenly get told that their domain name was okay in
>> IDNA2003 but now isn't allowed anymore in IDNA2008 because it is too long.
>>
>> People not familiar with the history of the development of IDNA2003
>> should be aware of the fact that a lot of energy went into the
>> development of compression algorithms for domain names, and that
>> punycode won against several other contenders because it was clearly
>> more efficient in particular for the kinds of strings typically expected
>> in domain name labels. With this, punycode made sure that there was not
>> too much of an imbalance for the maximum length of labels in terms of
>> characters between US-ASCII and everything else. Although not
>> necessarily expressed explicitly, there was very clearly the assumption
>> that what counted was the length of (what we now call) A-labels, and
>> that there were no byte-wise restrictions on the length of U-labels. The
>> "max 63 octets in UTF-8" provision, unless removed, negates all this effort.
>>
>>
>>> I'm agnostic as to whether this needs to be explained better (in
>>> either Definitions or Rationale), but (speaking personally, not
>>> as editor) I would be extremely hesitant to change the
>>> restriction.
>> I would be extremely hesitant to INTRODUCE such a restriction that
>> hasn't been in IDNA2003, was based on wrong or incomplete assumptions,
>> is contradicted by implementation experience, and, as the above
>> arguments show, doesn't seem to have been discussed in detail (or I
>> guess at least John would remember).
>>
>>
>> It may be argued that it is too late to remove this restriction. But
>> isn't the purpose of Last Call exactly to catch stuff that has been
>> overlooked until now, and this is exactly such a case?
>>
>>
>> Regards,    Martin.
>>
>> --
>> #-# Martin J. Dürst, Professor, Aoyama Gakuin University
>> #-# http://www.sw.it.aoyama.ac.jp   mailto:duerst at it.aoyama.ac.jp

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst at it.aoyama.ac.jp