Definitions limit on label length in UTF-8

Erik van der Poel erikv at google.com
Fri Sep 11 07:08:53 CEST 2009


John and I did in fact discuss this issue on Jan 2, 2008 on this
mailing list. Unfortunately, I did not pursue it after that, though I
did point out that the 63/255/UTF-8 restrictions are simply incorrect
in certain apps. (John claimed that the email context required such a
rule, but I did not bother to confirm that.)

Most HTML implementers would realize that the rule does not apply, and
would simply ignore it. The 63/255 restriction applies to DNS (i.e.
A-labels), not U-labels, and certainly not
IRIs/LEIRIs/HREFs/whatever-we-call-them-this-year.
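To illustrate where the 63-octet check actually belongs, here is a rough
Python sketch (it ignores pure-ASCII labels and the IDNA mapping steps,
and the helper name is made up for illustration):

  def fits_in_dns_label_slot(u_label):
      # The DNS limit applies to the ACE form that goes on the wire,
      # not to the UTF-8 form of the U-label.
      a_label = b"xn--" + u_label.encode("punycode")
      return len(a_label) <= 63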

Erik

On Thu, Sep 10, 2009 at 9:20 PM, "Martin J. Dürst"
<duerst at it.aoyama.ac.jp> wrote:
> Hello John, others,
>
> Many thanks for your explanations. It seems that this issue hasn't
> really been thought through in enough detail. I think there are quite a
> few misunderstandings, see below.
>
> On 2009/09/11 1:08, John C Klensin wrote:
>>
>> --On Thursday, September 10, 2009 11:31 -0400 Andrew Sullivan
>> <ajs at shinkuro.com>  wrote:
>>
>>> On Thu, Sep 10, 2009 at 07:25:25PM +0900, "Martin J. Dürst"
>>> wrote:
>>>> There is at least one big issue in there, namely the issue of
>>>> limiting the length of labels by measuring their length in
>>>> UTF-8. I very much hope this issue can be fixed asap.
>>> I think I understand your objection, but I'm surprised that
>>> you think it is totally new.  I just looked, and the same
>>> basic text is in the -00 draft of definitions, which appeared
>>> in October of 2008.
>>>
>>> What length restriction would you prefer instead?  I suspect
>>> the reason for the restriction is that a "domain name label
>>> slot" in most applications is 63 octets long.
>
> I would prefer no length restrictions. If we want, we can check what the
> maximum length of a U-label is in certain encodings, and give that
> information as a help to implementers. My guess is that a rough bound is
> 63*4=252 bytes for all of UTF-8, UTF-16, and UTF-32. In all these
> encodings, the maximum length of a single character is 4 bytes, although
> the probability of a character actually reaching 4 bytes differs (100%
> for UTF-32, much less for the others). It is possible that there's a
> corner case that needs a few bytes more, or that there is some reasoning
> that allows this limit to be reduced by a few bytes, but for now, those
> are details.
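>
> To make the arithmetic explicit, a back-of-the-envelope sketch in Python
> (not a proposal for spec text, it just restates the bound above):
>
>   MAX_ACE_LABEL_OCTETS = 63       # DNS limit, applies to the A-label
>   # every code point of a U-label costs at least one octet in the ACE
>   # form, so a valid label has at most 63 code points
>   MAX_OCTETS_PER_CODE_POINT = 4   # worst case in UTF-8, UTF-16, UTF-32
>   print(MAX_ACE_LABEL_OCTETS * MAX_OCTETS_PER_CODE_POINT)   # 252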
>
>
>> FWIW, that was exactly the concern that motivated the text
>> (which, I believe, was actually in Rationale before the text was
>> pulled into Definitions-00).
>
> That may well be. But was it ever in IDNA2003? I don't think so. And if
> it wasn't in IDNA2003, why did it suddenly get included in IDNA2008?
> "Concern" is a bad motivation if there's no clear justification for it.
> In particular, it's a bad motivation if all the current major browsers
> handle longer labels without problems.
>
>
>> We are expecting applications to
>> be able to switch freely back and forth between U-labels and
>> A-labels in the same "slots" (or buffers, or whatever word one
>> wants to use).
>
> Where did you get that idea from? And why do you think these buffers
> will use UTF-8 only, and that all the other encodings are irrelevant?
> For your information, most browsers use UTF-16 internally, not UTF-8.
> While browsers aren't the only kind of software that is doing IDN
> lookups, they are certainly a good example of implementations. Also, the
> IDNA implementations that I know (idnkit in particular) use UTF-32
> internally because that's more straightforward for normalization and
> punycode calculations.
>
> Also, whatever limit we set in a spec doesn't at all guarantee that the
> input doesn't contain longer labels. So having a fixed-size buffer of
> 63 characters for a label (or 255 for a domain name) is a bad idea in
> the first place. In general, code working with fixed-size buffers
> (except maybe at the very lowest level, such as raw DNS record
> components, and even there only in certain cases) is very prone to
> buffer overflow attacks, and hopefully everybody has abandoned such bad
> practice by now.
>
> In addition, needing a variable number (the number of labels) of
> fixed-length (the maximum label length) buffers isn't much of a
> simplification. Also, the assumption that U-labels are converted to
> A-labels in place, in the same slot, will be wrong in most if not all
> cases. These days there is rarely an API where a conversion overwrites
> its input. And from an application point of view, sooner or later you
> need both the A-label and the U-label, so it's better to keep both
> around anyway.
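>
> A minimal sketch of what I mean, using Python's built-in "idna" codec
> (which implements IDNA2003 ToASCII) purely for illustration:
>
>   u_name = "www.\u4f8b\u3048.jp"     # U-labels, kept as given
>   a_name = u_name.encode("idna")     # conversion returns new bytes
>   # the application simply keeps both variable-length strings around;
>   # nothing is overwritten in place in a fixed 63-octet slot
>   host = {"u_form": u_name, "a_form": a_name}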
>
>
>> I haven't done the arithmetic, but I strongly
>> suspect that, if one ended up with a label consisting of code
>> points from plane 1 or above that were close together (those
>> code points occupy four octets each in UTF-8), the compactness
>> of Punycode encoding could result in a UTF-8 string that was
>> longer than the ACE.
>
> [Short summary: It's very easy to create UTF-8 strings that are longer
> than punycode, for everything except US-ASCII. Remember, punycode was
> *designed* to be efficient, in particular for domain name labels.]
>
> Have you actually read my last call comment? I showed an example in
> Hiragana that was 58+4=62 octets in punycode but 123 octets in UTF-8 (82
> in UTF-16, 164 in UTF-32). Hiragana is in the BMP. So are Greek,
> Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana, NKo (all of them
> using 2 bytes per character in UTF-8, and quite a bit more compact than
> Hiragana, in particular if you look at the base alphabet in lower case),
> and Devanagari, ... (30 or more scripts, all of them taking 3 bytes per
> character in UTF-8, and likewise very compact).
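>
> For anyone who wants to check such numbers themselves, here is a rough
> Python sketch (the label below is just an illustrative 41-character
> Hiragana string, not the exact label from my Last Call mail; the
> punycode length varies a little with the actual characters, the UTF-*
> byte counts do not):
>
>   label = "\u3042\u3044\u3046\u3048\u304a" * 8 + "\u3042"   # 41 chars
>   print(len(label.encode("utf-8")))       # 41 * 3 = 123 octets
>   print(len(label.encode("utf-16-le")))   # 41 * 2 =  82 octets
>   print(len(label.encode("utf-32-le")))   # 41 * 4 = 164 octets
>   a_label = b"xn--" + label.encode("punycode")
>   print(len(a_label))                     # well below the UTF-8 length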
>
> So it's not some historic stuff in Plane 1 and up, it's virtually every
> living and widely used script. The 4-byte overhead of the prefix and the
> initialization of the compression in punycode make it less likely that
> an A-label is shorter than a U-label for short labels, but as soon as
> labels get a bit longer, A-labels will be shorter than U-labels. That's
> certainly the case around 22 characters, where the restriction starts to
> kick in for characters that take 3 bytes in UTF-8, and around 32
> characters for scripts that take 2 bytes in UTF-8.
>
> Even for Latin, it's not impossible to construct examples where UTF-8
> U-Labels will be longer than A-labels if you throw in enough non-ASCII
> characters. The main scripts where the chance of A-labels being shorter
> than U-labels in UTF-8 is smaller are Chinese (Hanzi), Korean (Hangul),
> Japanese (Kanji only or combined Kanji/Hiragana/Katakana), and other
> scripts that have a lot of characters.
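>
> As a trivial Python check of where the proposed octet limit would start
> to bite (this just restates the 22/32 figures above):
>
>   def first_length_over_63_utf8(char):
>       # smallest label length, in characters, at which the UTF-8 form
>       # of a label made of such characters exceeds 63 octets
>       return 63 // len(char.encode("utf-8")) + 1
>
>   print(first_length_over_63_utf8("\u3042"))  # Hiragana A, 3 bytes -> 22
>   print(first_length_over_63_utf8("\u03b1"))  # Greek alpha, 2 bytes -> 32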
>
>
>> And that, in turn, could lead to all sorts
>> of practical problems if we remove the dual test for 63 octets.
>
> Why do you think so? Do you have any reports about such problems for
> IDNA2003? How many?
>
> This is an important issue. I have earlier expressed my concern that
> with IDNA2008, we are trying to fix perceived problems in IDNA2003, but
> we have no idea how people will react to the 'new ways' of IDNA2008,
> simply because they only complain when they actually see and feel an
> issue; they are not in the habit of reading a spec and imagining issues.
>
> In the case at hand, we seem to do worse than fixing a known problem
> while potentially creating new, unknown ones: we 'fix' something that
> isn't known to be a problem, and create something that will be perceived
> as a problem. I at least have no reports of problems on this point for
> IDNA2003 (and we have tests (http://www.w3.org/2001/08/iri-test/) that
> show that domain names with labels that measure more than 63 octets
> when expressed in UTF-8 work fine in all major browsers; see my Last
> Call mail; the tests also pass on Google Chrome (Windows, version 2)).
> On the other hand, it's fairly clear that some people will be extremely
> annoyed if they suddenly get told that their domain name was okay in
> IDNA2003 but now isn't allowed anymore in IDNA2008 because it is too long.
>
> People not familiar with the history of the development of IDNA2003
> should be aware that a lot of energy went into the development of
> compression algorithms for domain names, and that punycode won against
> several other contenders because it was clearly more efficient, in
> particular for the kinds of strings typically expected in domain name
> labels. With this, punycode made sure that there was not too much of an
> imbalance between US-ASCII and everything else in the maximum length of
> labels measured in characters. Although not necessarily expressed
> explicitly, there was very clearly the assumption that what counted was
> the length of (what we now call) A-labels, and that there were no
> byte-wise restrictions on the length of U-labels. The "max 63 octets in
> UTF-8" provision, unless removed, negates all this effort.
>
>
>> I'm agnostic as to whether this needs to be explained better (in
>> either Definitions or Rationale), but (speaking personally, not
>> as editor) I would be extremely hesitant to change the
>> restriction.
>
> I would be extremely hesitant to INTRODUCE such a restriction: it wasn't
> in IDNA2003, it is based on wrong or incomplete assumptions, it is
> contradicted by implementation experience, and, as the above arguments
> show, it doesn't seem to have been discussed in detail (otherwise I
> guess at least John would remember).
>
>
> It may be argued that it is too late to remove this restriction. But
> isn't the purpose of Last Call exactly to catch things that have been
> overlooked until now? This is exactly such a case.
>
>
> Regards,    Martin.
>
> --
> #-# Martin J. Dürst, Professor, Aoyama Gakuin University
> #-# http://www.sw.it.aoyama.ac.jp   mailto:duerst at it.aoyama.ac.jp
> _______________________________________________
> Idna-update mailing list
> Idna-update at alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/idna-update
>

