Definitions limit on label length in UTF-8
"Martin J. Dürst"
duerst at it.aoyama.ac.jp
Fri Sep 11 06:20:08 CEST 2009
Hello John, others,
Many thanks for your explanations. It seems that this issue hasn't
really been thought through in enough detail. I think there are quite a
few misunderstandings, see below.
On 2009/09/11 1:08, John C Klensin wrote:
>
> --On Thursday, September 10, 2009 11:31 -0400 Andrew Sullivan
> <ajs at shinkuro.com> wrote:
>
>> On Thu, Sep 10, 2009 at 07:25:25PM +0900, "Martin J. Dürst"
>> wrote:
>>> There is at least one big issue in there, namely the issue of
>>> limiting the length of labels by measuring their length in
>>> UTF-8. I very much hope this issue can be fixed asap.
>> I think I understand your objection, but I'm surprised that
>> you think it is totally new. I just looked, and the same
>> basic text is in the -00 draft of definitions, which appeared
>> in October of 2008.
>>
>> What length restriction would you prefer instead? I suspect
>> the reason for the restriction is that a "domain name label
>> slot" in most applications is 63 octets long.
I would prefer no length restrictions. If we want, we can check what the
maximum length of a U-Label is in certain encodings, and give that
information as a help to implementers. My guess is that a rough bound is
63*4=252 bytes for all of UTF-8, UTF-16, and UTF-32. For all these
encodings, the maximum length of a single character is 4 bytes,
although the probability that a character actually takes 4 bytes differs
(100% for UTF-32, much less for the others). It is possible that there's
a corner case that needs a few bytes more, or that there is some
reasoning that allows this limit to be reduced by a few bytes, but
for now, those are details.
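To make the rough bound concrete, here is a minimal Python sketch. The names
MAX_CHARS, MAX_BYTES_PER_CHAR and encoded_sizes are made up for this
illustration, and the 63-character cap is the same generous assumption as in
the estimate above, not a value taken from any of the drafts.

```python
# Rough upper bound on the size of a U-label, assuming at most 63
# characters per label and at most 4 bytes per character.
MAX_CHARS = 63          # assumption: generous cap on characters per label
MAX_BYTES_PER_CHAR = 4  # holds for UTF-8, UTF-16 and UTF-32


def encoded_sizes(label: str) -> dict:
    """Return the byte length of `label` in each encoding (no BOM)."""
    return {
        "utf-8": len(label.encode("utf-8")),
        "utf-16": len(label.encode("utf-16-le")),
        "utf-32": len(label.encode("utf-32-le")),
    }


rough_bound = MAX_CHARS * MAX_BYTES_PER_CHAR
print(rough_bound)  # 252

# One character per UTF-8 width class: ASCII, Greek, Hiragana, Plane 1.
for ch in ["a", "\u03b1", "\u3042", "\U00010330"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

Only the Plane 1 character needs 4 bytes in UTF-8 and UTF-16, whereas every
character needs 4 bytes in UTF-32.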
> FWIW, that was exactly the concern that motivated the text
> (which, I believe, was actually in Rationale before the text was
> pulled into Definitions-00).
That may well be. But was it ever in IDNA2003? I don't think so. And if
it wasn't in IDNA2003, why did it suddenly get included in IDNA2008?
"Concern" is a bad motivation if there's no clear justification for it.
In particular, it's a bad motivation if all the current major browsers
handle longer labels without problems.
> We are expecting applications to
> be able to switch freely back and forth between U-labels and
> A-labels in the same "slots" (or buffers, or whatever word one
> wants to use).
Where did you get that idea from? And why do you think these buffers
will use UTF-8 only, and that all the other encodings are irrelevant?
For your information, most browsers use UTF-16 internally, not UTF-8.
While browsers aren't the only kind of software that is doing IDN
lookups, they are certainly a good example of implementations. Also, the
IDNA implementations that I know (idnkit in particular) use UTF-32
internally because that's more straightforward for normalization and
punycode calculations.
Also, whatever limit we set in a spec doesn't at all guarantee that the
input doesn't contain longer labels. So having a fixed-size buffer with
63 characters for a label (or 255 for a domain name) is a bad idea in
the first place. In general, everybody working with fixed-size buffers
(except maybe at the very lowest level, such as raw DNS record
components, but even there, only in certain cases) is very prone to
buffer overflow attacks, and hopefully has abandoned such bad practice
by now.
In addition, needing a variable number (number of labels) of fixed
length (max length of label) buffers isn't too much of a simplification.
Also, the assumption that U-labels are converted to A-labels in place in
the same slot will be wrong in most if not all cases. These days there is
rarely an API where conversions overwrite the input. And from an
application point of view, sooner or later you need both A-label and
U-label, so it's better to keep both around anyway.
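As an illustration of the last two points, here is a small Python sketch of a
conversion that returns a new value instead of overwriting its input, keeps
both forms around, and checks the length where the DNS actually imposes it,
namely on the A-label. The names Label, to_label and MAX_A_LABEL_OCTETS are
hypothetical, and the conversion is a simplification: it uses Python's
built-in "punycode" codec directly and skips all IDNA mapping and validation.

```python
from dataclasses import dataclass

MAX_A_LABEL_OCTETS = 63  # DNS limit on a label in its ASCII form


@dataclass(frozen=True)
class Label:
    """Holds both forms of one label; neither overwrites the other."""
    u_label: str
    a_label: str


def to_label(u_label: str) -> Label:
    """Convert without touching the input; reject labels too long for DNS.

    Simplification: raw Punycode only, no IDNA mapping or validation.
    """
    if u_label.isascii():
        a_label = u_label
    else:
        a_label = "xn--" + u_label.encode("punycode").decode("ascii")
    # Check the actual length instead of trusting a fixed-size buffer.
    if len(a_label) > MAX_A_LABEL_OCTETS:
        raise ValueError(
            f"A-label is {len(a_label)} octets, limit is {MAX_A_LABEL_OCTETS}"
        )
    return Label(u_label=u_label, a_label=a_label)


label = to_label("b\u00fccher")
print(label.u_label, label.a_label)
```

Nothing in this sketch depends on how many bytes the U-label takes in any
particular encoding.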
> I haven't done the arithmetic, but I strongly
> suspect that, if one ended up with a label consisting of code
> points from plane 1 or above that were close together (those
> code points occupy four octets each in UTF-8), the compactness
> of Punycode encoding could result in a UTF-8 string that was
> longer than the ACE.
[Short summary: It's very easy to create UTF-8 strings that are longer
than punycode, for everything except US-ASCII. Remember, punycode was
*designed* to be efficient, in particular for domain name labels.]
Have you actually read my last call comment? I showed an example in
Hiragana that was 58+4=62 octets in punycode but 123 octets in UTF-8 (82
in UTF-16, 164 in UTF-32). Hiragana is in the BMP. So are Greek,
Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana, NKo (all of them
using 2 bytes per character in UTF-8, and quite a bit more compact than
Hiragana, in particular if you look at the base alphabet in lower case),
and Devanagari, ... (30 or more scripts, all of them taking 3 bytes per
character in UTF-8, and likewise very compact).
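The size comparison is easy to reproduce. The following Python sketch uses a
made-up 41-character Hiragana label (not the one from my Last Call comment)
and the built-in "punycode" codec; ace_length is a hypothetical helper that
only measures size and does no IDNA mapping or validation.

```python
# Hypothetical label: "aiueo" in Hiragana repeated 8 times, plus "ka",
# for 41 characters in total.
label = "\u3042\u3044\u3046\u3048\u304a" * 8 + "\u304b"


def ace_length(u_label: str) -> int:
    """Length in octets of "xn--" plus the raw Punycode of `u_label`."""
    return len("xn--") + len(u_label.encode("punycode"))


sizes = {
    "ace": ace_length(label),
    "utf-8": len(label.encode("utf-8")),         # 3 bytes per character
    "utf-16": len(label.encode("utf-16-le")),    # 2 bytes per character
    "utf-32": len(label.encode("utf-32-le")),    # 4 bytes per character
}
print(len(label), sizes)
```

For this label the UTF-8, UTF-16 and UTF-32 sizes are 123, 82 and 164 octets,
and the ACE form stays within 63 octets.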
So it's not some historic stuff in Plane 1 and up, it's virtually every
living and widely used script. The 4-byte overhead of the prefix and the
initialization of compression in punycode make it less likely for an
A-label to be shorter than a U-label when labels are short, but as soon as
labels get a bit longer, A-labels will be shorter than U-labels. That's
certainly the case around 22 characters, where the restriction starts to
kick in for characters that take 3 bytes in UTF-8, and also around 32
characters for scripts that take 2 bytes in UTF-8.
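The character counts above follow from simple integer division, sketched
here in Python. LIMIT and max_chars_under_limit are names invented for this
illustration, and the sketch assumes every character in the label has the
same UTF-8 width.

```python
LIMIT = 63  # octets, the proposed cap on a U-label measured in UTF-8


def max_chars_under_limit(bytes_per_char: int, limit: int = LIMIT) -> int:
    """Largest number of equal-width characters that fit in `limit` octets."""
    return limit // bytes_per_char


for width in (1, 2, 3, 4):
    n = max_chars_under_limit(width)
    print(f"{width} byte(s)/char: {n} chars fit, char {n + 1} exceeds the limit")
```

So a label in a 3-byte script is rejected from its 22nd character on, and a
label in a 2-byte script from its 32nd, regardless of how short the
corresponding A-label is.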
Even for Latin, it's not impossible to construct examples where UTF-8
U-Labels will be longer than A-labels if you throw in enough non-ASCII
characters. The main scripts where the chance of A-labels being shorter
than U-labels in UTF-8 is smaller are Chinese (Hanzi), Korean (Hangul),
Japanese (Kanji only or combined Kanji/Hiragana/Katakana), and other
scripts that have a lot of characters.
> And that, in turn, could lead to all sorts
> of practical problems if we remove the dual test for 63 octets.
Why do you think so? Do you have any reports about such problems for
IDNA2003? How many?
This is an important issue. I have earlier expressed my concern that
with IDNA2008, we are trying to fix perceived problems in IDNA2003, but
we have no idea how people will react to the 'new ways' of IDNA2008,
simply because they only complain when they actually see and feel an
issue; they are not used to reading a spec and imagining issues.
In the case here at hand, we seem to do worse than fix a known problem
and potentially create new, unknown ones. We 'fix' something that isn't
known to be a problem, and create something that will be perceived as a
problem. I at least have no reports of problems on this point for
IDNA2003 (and we have tests (http://www.w3.org/2001/08/iri-test/) that
show that domain names with labels that measure to more than 63 octets
when expressed in UTF-8 work fine in all major browsers; see my Last
Call mail; the tests also pass on Google Chrome (Windows, version 2)).
On the other hand, it's fairly clear that some people will be extremely
annoyed if they suddenly get told that their domain name was okay in
IDNA2003 but now isn't allowed anymore in IDNA2008 because it is too long.
People not familiar with the history of the development of IDNA2003
should be aware of the fact that a lot of energy went into the
development of compression algorithms for domain names, and that
punycode won against several other contenders because it was clearly
more efficient in particular for the kinds of strings typically expected
in domain name labels. With this, punycode made sure that there was not
too much of an imbalance for the maximum length of labels in terms of
characters between US-ASCII and everything else. Although not
necessarily expressed explicitly, there was very clearly the assumption
that what counted was the length of (what we now call) A-labels, and
that there were no byte-wise restrictions on the length of U-labels. The
"max 63 octets in UTF-8" provision, unless removed, negates all this effort.
> I'm agnostic as to whether this needs to be explained better (in
> either Definitions or Rationale), but (speaking personally, not
> as editor) I would be extremely hesitant to change the
> restriction.
I would be extremely hesitant to INTRODUCE such a restriction that
hasn't been in IDNA2003, was based on wrong or incomplete assumptions,
is contradicted by implementation experience, and, as the above
arguments show, doesn't seem to have been discussed in detail (or I
guess at least John would remember).
It may be argued that it is too late to remove this restriction. But
isn't the purpose of Last Call exactly to catch stuff that has been
overlooked until now, and isn't this exactly such a case?
Regards, Martin.
--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst at it.aoyama.ac.jp