[Technical Errata Reported] RFC5890 (4695)

Martin J. Dürst duerst at it.aoyama.ac.jp
Sun May 22 08:40:34 CEST 2016

[What I say below also applies to erratum 4696. If it's desirable to 
reply to that with the same comment, please let me know.]

I believe that Juan is essentially right.

This has come up before, and John Klensin may already have noted it
for fixing in an eventual update.

I provided some more detailed calculations with examples in the mail to 
idna-update at alvestrand.no with the following identifying details:
Message-ID: <4AACA7E6.1070503 at it.aoyama.ac.jp>
Date: Sun, 13 Sep 2009 17:05:58 +0900

Unfortunately, when I currently try to access the archive at
http://www.alvestrand.no/pipermail/idna-update/ from 
https://www.ietf.org/wg/concluded/idnabis.html, I get the following:


You don't have permission to access /pipermail/idna-update/ on this server.

Apache/2.4.7 (Ubuntu) Server at www.alvestrand.no Port 80

I have cc'ed Harald in the hope that the archive can be fixed soon.

I'm copying the relevant part of that mail here:

Here are my calculations. After a few tests, one finds out that punycode
uses a single 'a' to express 'one more of the same character'. The
question is then how many characters it takes punycode to express the
first character. Expressing that first character takes more and more
punycode characters as its Unicode number gets higher, so one has to
test with the smallest Unicode character that needs a certain number of
bytes in UTF-8. Going through lengths 1,2,3, and 4 per character in
UTF-8, we find:
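
The cost of the first character versus each further one can be checked
with a few lines of Python. This is only a minimal sketch, assuming
Python 3 and its built-in "punycode" codec; the helper name
punycode_cost is made up for illustration and is not from the original
mail.

```python
def punycode_cost(ch):
    """Return (ASCII characters for the first ch, ASCII characters per further ch).

    Assumes ch is a single non-ASCII code point, so the encoded form has
    no basic-code-point prefix and no '-' delimiter.
    """
    one = len(ch.encode("punycode"))
    two = len((ch * 2).encode("punycode"))
    return one, two - one


# Sample characters: U+00A2, U+00FC, U+0901 and U+10300.
for ch in ("\u00a2", "\u00fc", "\u0901", "\U00010300"):
    first, extra = punycode_cost(ch)
    print(f"U+{ord(ch):04X}: first={first}, each additional={extra}")
```

The second number is 1 for every sample, which is the single 'a'
mentioned above; only the first number grows with the code point.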

1 octet per character in UTF-8:
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.org gives
and has 63 characters, so 63 octets in UTF-8, 126 octets in UTF-16, and
252 octets in UTF-32.

2 octets per character in UTF-8:
¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢.org gives
and has 58 characters, so 116 octets in UTF-8, 116 octets in UTF-16, and
232 octets in UTF-32. 59 seems possible in theory, but impossible in

3 octets per character in UTF-8:
ँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँ.org
(using the currently lowest encoded character that needs 3 bytes,
and has 57 characters, so 171 octets in UTF-8, 114 octets in UTF-16, and
228 octets in UTF-32. Please note that even characters in the U+0800
range would need that many Punycode characters, because a character as
low as 'ü' (U+00FC) already needs that many.

Trying to assess how many characters one could use of

(using U+10300, OLD ITALIC LETTER A, the lowest character in Unicode 3.2
that needs 4 bytes in UTF-8) gives
and has 56 characters, so 224 octets in UTF-8, 224 octets in UTF-16, and
224 octets in UTF-32.

Overall, we get a maximum label length of 252 octets for UTF-32 (with
US-ASCII), and of 224 octets for UTF-8 and UTF-16 (with Old Italic and
the like).
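
For anyone who wants to rerun these numbers, here is a rough sketch in
Python 3. It assumes the built-in "punycode" codec and a 63-octet limit
on the A-label including the "xn--" prefix; the names A_LABEL_LIMIT,
max_repeats and octets are hypothetical helpers, not anything from the
RFCs or from the original mail.

```python
A_LABEL_LIMIT = 63  # octets in a DNS label, including the "xn--" prefix


def max_repeats(ch, limit=A_LABEL_LIMIT):
    """Largest n such that ch repeated n times still fits in one label."""
    if ch.isascii():
        return limit  # an all-ASCII label is used as-is, without Punycode
    n = 0
    while len("xn--") + len((ch * (n + 1)).encode("punycode")) <= limit:
        n += 1
    return n


def octets(ch):
    """Return (code points, UTF-8, UTF-16, UTF-32 octets) for the longest label."""
    label = ch * max_repeats(ch)
    return (
        len(label),
        len(label.encode("utf-8")),
        len(label.encode("utf-16-le")),  # "-le" avoids counting a BOM
        len(label.encode("utf-32-le")),
    )


# One sample character per UTF-8 length (1, 2, 3 and 4 octets).
for ch in ("a", "\u00a2", "\u0901", "\U00010300"):
    print(f"U+{ord(ch):04X}", octets(ch))
```

Taking the maximum of each column over the four rows gives the overall
per-encoding figures; characters other than these four samples may of
course need more Punycode characters and so give shorter labels.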

Regards,   Martin.

On 2016/05/18 00:58, RFC Errata System wrote:
> The following errata report has been submitted for RFC5890,
> "Internationalized Domain Names for Applications (IDNA): Definitions and Document Framework".
> --------------------------------------
> You may review the report below and at:
> http://www.rfc-editor.org/errata_search.php?rfc=5890&eid=4695
> --------------------------------------
> Type: Technical
> Reported by: Juan Altmayer Pizzorno <juan at sparkpost.com>
> Section:
> Original Text
> -------------
> expansion of the A-label form to a U-label may produce strings that are
> much longer than the normal 63 octet DNS limit (potentially up to 252
> characters)
> Corrected Text
> --------------
> expansion of the A-label form to a U-label may produce strings that are
> much longer than the normal 63 octet DNS limit (potentially up to 59
> Unicode code points or 236 octets)
> Notes
> -----
> The contents of U-labels are encoded in the up to 59 ASCII characters (see itself)
> output by the Punycode algorithm in their corresponding A-labels.  The Punycode
> decoder (https://tools.ietf.org/html/rfc3492#section-6.2) consumes at least one
> of those ASCII characters for each code point inserted into the U-label. An U-label,
> thus, can contain at the most 59 Unicode code points.
> Since U-labels are defined (in to be expressed in a standard Unicode Encoding
> Form, and UTF-32, UTF-16 and UTF-8 (as revised by RFC3629) all can encode a code
> point in at most 4 octets, 236 octets is an upper bound for an U-label's length.
> I think it should be possible to derive a tighter bound, but its rationale would likely be
> less straightforward.
> I imagine the number 252 was originally derived by multiplying 63, the maximum
> length of an A-label (including the "xn--" prefix), by 4, the maximum number of
> octets needed to represent a code point.
> Instructions:
> -------------
> This erratum is currently posted as "Reported". If necessary, please
> use "Reply All" to discuss whether it should be verified or
> rejected. When a decision is reached, the verifying party (IESG)
> can log in to change the status and edit the report, if necessary.
> --------------------------------------
> RFC5890 (draft-ietf-idnabis-defs-13)
> --------------------------------------
> Title               : Internationalized Domain Names for Applications (IDNA): Definitions and Document Framework
> Publication Date    : August 2010
> Author(s)           : J. Klensin
> Category            : PROPOSED STANDARD
> Source              : Internationalized Domain Names in Applications (Revised)
> Area                : Applications
> Stream              : IETF
> Verifying Party     : IESG
> _______________________________________________
> Idna-update mailing list
> Idna-update at alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/idna-update
