[Technical Errata Reported] RFC5890 (4695)

Wed Sep 28 15:44:03 CEST 2016

Hi!

I think it would be good to move these errata ahead, be it as
“verified” or with alternate wording:  the issue of the required
buffer sizes came up while my team implemented SMTPUTF8, and these
(incorrect) sizes given in the RFC created confusion.

.. Juan

> On May 22, 2016, at 2:40 AM, Martin J. Dürst <duerst at it.aoyama.ac.jp> wrote:
> 
> [What I say below also applies to erratum 4696. If it's desirable to reply to that with the same comment, please let me know.]
> 
> I believe that Juan is essentially right.
> 
> This has come up before, and John Klensin has possibly already noted it for fixing in an eventual update.
> 
> I provided some more detailed calculations with examples in the mail to idna-update at alvestrand.no with the following identifying details:
> Message-ID: <4AACA7E6.1070503 at it.aoyama.ac.jp>
> Date: Sun, 13 Sep 2009 17:05:58 +0900
> 
> Unfortunately, when I currently try to access the archive at
> http://www.alvestrand.no/pipermail/idna-update/ from https://www.ietf.org/wg/concluded/idnabis.html, I get the following:
> 
> ----
> Forbidden
> 
> You don't have permission to access /pipermail/idna-update/ on this server.
> 
> Apache/2.4.7 (Ubuntu) Server at www.alvestrand.no Port 80
> ----
> 
> I have cc'ed Harald in the hope that the archive can be fixed soon.
> 
> 
> I'm copying the relevant part of that mail here:
> 
> >>>>>>>>
> Here are my calculations. After a few tests, one finds out that punycode
> uses a single 'a' to express 'one more of the same character'. The
> question is then how many characters it takes punycode to express the
> first character. Expressing that first character takes more and more
> punycode characters as its Unicode number gets higher, so one has to
> test with the smallest Unicode character that needs a certain number of
> bytes in UTF-8. Going through lengths 1,2,3, and 4 per character in
> UTF-8, we find:
> 
> 1 octet per character in UTF-8:
> aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.org gives
> aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.org
> and has 63 characters, so 63 octets in UTF-8, 126 octets in UTF-16, and
> 252 octets in UTF-32.
> 
> 2 octets per character in UTF-8:
> ¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢.org gives
> xn--8aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.org
> and has 58 characters, so 116 octets in UTF-8, 116 octets in UTF-16, and
> 232 octets in UTF-32. 59 seems possible in theory, but impossible in
> practice.
> 
> 3 octets per character in UTF-8:
> ँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँ.org (using the currently lowest encoded character that needs 3 bytes,
> U+0901, DEVANAGARI SIGN CANDRABINDU), gives
> xn--h1baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.org
> and has 57 characters, so 171 octets in UTF-8, 114 octets in UTF-16, and
> 228 octets in UTF-32. Please note that even characters in the U+0800
> range would need that much, because already a character such as 'ü'
> needs that much.
> 
> Trying to assess how many characters one could use of
> 𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀𐌀.org 
> (using U+10300, OLD ITALIC LETTER A, the lowest character in Unicode 3.2
> that needs 4 bytes in UTF-8) gives
> xn--097caaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.org
> and has 56 characters, so 224 octets in UTF-8, 224 octets in UTF-16, and
> 224 octets in UTF-32.
> 
> Overall, we get a maximum label length of 252 octets for
> UTF-32 (with US-ASCII), and 224 octets in UTF-8 and UTF-16 (with Old
> Italic and the like).
> >>>>>>>>
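
The figures in the quoted calculations can be re-checked with a short
script. This is only a sketch: it assumes Python 3.7 or later and relies
on Python's built-in "punycode" codec, which performs the raw RFC 3492
encoding and does no IDNA validity checking. The helper name label_sizes
is made up for illustration.

```python
def label_sizes(ch: str, count: int) -> dict:
    """Sizes of a label made of `count` copies of the code point `ch`.

    Hypothetical helper: raw Punycode only, no IDNA validation.
    """
    u_label = ch * count
    if u_label.isascii():
        # An all-ASCII label is used as-is; it gets no "xn--" prefix.
        a_label = u_label
    else:
        a_label = "xn--" + u_label.encode("punycode").decode("ascii")
    return {
        "code_points": len(u_label),
        "a_label_len": len(a_label),
        "utf8": len(u_label.encode("utf-8")),
        # The -be variants are used so that no byte order mark is counted.
        "utf16": len(u_label.encode("utf-16-be")),
        "utf32": len(u_label.encode("utf-32-be")),
    }


if __name__ == "__main__":
    # The four test cases from the quoted mail.
    cases = [
        ("a", 63),           # 1 octet per character in UTF-8
        ("\u00a2", 58),      # CENT SIGN, 2 octets
        ("\u0901", 57),      # DEVANAGARI SIGN CANDRABINDU, 3 octets
        ("\U00010300", 56),  # OLD ITALIC LETTER A, 4 octets
    ]
    for ch, count in cases:
        print(f"U+{ord(ch):04X} x {count}: {label_sizes(ch, count)}")
```

In each case the A-label comes out at exactly 63 characters, and adding
one more copy of a non-ASCII character pushes it past the DNS limit.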
> 
> Regards,   Martin.
> 
> On 2016/05/18 00:58, RFC Errata System wrote:
>> The following errata report has been submitted for RFC5890,
>> "Internationalized Domain Names for Applications (IDNA): Definitions and Document Framework".
>> 
>> --------------------------------------
>> You may review the report below and at:
>> http://www.rfc-editor.org/errata_search.php?rfc=5890&eid=4695
>> 
>> --------------------------------------
>> Type: Technical
>> Reported by: Juan Altmayer Pizzorno <juan at sparkpost.com>
>> 
>> Section: 2.3.2.1
>> 
>> Original Text
>> -------------
>> expansion of the A-label form to a U-label may produce strings that are
>> much longer than the normal 63 octet DNS limit (potentially up to 252
>> characters)
>> 
>> Corrected Text
>> --------------
>> expansion of the A-label form to a U-label may produce strings that are
>> much longer than the normal 63 octet DNS limit (potentially up to 59
>> Unicode code points or 236 octets)
>> 
>> Notes
>> -----
>> The contents of U-labels are encoded in the up to 59 ASCII characters (see 2.3.2.1 itself)
>> output by the Punycode algorithm in their corresponding A-labels.  The Punycode
>> decoder (https://tools.ietf.org/html/rfc3492#section-6.2) consumes at least one
>> of those ASCII characters for each code point inserted into the U-label. A U-label,
>> thus, can contain at most 59 Unicode code points.
>> 
>> Since U-labels are defined (in 2.3.2.1) to be expressed in a standard Unicode Encoding
>> Form, and UTF-32, UTF-16 and UTF-8 (as revised by RFC3629) all can encode a code
>> point in at most 4 octets, 236 octets is an upper bound for a U-label's length.
>> 
>> I think it should be possible to derive a tighter bound, but its rationale would likely be
>> less straightforward.
>> 
>> I imagine the number 252 was originally derived by multiplying 63, the maximum
>> length of an A-label (including the "xn--" prefix), by 4, the maximum number of
>> octets needed to represent a code point.
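
The arithmetic behind the two figures can be written out as follows.
This is a sketch only; the constant names are invented for illustration,
and the U+0080 check uses Python's built-in "punycode" codec on a code
point that is not permitted in a real U-label, purely to show that 59
code points is reachable at the raw Punycode level.

```python
MAX_DNS_LABEL_OCTETS = 63      # DNS limit on a single label
ACE_PREFIX = "xn--"            # prefix carried by every A-label
MAX_OCTETS_PER_CODE_POINT = 4  # worst case in UTF-8, UTF-16 and UTF-32

# ASCII characters left for the Punycode output after the prefix.
max_punycode_chars = MAX_DNS_LABEL_OCTETS - len(ACE_PREFIX)

# The decoder consumes at least one of them per inserted code point.
max_code_points = max_punycode_chars

# Upper bound proposed in the erratum.
erratum_bound = max_code_points * MAX_OCTETS_PER_CODE_POINT

# Presumed origin of the figure in the RFC text.
presumed_rfc_figure = MAX_DNS_LABEL_OCTETS * MAX_OCTETS_PER_CODE_POINT

# Raw Punycode check with U+0080 (a control character, illustration only).
raw = ("\u0080" * max_code_points).encode("punycode")

print(max_code_points, erratum_bound, presumed_rfc_figure, len(raw))
```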
>> 
>> Instructions:
>> -------------
>> This erratum is currently posted as "Reported". If necessary, please
>> use "Reply All" to discuss whether it should be verified or
>> rejected. When a decision is reached, the verifying party (IESG)
>> can log in to change the status and edit the report, if necessary.
>> 
>> --------------------------------------
>> RFC5890 (draft-ietf-idnabis-defs-13)
>> --------------------------------------
>> Title               : Internationalized Domain Names for Applications (IDNA): Definitions and Document Framework
>> Publication Date    : August 2010
>> Author(s)           : J. Klensin
>> Category            : PROPOSED STANDARD
>> Source              : Internationalized Domain Names in Applications (Revised)
>> Area                : Applications
>> Stream              : IETF
>> Verifying Party     : IESG
>> 