[Technical Errata Reported] RFC5890 (4695)

Mon May 23 16:28:05 CEST 2016

For what my thoughts (as an implementer) are worth:
It was primarily seeing those statements in terms of
code points (rather than octets), and not the actual
number, that I found confusing and worthy of an erratum.

IMO this is a tiny issue and the erratum would be
sufficient to save time for anyone stumbling upon it.

I note that the aaa…aaa label from the list isn’t a
U-label, as it lacks a non-ASCII character.

.. Juan

> On May 22, 2016, at 12:54 PM, John C Klensin <john-ietf at jck.com> wrote:
> 
> (adding Patrik, Pete, Paul, and Cary as well -- I don't believe
> the idna-update list is being archived, certainly the archives
> are not accessible, and I have no idea whether addresses are
> up-to-date.  I have tried to call Harald's attention to that for
> both the main and design team lists on several occasions, but
> everyone is busy.   Because of the additions, I'm including the
> whole thread from Martin's message below my comments. )
> 
> --On Sunday, May 22, 2016 15:40 +0900 "Martin J. Dürst"
> <duerst at it.aoyama.ac.jp> wrote:
> 
>> [What I say below also applies to erratum 4696. If it's
>> desirable to reply to that with the same comment, please let
>> me know.]
>> 
>> I believe that Juan is essentially right.
>> 
>> This has come up before, and was possibly already noted by
>> John Klensin for fixing in an eventual update.
> 
> Yes, it has been noted.  It has also been noted that there
> doesn't seem to be significant excitement in the community for
> such an update.  In addition, doing serious work on an update,
> with or without trying to advance IDNA2008 to Full Standard, is
> presumably blocked by the IETF's lack of ability and/or
> inclination to address the non-decomposing character problem [1].
> 
> I can add only two things to the present discussion.  First, I
> don't know where the number in question came from.  I am quite
> certain that it did not come out of any calculation I made, but
> I can't prove that or try to figure out where it did come from
> without spending many hours digging through the mail archives
> and the various versions of the I-Ds.  Second, I have absolutely
> no objection to a "hold for future revision" erratum that
> identifies the issue.  As far as the quest for a real number is
> concerned, the variability involved with different string
> scenarios is such that I believe the document should include
> both a minimum and a maximum string length, not one or the
> other.  Equally important, while Juan and I corresponded about
> this a bit before the proposed erratum was submitted, I have
> serious doubts that this question is worth a lot of community
> (or even AD and author) energy while we defer questions such as
> the non-decomposing character one or even the viability of
> IDNA2008 as long as general practice in many places is for
> applications looking up names to follow the recommendations of
> UTR#46 (and the model of IDNA2003) and not make string validity
> checks before applying the Punycode algorithm.
> 
>  best,
>    john
> 
> [1] See the now-expired draft-klensin-idna-5892upd-unicode70-04.
> Note that a somewhat improved -05 has been lying around, and
> gradually acquiring new material, since well before -04 expired,
> but there has been no point in posting it as long as neither
> the IETF nor the IAB's i18n program are willing and able to
> engage.
> 
> 
>   --------------
> 
>> 
>> I provided some more detailed calculations with examples in
>> the mail to idna-update at alvestrand.no with the following
>> identifying details:
>> Message-ID: <4AACA7E6.1070503 at it.aoyama.ac.jp>
>> Date: Sun, 13 Sep 2009 17:05:58 +0900
>> 
>> Unfortunately, when I currently try to access the archive at
>> http://www.alvestrand.no/pipermail/idna-update/ from
>> https://www.ietf.org/wg/concluded/idnabis.html, I get the
>> following:
>> 
>> ----
>> Forbidden
>> 
>> You don't have permission to access /pipermail/idna-update/ on
>> this server.
>> 
>> Apache/2.4.7 (Ubuntu) Server at www.alvestrand.no Port 80
>> ----
>> 
>> I have cc'ed Harald in the hope that the archive can be fixed
>> soon.
>> 
>> 
>> I'm copying the relevant part of that mail here:
>> 
>>>>>>>>>> 
>> Here are my calculations. After a few tests, one finds out that
>> punycode uses a single 'a' to express 'one more of the same
>> character'. The question is then how many characters it takes
>> punycode to express the first character. Expressing that first
>> character takes more and more punycode characters as its Unicode
>> number gets higher, so one has to test with the smallest Unicode
>> character that needs a certain number of bytes in UTF-8. Going
>> through lengths 1, 2, 3, and 4 per character in UTF-8, we find:
>> 
>> 1 octet per character in UTF-8:
>> aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
>> a.org gives
>> aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
>> a.org
>> and has 63 characters, so 63 octets in UTF-8, 126 octets in UTF-16,
>> and 252 octets in UTF-32.
>> 
>> 2 octets per character in UTF-8:
>> ¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢
>> ¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢.org
>> gives
>> xn--8aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
>> a.org
>> and has 58 characters, so 116 octets in UTF-8, 116 octets in UTF-16,
>> and 232 octets in UTF-32. 59 seems possible in theory, but impossible
>> in practice.
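
The octet figures quoted above follow mechanically from the code-point
count and the width of each character in the three encoding forms. A
minimal Python sketch for checking them is given here; `octet_counts` is
a hypothetical helper name (it appears neither in RFC 5890 nor in the
quoted mail), and the big-endian codec names are chosen only so that
Python does not prepend a byte order mark:

```python
def octet_counts(label: str) -> dict:
    """Return the code-point count and encoded sizes of one label."""
    return {
        "code_points": len(label),
        "utf8": len(label.encode("utf-8")),
        # The "-be" variants emit no byte order mark, so only the
        # label's own code units are counted.
        "utf16": len(label.encode("utf-16-be")),
        "utf32": len(label.encode("utf-32-be")),
    }


if __name__ == "__main__":
    # The two cases worked through so far: 'a' x 63 and U+00A2 x 58.
    for char, repeats in (("a", 63), ("\u00a2", 58)):
        print(f"U+{ord(char):04X} x {repeats}:", octet_counts(char * repeats))
```

The sketch counts octets only; it says nothing about whether a label is
valid under IDNA2008.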
>> 
>> 3 octets per character in UTF-8:
>> ँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँ
>> ँँँँँँँँँँँँँँँँँँँँँँँँँँँ.org (using the
>> currently lowest encoded character that needs 3 bytes,
>> U+0901, DEVANAGARI SIGN CANDRABINDU), gives
>> xn--h1baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
>> a.org
>> and has 57 characters, so 171 octets in UTF-8, 114 octets in UTF-16,
>> and 228 octets in UTF-32. Please note that even characters in the
>> U+0800 range would need that much, because already a character such
>> as 'ü' needs that much.
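
The per-character search described above can be reproduced with Python's
built-in `punycode` codec. This is only a sketch under stated
assumptions: `ace_label` and `max_repeats` are hypothetical names, the
limit used is the 63-octet DNS label limit, and no IDNA2008 validity
checks or mappings are applied, so it measures length and nothing else.

```python
ACE_PREFIX = "xn--"
MAX_LABEL_OCTETS = 63  # DNS limit on a single label


def ace_label(label: str) -> str:
    """ASCII form of a label: unchanged if ASCII, else prefix + Punycode.

    Assumption: no IDNA validity checking or mapping is done here.
    """
    if label.isascii():
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def max_repeats(char: str) -> int:
    """Largest n for which char * n still fits in one DNS label."""
    n = 0
    while len(ace_label(char * (n + 1))) <= MAX_LABEL_OCTETS:
        n += 1
    return n


if __name__ == "__main__":
    # One test character per UTF-8 width covered in the mail: 1, 2, 3.
    for char in ("a", "\u00a2", "\u0901"):
        n = max_repeats(char)
        utf8_octets = len((char * n).encode("utf-8"))
        print(f"U+{ord(char):04X}: {n} characters, "
              f"{utf8_octets} octets in UTF-8, ACE form {ace_label(char * n)}")
```

For the three test characters this should report 63, 58, and 57
characters respectively, in line with the figures in the quoted mail.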
>> 
>> Trying to assess how many characters one could use of
>> p
> 
> 
> 
>