[Technical Errata Reported] RFC5890 (4695)

Sun May 22 18:54:38 CEST 2016

(adding Patrik, Pete, Paul, and Cary as well -- I don't believe
the idna-update list is being archived, certainly the archives
are not accessible, and I have no idea whether addresses are
up-to-date.  I have tried to call Harald's attention to that for
both the main and design team lists on several occasions, but
everyone is busy.  Because of the additions, I'm including the
whole thread from Martin's message below my comments.)

--On Sunday, May 22, 2016 15:40 +0900 "Martin J. Dürst"
<duerst at it.aoyama.ac.jp> wrote:

> [What I say below also applies to erratum 4696. If it's
> desirable to reply to that with the same comment, please let
> me know.]
> 
> I believe that Juan is essentially right.
> 
> This has come up before, and was possibly already noted by
> John Klensin for fixing in an eventual update.

Yes, it has been noted.  It has also been noted that there
doesn't seem to be significant excitement in the community for
such an update.  In addition, doing serious work on an update,
with or without trying to advance IDNA2008 to Full Standard, is
presumably blocked by the IETF's lack of ability and/or
inclination to address the non-decomposing character problem [1].

I can add only two things to the present discussion.  First, I
don't know where the number in question came from.  I am quite
certain that it did not come out of any calculation I made, but
I can't prove that or try to figure out where it did come from
without spending many hours digging through the mail archives
and the various versions of the I-Ds.  Second, I have absolutely
no objection to a "hold for future revision" erratum that
identifies the issue.  As far as the quest for a real number is
concerned, the variability involved with different string
scenarios is such that I believe the document should include
both a minimum and a maximum string length, not one or the
other.  Equally important, while Juan and I corresponded about
this a bit before the proposed erratum was submitted, I have
serious doubts that this question is worth a lot of community
(or even AD and author) energy while we defer questions such as
the non-decomposing character one or even the viability of
IDNA2008 as long as general practice in many places is for
applications looking up names to follow the recommendations of
UTR#46 (and the model of IDNA2003) and not make string validity
checks before applying the Punycode algorithm.

  best,
    john

[1] See the now-expired draft-klensin-idna-5892upd-unicode70-04.
Note that a somewhat improved -05 has been lying around, and
gradually acquiring new material, since well before -04 expired,
but there has been no point in posting it as long as neither
the IETF nor the IAB's i18n program are willing and able to
engage.

   --------------

> 
> I provided some more detailed calculations with examples in
> the mail to idna-update at alvestrand.no with the following
> identifying details:
> Message-ID: <4AACA7E6.1070503 at it.aoyama.ac.jp>
> Date: Sun, 13 Sep 2009 17:05:58 +0900
> 
> Unfortunately, when I currently try to access the archive at
> http://www.alvestrand.no/pipermail/idna-update/ from
> https://www.ietf.org/wg/concluded/idnabis.html, I get the
> following:
> 
> ----
> Forbidden
> 
> You don't have permission to access /pipermail/idna-update/ on
> this server.
> 
> Apache/2.4.7 (Ubuntu) Server at www.alvestrand.no Port 80
> ----
> 
> I have cc'ed Harald in the hope that the archive can be fixed
> soon.
> 
> 
> I'm copying the relevant part of that mail here:
> 
>  >>>>>>>>
> Here are my calculations. After a few tests, one finds out that
> punycode uses a single 'a' to express 'one more of the same
> character'. The question is then how many characters it takes
> punycode to express the first character. Expressing that first
> character takes more and more punycode characters as its Unicode
> number gets higher, so one has to test with the smallest Unicode
> character that needs a certain number of bytes in UTF-8. Going
> through lengths 1, 2, 3, and 4 per character in UTF-8, we find:
> 
> 1 octet per character in UTF-8:
> aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
> a.org gives
> aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
> a.org
> and has 63 characters, so 63 octets in UTF-8, 126 octets in UTF-16,
> and 252 octets in UTF-32.
> 
> 2 octets per character in UTF-8:
> ¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢
> ¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢¢.org
> gives
> xn--8aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
> a.org
> and has 58 characters, so 116 octets in UTF-8, 116 octets in UTF-16,
> and 232 octets in UTF-32. 59 seems possible in theory, but impossible
> in practice.
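
The "single 'a' per repeated character" observation can be checked
with a short sketch. It assumes Python's built-in "punycode" codec,
which produces the RFC 3492 encoding without the "xn--" prefix; the
helper name ace_label is hypothetical and performs no IDNA validity
checks.

```python
ACE_PREFIX = "xn--"

def ace_label(label: str) -> str:
    """Return the ACE form of a single label (illustrative, no validation)."""
    if label.isascii():
        return label  # all-ASCII labels are left unchanged
    return ACE_PREFIX + label.encode("punycode").decode("ascii")

# U+00A2 (CENT SIGN) is the character used in the 2-octet example.
for n in (1, 2, 3, 58, 59):
    encoded = ace_label("\u00a2" * n)
    # Each extra copy of the same character adds exactly one 'a'.
    print(n, len(encoded), encoded[:10])

# 58 copies give a 63-octet ACE label; 59 copies give 64 octets,
# which is over the 63-octet label limit.
```
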
> 
> 3 octets per character in UTF-8:
> ँँँँँँँँँँँँँँँँँँँँँँँँँँँँँँ
> ँँँँँँँँँँँँँँँँँँँँँँँँँँँ.org
> (using the currently lowest encoded character that needs 3 bytes,
> U+0901, DEVANAGARI SIGN CANDRABINDU), gives
> xn--h1baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
> a.org
> and has 57 characters, so 171 octets in UTF-8, 114 octets in UTF-16,
> and 228 octets in UTF-32. Please note that even characters in the
> U+0800 range would need that much, because already a character such
> as 'ü' needs that much.
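
The character and octet counts for the three cases so far can be
reproduced with the following sketch. It assumes Python's built-in
"punycode" codec and a 63-octet limit on the ACE form of a label;
to_ace, longest_run, and octets are hypothetical helper names, and no
IDNA validity checks are made.

```python
def to_ace(label: str) -> str:
    """ACE form of one label (illustrative, no IDNA validation)."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")

def longest_run(ch: str, limit: int = 63) -> int:
    """Largest n for which ch repeated n times encodes to <= limit octets."""
    n = 0
    while len(to_ace(ch * (n + 1))) <= limit:
        n += 1
    return n

def octets(label: str) -> tuple:
    """Octet counts of the label in UTF-8, UTF-16, and UTF-32 (no BOM)."""
    return (len(label.encode("utf-8")),
            len(label.encode("utf-16-be")),
            len(label.encode("utf-32-be")))

# 'a' (1 octet in UTF-8), U+00A2 (2 octets), U+0901 (3 octets)
for ch in ("a", "\u00a2", "\u0901"):
    n = longest_run(ch)
    u8, u16, u32 = octets(ch * n)
    print(f"U+{ord(ch):04X}: {n} characters, "
          f"{u8} / {u16} / {u32} octets in UTF-8 / UTF-16 / UTF-32")
```

These are counts for a label made of one repeated character only;
labels that mix characters will generally encode less compactly.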
> 
> Trying to assess how many characters one could use of
> p