Potential Erratum re. length limits in RFC 5890

Thu Sep 30 03:31:09 CEST 2010

Ken is right about the maximal source label length being at least 252 in the
absence of mapping.
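
As a quick check of the 252 figure, here is a minimal Python sketch that counts octets for 63 code points in each encoding form, following Ken's reasoning quoted below. The sample characters and the helper name octets_per_form are my own choices for illustration; nothing here comes from RFC 5890 itself.

```python
def octets_per_form(label):
    """Return the encoded length of `label`, in octets, per Unicode encoding form."""
    # The big-endian codecs are used so that no byte order mark is counted.
    forms = {"UTF-8": "utf-8", "UTF-16": "utf-16-be", "UTF-32": "utf-32-be"}
    return {name: len(label.encode(codec)) for name, codec in forms.items()}


# 63 code points from the ASCII range (the minimum case).
ascii_label = "a" * 63
# 63 code points from the supplementary planes (the maximum case).
supplementary_label = "\U00010000" * 63

print(octets_per_form(ascii_label))
print(octets_per_form(supplementary_label))
```

Any 63 code points fall between those two cases, so 252 octets is the ceiling as long as nothing is mapped away.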

With the use of mapping, however, it could be substantially longer. This can
happen when a series of characters in the source maps to a single character,
which is then encoded as a single byte in Punycode. That can happen with
IDNA2008, or with UTS46 (or any other mapping preprocessing for IDNA2008).

So it is best to just avoid a mention of a limit like 252; either that or
explain the situation in more detail.

====

Details. As an illustration, suppose that you had the following, in UTF-32.

00 00 00 41 00 00 03 08 00 00 03 04

That sequence, when normalized to NFC, yields

U+01DE ( Ǟ ) LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON, one
character.

Repeat it 57 times. That is 684 octets in UTF-32 (57 × 12).

When normalized under NFC, you get

ǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞǞ

Once the mapping step has lowercased it to U+01DF ( ǟ ), that turns into the
valid Punycode:

xn--bkaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
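
The steps above can be reproduced with a short Python sketch using only the standard library. It is an illustration under stated assumptions, not a full IDNA2008 or UTS46 implementation: in particular, the lowercasing via str.lower() is my stand-in for whatever case mapping the preprocessing step applies, and the variable names are hypothetical.

```python
import unicodedata

# The 12-octet UTF-32 sequence from the example: U+0041 U+0308 U+0304.
source_octets = bytes.fromhex("00 00 00 41 00 00 03 08 00 00 03 04")
sequence = source_octets.decode("utf-32-be")

# Repeat it 57 times and measure the source length in UTF-32 octets.
source = sequence * 57
source_length = len(source.encode("utf-32-be"))

# NFC composes each three-code-point sequence into a single U+01DE.
composed = unicodedata.normalize("NFC", source)

# Assumption: the mapping step lowercases, as UTS46-style preprocessing does.
mapped = composed.lower()

# Punycode-encode the mapped label and add the ACE prefix.
a_label = "xn--" + mapped.encode("punycode").decode("ascii")

print(source_length, len(composed), len(a_label))
print(a_label)
```

So a source string of 684 octets ends up as an A-label that still fits within the 63-octet label limit, which is why a fixed figure like 252 does not hold once mapping is involved.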

Mark

*— Il meglio è l’inimico del bene —*

On Wed, Sep 29, 2010 at 04:27, John C Klensin <klensin at jck.com> wrote:

> Thanks.
>   john
>
>
> --On Tuesday, September 28, 2010 15:02 -0700 Kenneth Whistler
> <kenw at sybase.com> wrote:
>
> > John Klensin said:
> >
> >> (3) My recollection is that the 252 number came from Ken
> >
> > Not me.
> >
> >> or Mark
> >> after discussion of the number of code points 63 user-abstract
> >> characters could turn into given combining forms.
> >
> > It has nothing to do with combining characters.
> >
> >> The statement
> >> in the text was written --again, IIR after considerable WG
> >> discussion-- as advice about how long the strings could get,
> >> not a normative limit.    At a minimum, I'd like to see if
> >> they can reconstruct the reasoning for that number,
> >
> > The reasoning is quite simple. It has to do with Unicode
> > encoding forms (and again, nothing whatsoever to do with
> > combining characters).
> >
> > 63 encoded characters (Unicode code points) have the
> > following minimum and maximum lengths (expressed in octets),
> > depending on encoding forms and which particular characters are
> > involved.
> >
> > For 63 characters in the ASCII range (U+0020..U+007E)
> >
> >    UTF-8  =  63 octets
> >    UTF-16 = 126 octets
> >    UTF-32 = 252 octets
> >
> > For 63 character from the supplementary planes (U+10000 and
> > above)
> >
> >    UTF-8  = 252 octets
> >    UTF-16 = 252 octets
> >    UTF-32 = 252 octets
> >
> > Those are the minimum and maximum cases. For some more
> > typical mix of characters from the BMP, the UTF-8 length
> > will be >= 63 and <= 252 octets.
> >
> > That's it... no mumbo-jumbo involved about what a user
> > perceives of as a character or what number of combining
> > characters can be applied to a base character or any of that.
> >
> > --Ken
> >
> >> or if someone has the
> >> energy to search the discussion archives, before issuing any
> >> errata.
> >