Potential Erratum re. length limits in RFC 5890

Kenneth Whistler kenw at sybase.com
Wed Sep 29 00:02:54 CEST 2010


John Klensin said:

> (3) My recollection is that the 252 number came from Ken

Not me.

> or Mark
> after discussion of the number of code points 63 user-abstract
> characters could turn into given combining forms.

It has nothing to do with combining characters.

> The statement
> in the text was written --again, IIR after considerable WG
> discussion-- as advice about how long the strings could get, not
> a normative limit.    At a minimum, I'd like to see if they can
> reconstruct the reasoning for that number, 

The reasoning is quite simple. It has to do with Unicode
encoding forms (and again, nothing whatsoever to do with
combining characters).

63 encoded characters (Unicode code points) have the
following minimum and maximum lengths (expressed in octets),
depending on encoding forms and which particular characters are
involved.

For 63 characters in the ASCII range (U+0020..U+007E)

   UTF-8  =  63 octets
   UTF-16 = 126 octets
   UTF-32 = 252 octets
   
For 63 characters from the supplementary planes (U+10000 and above)

   UTF-8  = 252 octets
   UTF-16 = 252 octets
   UTF-32 = 252 octets
   
Those are the minimum and maximum cases. For some more
typical mix of characters from the BMP, the UTF-8 length
will be >= 63 and <= 252 octets.
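
If anyone wants to check the arithmetic, here is a minimal Python
sketch. The helper name encoded_lengths and the three sample labels
are made up for illustration and come from no spec. The big-endian
codecs are used only so that no byte order mark gets counted.

```python
def encoded_lengths(text):
    """Return the octet length of text in each Unicode encoding form (no BOM)."""
    return {
        "UTF-8": len(text.encode("utf-8")),
        "UTF-16": len(text.encode("utf-16-be")),
        "UTF-32": len(text.encode("utf-32-be")),
    }


LABEL_LENGTH = 63  # code points per label

# Hypothetical sample labels, chosen only to hit the extremes.
ascii_label = "a" * LABEL_LENGTH                   # ASCII range
bmp_label = "\u00e9" * LABEL_LENGTH                # a BMP character outside ASCII
supplementary_label = "\U00010000" * LABEL_LENGTH  # first supplementary code point

for name, label in (
    ("ASCII", ascii_label),
    ("BMP (U+00E9)", bmp_label),
    ("supplementary", supplementary_label),
):
    print(name, encoded_lengths(label))
```

The ASCII and supplementary rows reproduce the two tables above. The
U+00E9 row is one example of an in-between case, at 126 octets in
UTF-8.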
   
That's it... no mumbo-jumbo involved about what a user
perceives as a character or what number of combining
characters can be applied to a base character or any of that.

--Ken   

> or if someone has the
> energy to search the discussion archives, before issuing any
> errata.  



More information about the Idna-update mailing list