Potential Erratum re. length limits in RFC 5890
Kenneth Whistler
kenw at sybase.com
Wed Sep 29 00:02:54 CEST 2010
John Klensin said:
> (3) My recollection is that the 252 number came from Ken
Not me.
> or Mark
> after discussion of the number of code points 63 user-abstract
> characters could turn into given combining forms.
It has nothing to do with combining characters.
> The statement
> in the text was written --again, IIR after considerable WG
> discussion-- as advice about how long the strings could get, not
> a normative limit. At a minimum, I'd like to see if they can
> reconstruct the reasoning for that number,
The reasoning is quite simple. It has to do with Unicode
encoding forms (and again, nothing whatsoever to do with
combining characters).
63 encoded characters (Unicode code points) have the
following minimum and maximum lengths (expressed in octets),
depending on encoding forms and which particular characters are
involved.
For 63 characters in the ASCII range (U+0020..U+007E)
UTF-8 = 63 octets
UTF-16 = 126 octets
UTF-32 = 252 octets
For 63 characters from the supplementary planes (U+10000 and above)
UTF-8 = 252 octets
UTF-16 = 252 octets
UTF-32 = 252 octets
Those are the minimum and maximum cases. For some more
typical mix of characters from the BMP, the UTF-8 length
will be >= 63 and <= 252 octets.
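The figures above can be checked mechanically. What follows is only a minimal sketch in Python, not anything taken from the RFC or the WG discussion: the helper name encoded_lengths and the three sample labels are made up for illustration, and the "-le" codec variants are used so that no byte order mark is counted in the lengths.

```python
def encoded_lengths(text):
    """Return the length in octets of text in each Unicode encoding form."""
    return {
        "UTF-8": len(text.encode("utf-8")),
        # The "-le" codecs emit no byte order mark, so only the
        # characters themselves are counted.
        "UTF-16": len(text.encode("utf-16-le")),
        "UTF-32": len(text.encode("utf-32-le")),
    }


# Hypothetical 63-character samples, chosen only to hit each case.
samples = {
    "ASCII (U+0061)": "a" * 63,
    "BMP (U+20AC)": "\u20ac" * 63,
    "supplementary (U+10000)": "\U00010000" * 63,
}

for name, label in samples.items():
    print(name, encoded_lengths(label))
```

The ASCII sample gives 63, 126 and 252 octets, and the supplementary-plane sample gives 252 in all three forms. The BMP sample (U+20AC takes three octets in UTF-8) lands in between at 189 octets of UTF-8.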
That's it... no mumbo-jumbo involved about what a user
perceives as a character or what number of combining
characters can be applied to a base character or any of that.
--Ken
> or if someone has the
> energy to search the discussion archives, before issuing any
> errata.