Definitions limit on label length in UTF-8

Tue Sep 15 03:57:58 CEST 2009

On 2009/09/14 23:47, John C Klensin wrote:
>
> --On Monday, September 14, 2009 12:24 +0200 Harald Alvestrand
> <harald at alvestrand.no>  wrote:

>> Documenting these 3 numbers as "a U-label can't get longer
>> than that and fit into an A-label" seems sufficient to avoid
>> the spectre of "unlimited length" to me.
>
> While people will probably want to debate the precise way I
> handled this, I simply documented the maximum (252) in Defs-11
> (now posted).  The reasons were:
>
> (i) I didn't think much would be served by the added complexity.
> Those who want to try to save a few octets can make their own
> calculations.

Fine with me.

> (ii) I think Martin's calculation may be wrong.  For example, if
> one built a label entirely with characters that require
> surrogate pairs, UTF-16 and UTF-32 are the same length.

For labels built entirely from characters that require surrogate pairs,
the lengths are indeed the same (at most 224 octets). The longer overall
length limit for UTF-32 stems from the fact that even US-ASCII characters
take four bytes in UTF-32.

Of course, there is still some possibility that there's an error in my
calculations. Cross-checking would be welcome.

As for the limit in code points, it is 63 code points. But in all
cases, these limits only apply to valid Unicode, not to stuff before
mapping.
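
As a rough cross-check of the figures above, here is a minimal Python
sketch. The helper name encoded_lengths and the two sample labels are
assumptions chosen for illustration; U+10000 merely stands in for "a
character that requires a surrogate pair", and the sketch does not test
whether either label is valid under IDNA. It relies only on Python's
built-in utf-8, utf-16-be, utf-32-be and punycode codecs.

```python
# Hypothetical helper (not from any draft): octet length of a label
# in each Unicode encoding form.
def encoded_lengths(label):
    return {
        "utf-8": len(label.encode("utf-8")),
        # The "-be" variants avoid counting a byte order mark.
        "utf-16": len(label.encode("utf-16-be")),
        "utf-32": len(label.encode("utf-32-be")),
    }

# Sample 1 (assumption): 63 US-ASCII characters, the code point limit.
ascii_label = "a" * 63

# Sample 2 (assumption): 56 copies of U+10000, a character outside the
# BMP that needs a surrogate pair in UTF-16.
supplementary_label = "\U00010000" * 56

ascii_lengths = encoded_lengths(ascii_label)
supplementary_lengths = encoded_lengths(supplementary_label)

# Length of the corresponding ASCII form: "xn--" plus the Punycode output.
a_label_length = len("xn--") + len(supplementary_label.encode("punycode"))

print(ascii_lengths)
print(supplementary_lengths)
print(a_label_length)
```

With these samples, the all-ASCII label takes 4 * 63 = 252 octets in
UTF-32, while the all-supplementary label takes 4 * 56 = 224 octets in
each of the three encoding forms. This only checks two particular
labels; it is not a proof that no other label exceeds these lengths.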

Regards,   Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst at it.aoyama.ac.jp