Definitions limit on label length in UTF-8

Thu Sep 17 02:31:33 CEST 2009

--On Wednesday, September 16, 2009 17:03 +0900 "Martin J. Dürst"
<duerst at it.aoyama.ac.jp> wrote:

> On 2009/09/15 21:49, John C Klensin wrote:
> 
>> And, fwiw, it is worth remembering that "length in codepoints"
>> is not the same as "length in characters" as understood by
>> most casual users, i.e., "length in print positions" or the
>> equivalent.  For scripts that, because of the way Unicode is
>> structured, require the use of a lot of combining characters,
>> "length in number of print positions" may be significantly
>> shorter than "length in codepoints" -- one can imagine half
>> as long or even shorter with carefully-constructed
>> (pathological) strings.
> 
> [mostly off-topic]
> In addition, "number of print positions" is in itself a rather
> vague and not very useful concept for scripts that don't have
> much of a tradition of using the same width for each character.

Actually, having spent many years struggling with that one in
programming language design, I would say you are correct only to
the extent that the terminology covers two separate concepts.  One
of them is useful only for computing "line lengths" and
justification, and hence not very useful in the general case at
all.  The
other remains useful for all sorts of things involving one of
several "conceptual string lengths", few of which correspond to
"Unicode code point length" or "octet length" except for a
narrow subset of scripts and characters.
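
To make the distinction between these lengths concrete, here is a
minimal sketch in Python.  The function name `lengths` and the sample
string are illustrative assumptions, not anything from this thread.
The third count only approximates a "conceptual string length" by
skipping code points with a non-zero canonical combining class; it is
not full grapheme-cluster segmentation and will miscount in scripts
whose dependent marks have combining class 0.

```python
import unicodedata

def lengths(s):
    """Return (UTF-8 octets, code points, approximate base characters)."""
    octets = len(s.encode("utf-8"))
    code_points = len(s)
    # Rough approximation only: count code points whose canonical
    # combining class is 0, i.e. skip typical combining marks.
    base_chars = sum(1 for ch in s if unicodedata.combining(ch) == 0)
    return octets, code_points, base_chars

# "e" followed by U+0301 COMBINING ACUTE ACCENT, repeated three times
sample = "e\u0301" * 3
print(lengths(sample))  # prints (9, 6, 3)
```

For this sample the three measures all differ: nine octets, six code
points, and three base characters, which is the "half as long" case
described in the quoted text above.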

But, as you say, it is mostly off-topic.

    john