New Last Call: 'Tags for Identifying Languages' to BCP

Peter Constable petercon at
Wed Dec 15 20:41:15 CET 2004

> From: ietf-languages-bounces at [mailto:ietf-languages-bounces at]
> On Behalf Of Bruce Lilly

> > By reading both RFC 2047 and RFC 2231, one
> > finds that they assume that a language tag must be at most 64
> > long...

> > - the shortest charset names are 2 characters long (e.g. "IT")
> Not all charsets have 2-character names...

In determining the longest language tag permitted, one must identify the
shortest possibilities for all other components. 

> > - the minimum encoded-text length is 1 character long
> That is strictly only true for text that meets all of the
> following conditions...

Hey, I just said what the EBNF said.

> > An encoded-word must contain at least 11 characters that are not part of
> > the language tag and have a total length of no more than 75 characters.
> > Therefore, an upper bound on language tags that can be used in an
> > RFC 2047/2231 encoded-word production is 64 characters.
> That is a best case upper bound...

I identified it as such.

> The worst case appears to be the charset named
> Extended_UNIX_Code_Fixed_Width_for_Japanese (43 characters)...

> As mentioned, use of an encoded-word
> plus the necessary whitespace around it to represent a
> single character is rather wasteful, so a brief language tag
> is indicated; fortunately "ja" suffices for text likely to
> be used with that charset.

Of course, the length limitations must be balanced between the charset
tag, the language tag and the encoded-word itself.
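
To make that balance concrete, here is a rough sketch of the arithmetic in
Python. It assumes the RFC 2231 form of an encoded-word,
"=?" charset "*" language "?" encoding "?" encoded-text "?=", and the
75-character limit from RFC 2047. The function name and the
min_encoded_text parameter are illustrative only; the default of 1 is the
grammar's minimum, and real text in a multi-byte charset would need more.

```python
# Sketch: how many characters remain for the language tag in an
# RFC 2231 encoded-word once everything else has been accounted for.
MAX_ENCODED_WORD = 75  # RFC 2047 limit on the whole encoded-word


def max_language_tag_length(charset: str, min_encoded_text: int = 1) -> int:
    """Characters left for the language tag (hypothetical helper)."""
    # Fixed overhead: "=?", "*", "?", one encoding letter, "?", "?="
    fixed = len("=?") + len("*") + len("?") + 1 + len("?") + len("?=")
    return MAX_ENCODED_WORD - fixed - len(charset) - min_encoded_text


# Best case: a 2-character charset name such as "IT".
best = max_language_tag_length("IT")  # 64

# The 43-character charset name quoted above.
worst = max_language_tag_length("Extended_UNIX_Code_Fixed_Width_for_Japanese")  # 23
```

Under these assumptions the budget for the language tag shrinks one for one
with the charset name and with every additional character of encoded-text.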

> > I see no reason why limits must be added as a
> > constraint in a revision of RFC 3066.
> The primary reason for specifying limits is due to the
> proposed removal of the review/registration process
> which currently limits the length of non-private-use
> tags.

The review/registration process for RFC 3066 registrations does not
impose pre-defined limits that implementers of RFC 3066 can assume in
their parsers.

> > It would be a good idea, however,
> > to point out in section 2.1 of the draft that some applications of
> > this specification may impose limits on the length of accepted language
> > tags, and perhaps to cite RFC 2231 as an example.
> As a general principle, that's fine, however I would point
> out that given the inability of experts to be able to
> accurately point out the limits quickly...  I do
> not think it is sufficient merely to state the fact that
> there are limits, with or without a pointer to RFC 2231 as
> an example.  Some indication of the magnitude of worst-case
> restrictions is at least advisable...

How is it possible to identify what is the worst-case bound assumed in
implementations that are out there?

How is it possible to predict ahead of time what is the worst-case
length for an RFC 3066-registered language tag?

Neither is possible. In light of that, I think it best to make sure
implementers of the revised RFC 3066 be reminded that some
implementations may impose limits (whether those implementers be
constructing tags or passing them from one process to another), and for
implementers to incorporate robustness into their implementations so
that they can respond gracefully if an unexpectedly-long tag is
encountered -- after all, no matter what limit could be imposed in a
revision to RFC 3066, there's no way to stop malware from sending bad

(How *do* encoded-word parsers react if a bogus charset or language tag
that's 2k octets long is encountered? The encoded-word spec already
allows for segmenting long strings; could it not also be revised to
allow segmenting for the parameters, which would also make it more

Peter Constable
Microsoft Corporation
