New Last Call: 'Tags for Identifying Languages' to BCP

Sat Dec 18 03:51:09 CET 2004

>  Date: 2004-12-15 14:41
>  From: "Peter Constable" <petercon at microsoft.com>
>  To: ietf-languages at alvestrand.no, ietf at ietf.org
[...]
> How is it possible to predict ahead of time what is the worst-case
> length for a RFC3066-registered language tag?

In some contexts, the enclosing protocol itself limits the
length (e.g. encoded-words, Content-Language fields in an
Internet Message).

> Neither is possible. In light of that, I think it best to make sure
> implementers of the revised RFC 3066 be reminded that some
> implementations may impose limits (whether those implementers be
> constructing tags or passing them from one process to another), and for
> implementers to incorporate robustness into their implementations so
> that they can respond gracefully if an unexpectedly-long tag is
> encountered -- after all, no matter what limit could be imposed in a
> revision to RFC 3066, there's no way to stop malware from sending bad
> data.
> 
> (How *do* encoded-word parsers react if a bogus charset or language tag
> that's 2k octets long is encountered?

By definition, that cannot happen. No encoded-word may be
longer than 75 octets.  A sequence longer than that limit,
even if it matches all other characteristics of an
encoded-word, is treated as ordinary ASCII text (RFC 2047,
section 6.1, paragraph marked "(1)").  No header field
line may be longer than 998 octets (not counting the
terminating CRLF pair), so 2k is simply not permitted.
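
As a rough illustration of those two limits, here is a
minimal Python sketch.  The names is_encoded_word and
line_within_limit are mine, not from any RFC or library, and
the pattern is a deliberately simplified stand-in for the
RFC 2047 grammar (with the "*language" suffix from RFC 2231);
it assumes only the "B" and "Q" encodings.

```python
import re

MAX_ENCODED_WORD = 75   # RFC 2047: upper bound on an encoded-word
MAX_LINE_OCTETS = 998   # RFC 2822: upper bound on a line, excluding CRLF

# Simplified (assumed) pattern: =?charset[*language]?encoding?encoded-text?=
# Every character class is ASCII-only, so for any matching token the
# character count equals the octet count.
ENCODED_WORD = re.compile(
    r"=\?[A-Za-z0-9_\-]+(?:\*[A-Za-z0-9\-]+)?\?[BbQq]\?[!->@-~]+\?="
)


def is_encoded_word(token):
    """Return True only if token would be recognized as an encoded-word.

    A token over 75 octets is rejected even when it otherwise matches,
    which is the "treated as ordinary ASCII text" case.
    """
    if len(token) > MAX_ENCODED_WORD:
        return False
    return ENCODED_WORD.fullmatch(token) is not None


def line_within_limit(line):
    """Check one header line, passed without its terminating CRLF."""
    return len(line.encode("utf-8")) <= MAX_LINE_OCTETS
```

Under these assumptions, a token such as
"=?iso-8859-1*en?Q?Hello?=" is recognized, while the same
shape with a 2000-octet language tag is not; a parser built
this way simply passes the oversized token through as text.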

> The encoded-word spec already 
> allows for segmenting long strings;

To be a bit more precise, it permits the text being encoded
to be split across multiple encoded-words (with several
restrictions); the encoded-words themselves cannot be
segmented or split in any way.  That is because an
encoded-word is treated by a MIME-unaware application as a
single RFC [2]822 word.
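
A sketch of what "split across multiple encoded-words" can
look like in practice, assuming UTF-8 and the "B" encoding.
The function name encode_words is hypothetical.  Each word
stays within 75 octets and carries only whole characters,
which is one of the restrictions mentioned above; the words
would then be separated by linear white space in the field.

```python
import base64

MAX_ENCODED_WORD = 75
PREFIX = "=?utf-8?B?"   # assumed charset and encoding for this sketch
SUFFIX = "?="


def encode_words(text):
    """Split text into RFC 2047-style encoded-words of at most 75 octets."""
    # Base64 maps every 3 raw octets to 4 characters, so work out how
    # many raw octets fit between the prefix and the suffix.
    room = MAX_ENCODED_WORD - len(PREFIX) - len(SUFFIX)
    max_raw = (room // 4) * 3

    words = []
    chunk = b""
    for ch in text:
        octets = ch.encode("utf-8")
        # Start a new word rather than cut a multi-octet character in two.
        if chunk and len(chunk) + len(octets) > max_raw:
            words.append(PREFIX + base64.b64encode(chunk).decode("ascii") + SUFFIX)
            chunk = b""
        chunk += octets
    if chunk:
        words.append(PREFIX + base64.b64encode(chunk).decode("ascii") + SUFFIX)
    return words
```

Note that the splitting happens before encoding: no
individual encoded-word is ever broken, so a MIME-unaware
application still sees each one as a single word.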

> could it not also be revised to 
> allow segmenting for the parameters, which would also make it more
> robust?)

If you're referring to RFC 2231 extensions to Content-Type
and Content-Disposition field parameters, that's a separate
matter.

In general, though, MIME has been around for more than a
decade and Internet Messages for more than three decades,
with a substantial installed base of interoperating
implementations, in what has become one of the core Internet
protocols.  Any changes would therefore have to be backwards
compatible, or be negotiated between sender and receiver at
the same protocol level, or come with a lengthy transition
period before pulling the rug out from under existing
implementations.  It's probably more likely that a separate
next-generation system would be implemented first.