New Last Call: 'Tags for Identifying Languages' to BCP

Wed Dec 15 16:39:31 CET 2004

>  Date: 2004-12-13 01:05
>  From: "Peter Constable" <petercon at microsoft.com>
>  To: ietf-languages at alvestrand.no, ietf at ietf.org

> RFC 3066 does not impose any restrictions on what its replacements might
> do. This is the case with any specification: a given technical
> specification is not a specification of human behaviour and cannot keep
> us from revising the spec or replacing it in any way we may choose.

It's not clear exactly who is meant by "us", but I'll leave
that to a separate message.  It is considered bad practice
for a document which obsoletes another document to depend
on the obsoleted document for definitions or other interpretation
of the meaning of what is contained in the successor document.

> You have mentioned conflict with RFCs 2047 and 2231. RFC 2047 does not
> make reference to language tags. The ABNF of RFC 2231 does not impose
> any limit on the length of language tags. RFC 2231 does contain an implicit
> length issue in that it updates RFC 2047, allowing language tags within
> encoded words, but it does not explicitly identify any upper bound on
> the length of language tags. By reading both RFC 2047 and RFC 2231, one
> finds that they assume that a language tag must be at most 64 characters
> long:

You have missed several important and not-so-subtle points.
One is that RFC 2231 explicitly amends RFC 2047; it says so
clearly in the first-page heading and in the text, and the
RFC Index indicates it as well. Another is that
neither uses ABNF; both use EBNF as defined in RFC 822.
More details on specific missed points follow:

> - the shortest charset names are 2 characters long (e.g. "IT")

Not all charsets have 2-character names. Not all two-character
names which might be assigned are suitable for MIME use. Where
a preferred MIME name is indicated, that should be used.

> - the minimum encoded-text length is 1 character long

That is strictly only true for text that meets all of the
following conditions:
a) is representable in a specified subset of ANSI X3.4, and
   therefore requires no encoding
b) does not use any encoding, even if unnecessary
c) does not use a charset and character sequence involving
   shift sequences (e.g. as in ISO 2022-like charsets)

It also misses the point that using 76+ octets to represent
a single octet is rather wasteful.

Any use of B encoding will require a multiple of 4 octets
of encoded text. Q encoding has some special cases, but
typically requires 3 octets or more.
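
As a rough illustration of those two costs, here is a minimal
Python sketch. The function names and the Q_SAFE set are
assumptions made for this example (Q_SAFE is the conservative
set of characters RFC 2047 lets through unencoded in a phrase);
this is not a complete RFC 2047 encoder.

```python
import base64
import math


def b_encoded_len(n_octets: int) -> int:
    """Length of "B" (base64) output for n_octets of input, padding included."""
    return 4 * math.ceil(n_octets / 3)


# Octets assumed safe to leave as-is in Q encoding (conservative set:
# ASCII letters, digits, and "!", "*", "+", "-", "/").
Q_SAFE = set(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789!*+-/"
)


def q_encode(data: bytes) -> str:
    """Illustrative Q encoding: safe octets literal, space as "_", rest "=XX"."""
    out = []
    for octet in data:
        if octet in Q_SAFE:
            out.append(chr(octet))
        elif octet == 0x20:
            out.append("_")  # special case: space becomes underscore
        else:
            out.append("=%02X" % octet)  # 3 characters per encoded octet
    return "".join(out)


# B encoding always comes out as a multiple of 4 octets.
for n in (1, 2, 3, 4, 8):
    assert b_encoded_len(n) == len(base64.b64encode(bytes(n)))
    print(n, "octets ->", b_encoded_len(n), "octets of B-encoded text")

# An octet outside the safe set costs 3 octets in Q encoding.
print(q_encode(b"\xe9"), len(q_encode(b"\xe9")))
```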

> An encoded-word must contain at least 11 characters that are not part of
> the language tag and have a total length of no more than 75 characters.
> Therefore, an upper bound on language tags that can be used in an RFC
> 2047/2231 encoded-word production is 64 characters.

That is a best-case upper bound: it holds only for text which
requires no encoding at all, at one character per encoded-word.
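
To make the best case concrete, the following Python sketch
assembles an encoded-word in the RFC 2231 form
=?charset*language?encoding?encoded-text?= and measures it.
The helper name encoded_word is hypothetical, and the 64-character
tag is filler of the right length, not a real language tag.

```python
MAX_ENCODED_WORD = 75  # RFC 2047 limit on the length of an encoded-word


def encoded_word(charset: str, language: str, encoding: str, text: str) -> str:
    """Assemble =?charset*language?encoding?encoded-text?= (RFC 2231 form)."""
    return "=?%s*%s?%s?%s?=" % (charset, language, encoding, text)


# Best case: 2-character charset name, 1 character of unencoded text.
filler_tag = "x" * 64  # placeholder only; not a registered or valid tag
word = encoded_word("IT", filler_tag, "Q", "a")
print(len(word), len(word) <= MAX_ENCODED_WORD)

# One more character in the tag pushes the encoded-word over the limit.
too_long = encoded_word("IT", "x" * 65, "Q", "a")
print(len(too_long), len(too_long) <= MAX_ENCODED_WORD)
```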

> In many cases, where 
> the charset tag or encoding is longer, the upper bound on the length of
> languages tags will be less, but the RFC gives no estimate or indication
> of how much less.

The worst case appears to be the charset named
Extended_UNIX_Code_Fixed_Width_for_Japanese (43 characters),
which in fact uses ISO 2022-like sequences. That is the
primary name for that charset; there is no preferred MIME
alias, and the only other alias is the one specified for
printer MIB use. Shifted characters are represented by two
octets, each of which requires encoding. The shift sequences
are 3 octets each, and RFC 2047 requires that an encoded-word
start and end in unshifted state.  Therefore the
minimum amount of encoded text for a single character in
a shifted subset consists of an encoding of: a 3-octet
shift sequence (one octet of which requires encoding), 2 octets
representing the single character (both requiring encoding),
and 3 octets restoring the unshifted state (one requiring
encoding), 8 octets in all. Using B encoding results in 12 octets of encoded
text as a minimum (Q-encoding would require a minimum of 16
octets). So a single character in a shifted subset of that
particular charset, using B encoding, leaves at most 12 octets
for a language-tag.  As mentioned, use of an encoded-word
plus the necessary whitespace around it to represent a
single character is rather wasteful, so a brief language tag
is indicated; fortunately "ja" suffices for text likely to
be used with that charset.
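
The arithmetic above can be checked with a short Python sketch.
The function name max_tag_len and the variable names are
assumptions made for illustration; the octet counts are the
ones given in the preceding paragraph, and the Q-encoding
figure for the tag is simply the same subtraction applied to
16 octets of encoded text.

```python
import math

MAX_ENCODED_WORD = 75  # RFC 2047 limit on the length of an encoded-word
CHARSET = "Extended_UNIX_Code_Fixed_Width_for_Japanese"

# Fixed punctuation in =?charset*language?encoding?encoded-text?= :
# "=?" (2) + "*" (1) + "?" (1) + encoding letter (1) + "?" (1) + "?=" (2)
FIXED_OVERHEAD = 8


def max_tag_len(charset: str, encoded_text_len: int) -> int:
    """Octets left for the language tag in one encoded-word."""
    return MAX_ENCODED_WORD - (FIXED_OVERHEAD + len(charset) + encoded_text_len)


# One shifted character: shift-in (3) + character (2) + shift-out (3).
raw_octets = 3 + 2 + 3
# Octets that must be escaped under Q encoding: 1 + 2 + 1.
escaped_octets = 1 + 2 + 1

b_len = 4 * math.ceil(raw_octets / 3)
q_len = 3 * escaped_octets + (raw_octets - escaped_octets)

print("charset name length:", len(CHARSET))
print("B: encoded text", b_len, "-> tag limit", max_tag_len(CHARSET, b_len))
print("Q: encoded text", q_len, "-> tag limit", max_tag_len(CHARSET, q_len))
print("best case tag limit:", max_tag_len("IT", 1))
```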

> This is a constraint on an application of RFC 3066; it is not a
> constraint on RFC 3066 itself. It is possible that other applications of
> RFC 3066 may impose limits that may be longer or shorter than that
> imposed by RFC 2047/2231.

Yes, and it is sometimes desirable to transfer text and
tag from one application to another.  For example, text in
the body of a message can have language indicated by a
Content-Language header field, where there are up to 997
octets available for a language tag.  However, a response
regarding some portion of that message might well indicate
the topic of the response in the response message's Subject
field, where encoded-word limits apply.

> I see no reason why limits must be added as a 
> constraint in a revision of RFC 3066.

The primary reason for specifying limits is the
proposed removal of the review/registration process,
which currently limits the length of non-private-use
tags.

> It would be a good idea, however, 
> to point out in section 2.1 of the draft that some applications of this
> specification may impose limits on the length of accepted language tags,
> and perhaps to cite RFC 2231 as an example.

As a general principle, that's fine. However, even experts
have been unable to identify the limits quickly and
accurately (I neglected the shift sequence constraints in
an earlier analysis, and Peter missed several points about
encoded text etc.), so I do not think it is sufficient
merely to state that there are limits, with or without a
pointer to RFC 2231 as an example.  Some indication of the
magnitude of worst-case restrictions is at least advisable.
It is also necessary to point out that when one portion of
a protocol imposes generous limits, reuse of the text and
tag in a different portion of that protocol, or in a
different protocol, may subject them to shorter limits that
are not readily apparent from consideration of only a
subset of any single protocol.