New Last Call: 'Tags for Identifying Languages' to BCP

Mon Dec 13 07:05:04 CET 2004

> From: ietf-languages-bounces at alvestrand.no [mailto:ietf-languages-
> bounces at alvestrand.no] On Behalf Of Bruce Lilly

> > As mentioned, the limit is imposed by other tight constraints on
> > 'grandfathered'; you have already identified that the longest
> > registered tag under RFC 3066 is 11 octets in length, therefore a
> > 'grandfathered' tag can be at most 11 octets in length.
> 
> But the constraints probably aren't as tight as you
> believe; the draft specifically permits a future
> revision to allow a primary subtag longer than
> 8 octets, or not purely alphabetic, etc.

RFC 3066 does not impose any restrictions on what its replacements might
do. This is the case with any specification: a given technical
specification is not a specification of human behaviour and cannot keep
us from revising the spec or replacing it in any way we may choose.

> One would hope that under RFC 3066 rules, that the
> reviewer, a list subscriber, or an Applications Area
> Director would recognize the conflict with RFCs 2047/2231
> and would object.

You have mentioned conflict with RFCs 2047 and 2231. RFC 2047 does not
make reference to language tags. The ABNF of RFC 2231 does not impose
any limit on the length of language tags. RFC 2231 does contain an implicit
length issue in that it updates RFC 2047, allowing language tags within
encoded words, but it does not explicitly identify any upper bound on
the length of language tags. By reading both RFC 2047 and RFC 2231, one
finds that they assume that a language tag must be at most 64 characters
long:

- the maximum length of an encoded-word is 75 characters (stated in the
prose of RFC 2047 rather than in its ABNF)

- the encoded-word production of RFC 2047 includes 6 literal characters
("=?", "?", "?" and "?=")

- RFC 2231 adds one delimiting character "*" between the charset and
language tag

- the shortest charset names are 2 characters long (e.g. "IT")

- the shortest encoding length is 1 character long

- the minimum encoded-text length is 1 character long

An encoded-word must contain at least 11 characters that are not part of
the language tag and have a total length of no more than 75 characters.
Therefore, an upper bound on language tags that can be used in an RFC
2047/2231 encoded-word production is 64 characters. In many cases, where
the charset tag or encoding is longer, the upper bound on the length of
language tags will be less, but the RFC gives no estimate or indication
of how much less.
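
The arithmetic above can be sketched as follows. This is only an
illustration of the counting argument; the constant and function names
are hypothetical and appear in neither RFC.

```python
# Counting argument for the language-tag bound in an RFC 2047/2231
# encoded-word. All names here are illustrative, not taken from the RFCs.
MAX_ENCODED_WORD = 75   # limit stated in the prose of RFC 2047
LITERALS = 6            # "=?", "?", "?" and "?="
LANG_DELIMITER = 1      # "*" between charset and language (RFC 2231)
MIN_ENCODING = 1        # encoding name is at least 1 character


def max_language_tag_length(charset: str = "IT", encoded_text_len: int = 1) -> int:
    """Upper bound on the language tag for a given charset name and
    encoded-text length (both default to the minimums used above)."""
    overhead = (LITERALS + LANG_DELIMITER + len(charset)
                + MIN_ENCODING + encoded_text_len)
    return MAX_ENCODED_WORD - overhead


print(max_language_tag_length())              # 64, the bound derived above
print(max_language_tag_length("ISO-8859-1"))  # 56, with a longer charset name
```

With anything longer than the minimal charset name or a single character
of encoded text, the room left for the language tag shrinks accordingly.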

This is a constraint on an application of RFC 3066; it is not a
constraint on RFC 3066 itself. It is possible that other applications of
RFC 3066 may impose limits that may be longer or shorter than that
imposed by RFC 2047/2231. I see no reason why limits must be added as a
constraint in a revision of RFC 3066. It would be a good idea, however,
to point out in section 2.1 of the draft that some applications of this
specification may impose limits on the length of accepted language tags,
and perhaps to cite RFC 2231 as an example.

My suggestions, then, in response to Bruce Lilly's comments are:

- that we add a note prominently in section 2.1 of the draft explaining
that some applications may impose limits on the lengths of language
tags, and cite RFC 2231 as an example

- that we revise the ABNF for the 'grandfathered' production rule to

	grandfathered = 1*3ALPHA *("-" 1*8alphanum)

- that we add a note in the discussion of extensions stating that, when
a language tag instance is to be used in a specific, known protocol, it
is advisable that the language tag not include extensions not supported
by that protocol (text can be added pointing out the inadvisability of
including unrecognized extensions in the case of protocols that impose
upper limits on the length of strings that may contain a language tag)

- that recommendation 4 in section 2.4.2 be changed to say that
extensions should not be removed except in the case that the language
tag instance is to be inserted into a specific protocol known not to
support the extension

- that the language subtag registration form include an additional field
following #7 (recommended prefixes for variants) asking for a reasonable
estimate and exemplar of the maximum length anticipated for language
tags using the requested variant

- that a requirement on extension RFCs be added in section 3.4 stating
that they must include some explicit discussion of concerns related to
upper bounds on length of language tags using the given extension

- that we do not attempt any other changes to the ABNF to impose an
upper bound on the length of language tags

- that we add a note in section 3.1 indicating that descriptions in
registry entries for ISO 639, ISO 3166 or ISO 15924 identifiers are
intended only to indicate the meaning of that identifier as defined in
the source ISO standard at the time it was added to the registry, and
that the descriptions are not replacements for content of the source
standards themselves

- that we do not need to change the proposed format of the registry to
include descriptions in multiple languages
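
As a rough illustration of the 'grandfathered' production suggested in
the list above, here is a minimal sketch that renders it as a regular
expression. It assumes the hyphen that language tags use between subtags
as the separator, and the function name is hypothetical. It checks the
shape of a tag only, not whether the tag is actually registered.

```python
import re

# Regex rendering of: 1*3ALPHA *("-" 1*8alphanum)
# Assumption: "-" is the subtag separator, as in RFC 3066 tags.
GRANDFATHERED = re.compile(r"[A-Za-z]{1,3}(?:-[A-Za-z0-9]{1,8})*")


def is_grandfathered_shape(tag: str) -> bool:
    """Return True if the tag matches the proposed production (syntax only)."""
    return GRANDFATHERED.fullmatch(tag) is not None


print(is_grandfathered_shape("i-enochian"))    # True
print(is_grandfathered_shape("zh-min-nan"))    # True
print(is_grandfathered_shape("en-abcdefghi"))  # False: subtag exceeds 8 characters
```

Note that the production by itself places no bound on the number of
subtags, so it does not cap the overall length of a tag.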

Peter Constable
Microsoft Corporation