New Last Call: 'Tags for Identifying Languages' to BCP

Sun Dec 12 20:27:12 CET 2004

>  Date: 2004-12-11 11:53
>  From: "Peter Constable" <petercon at microsoft.com>
>  To: ietf-languages at alvestrand.no, ietf at ietf.org

> Our disagreement amounts to a basic question of whether parsers should be written based on the ABNF alone, or based on the ABNF plus other constraints provided in the spec. Clearly, I think anyone writing a parser should consider other constraints as well.

No, I agree that a parser should take normative text
into account, but I feel that there should be a
reasonable effort made to make the ABNF agree with
that normative text -- otherwise there's little
point in providing ABNF.

> As mentioned, the limit is imposed by other tight constraints on 'grandfathered'; you have already identified that the longest registered tag under RFC 3066 is 11 octets in length, therefore a 'grandfathered' tag can be at most 11 octets in length.

But the constraints probably aren't as tight as you
believe; the draft specifically permits a future
revision to allow a primary subtag longer than
8 octets, or not purely alphabetic, etc.

> a de-facto upper limit of 11 (subject to change if new tags are registered before the proposed spec is accepted).

We're agreed on that, for the present draft, but
apparently Mark Davis disagrees.  And I am concerned
about the loophole left for future revisions.

> > > We could impose some upper limits on these things...
> 
> > That leaves the extension portions' length at up to
> > 25 * (1 + 1 + 8 * 9) = 1850 octets, not taking any other parts
> > of a tag into account!   That's way too long (the RFC 2047
> > limit for an encoded-word is 75 octets, including charset tag,
> > some text, and some syntactic glue in addition to the language
> > tag).
> 
> The problem already exists in RFC 3066. Even apart from private-use tags, tomorrow someone could request a registration for a tag that's 87 octets long, and there's nothing in RFC 3066 that would prohibit acceptance.

One would hope that, under RFC 3066 rules, the
reviewer, a list subscriber, or an Applications Area
Director would recognize the conflict with RFCs 2047/2231
and would object.  If indeed that were to happen
literally tomorrow, I am quite sure that an objection
would be made.  The situation is quite different
under the draft proposal, where registration of a
complete tag is not required, and where there are
no upper bounds on length of a tag.
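
For reference, the 1850-octet figure quoted above can be
reproduced with a short sketch.  The breakdown is my reading
of the quoted expression, not something the draft states in
these terms: 25 extension singletons, each contributing a
hyphen, the singleton itself, and 8 subtags of up to 8 octets
with one hyphen apiece.  All names below are illustrative.

```python
# Worst-case length of the extension portion of a tag, following the
# expression 25 * (1 + 1 + 8 * 9).  The interpretation of each factor
# is an assumption made for illustration.
SINGLETONS = 25             # assumed number of distinct extension singletons
SUBTAGS_PER_EXTENSION = 8   # assumed subtags following each singleton
SUBTAG_MAX = 8              # assumed maximum octets per subtag
HYPHEN = 1                  # separator preceding each subtag


def extension_octets(singletons=SINGLETONS,
                     subtags=SUBTAGS_PER_EXTENSION,
                     subtag_max=SUBTAG_MAX):
    """Return the worst-case octet count for all extensions combined."""
    # hyphen + singleton letter + subtags, each with its own hyphen
    per_extension = HYPHEN + 1 + subtags * (HYPHEN + subtag_max)
    return singletons * per_extension


ENCODED_WORD_MAX = 75       # RFC 2047 limit for a whole encoded-word

if __name__ == "__main__":
    total = extension_octets()
    print(total, "octets of extensions vs.", ENCODED_WORD_MAX,
          "octets for an entire encoded-word")
```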

> > > So, I think Bruce has identified a valid issue here. I personally would
> > > not have characterized it as greatly exacerbating, though,
> > 
> > IMO, an increase from 11 octets worst-case, which is tolerable
> > for constructing RFC 2047/2231 encoded-words, to >> 1850
> > octets, which exceeds by a large margin what can be handled
> > in a Content-Language or Accept-Language message header
> > field, constitutes "greatly exacerbated".
> 
> Repeating my previous point, RFC 3066 doesn't stop a registered tag from being 10^100 octets in length.

RFC 3066 provides a registration mechanism that can be
trusted to prevent that; in particular, the Applications
Area Directors are supposed to look out for issues affecting
the core Internet applications protocols.

> I suggest that wording be added to the draft giving a strong recommendation to users that they not use tags the complete length of which exceeds 75 characters.

75 octets would be too large for a language-tag used in
an encoded word (perhaps different limits could be
specified for different uses, but one would have to be
careful about implicit re-use between applications). An
encoded-word has the form:
  =?<charset>*<language-tag>?<encoding>?<text>?=
and is limited to a total of 75 octets. Eliminating the
syntactic glue (7 octets, unbracketed above) leaves a
total of at most 68 octets for text, charset, encoding,
and language-tag.  There are at present two encodings,
specified with 1-octet tags.  Assuming that longer
encoding tags are not required, that leaves 67 octets
for charset, language-tag, and text.  The text must be
at least four octets in order to accommodate B encoded
text, leaving 63 octets at most for charset and
language-tag (ideally, one would prefer to leave more
room than that for text).  It is guaranteed (in theory,
if not in practice) that there will be a charset name
of no more than 40 octets for each charset, but that is
not necessarily the preferred name (there has been some
discussion about possibly reducing that limit). That
leaves about 23 octets for a language-tag as an upper
bound for use in an encoded-word.  Obviously that
hasn't been a problem in practice to date; the longest
registered language tag is less than half that length.
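
The arithmetic above can be checked with a short sketch.
The constants are the figures given in the preceding
paragraph; the function and variable names are illustrative
only and come from no specification.

```python
# Octet budget for a language tag inside an encoded-word of the form
#   =?<charset>*<language-tag>?<encoding>?<text>?=
ENCODED_WORD_MAX = 75    # total limit for an encoded-word
GLUE = len("=?") + len("*") + len("?") + len("?") + len("?=")  # 7 octets
ENCODING_TAG = 1         # the two current encodings use 1-octet tags
MIN_TEXT = 4             # minimum needed to accommodate B-encoded text
CHARSET_NAME_MAX = 40    # a charset name this short is guaranteed (in theory)


def language_tag_budget(charset_len=CHARSET_NAME_MAX, text_len=MIN_TEXT):
    """Octets left for the language tag after everything else is counted."""
    return ENCODED_WORD_MAX - GLUE - ENCODING_TAG - text_len - charset_len


def encoded_word_length(charset, language_tag, encoding, text):
    """Total length of an encoded-word assembled from its parts."""
    return len(f"=?{charset}*{language_tag}?{encoding}?{text}?=")


if __name__ == "__main__":
    print("upper bound for language tag:", language_tag_budget(), "octets")
```

With a shorter charset name the budget grows accordingly
(for example, language_tag_budget(charset_len=5) for "utf-8"),
but a general limit has to assume the 40-octet case.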

> > By deferring to the bilingual ISO lists for language and country
> > tags, 3066 at least provided a minimal degree of internationalization.
> > By explicitly limiting description fields to English and restricting
> > the charset to US-ASCII, the draft proposal takes a giant leap
> > backwards.
> 
> The US-ASCII limitation existed in RFC 3066, so is not new. 

No, I'm talking about the character set of the description,
which currently resides in the ISO lists, and is certainly
not limited to ANSI X3.4 in those lists.  Under the draft
proposal, the description is limited to ANSI X3.4, which
is a problem for the description for UN region 248, whose
description includes an A-ring character, which is not in
X3.4.  I note that BCP 18 section 3.1 specifies that it MUST
be possible to use the UTF-8 charset, so the specification
of the registry as solely X3.4 appears to violate that
provision of BCP 18.

> On the more general point, I believe you are mistaking i18n concerns with localization concerns:

No, I am concerned about changing what is currently
internationalized (to an admittedly small extent) into
something that is strictly monolingual in a severely
restricted charset.

> > > I don't quite understand what the critique is here: what is there to
> > > internationalize about language tags?
> > 
> > There should probably be a reference (at least informative)
> > pointing to BCP 18 and mentioning that the language tags
> > defined provide a means of labeling the language of text,
> 
> Have you not read the abstract in the draft?
[...]
> Or the introduction?

I have; neither mentions BCP 18 or the core Internet
protocols.

> > The draft (if/when approved) should also indicate that
> > it updates BCP 18, which refers to RFC 1766.
> 
> Is this right? This draft is not a replacement for RFC 2277, or an addendum to it. RFC 2277 also refers to RFC 1958, which was updated by RFC 3439, but surely RFC 3439 doesn't state that it updates BCP 18? (RFC 2277 does have a section with significant overlap in topic, though, so perhaps this makes sense. I'm not well-enough versed in IETF document process to know.)

N.B. "update" != "replacement".  If the draft obsoletes
3066, which obsoletes 1766, then it affects 2277.  3066
should probably have so indicated also...

> > Given the divergence noted above from RFC 3066's use
> > of multilingual reference lists, the Internationalization
> > considerations section should include a synopsis of the
> > approach chosen (viz. to restrict description to English) and
> > the rationale for that choice (see BCP 18 section 6).
> 
> Again, this is a localization issue, not an internationalization issue. I do not consider this necessary or even appropriate.

No, it's relevant to the extent that BCP 18 specifies
that text strings are subject to internationalization,
and the description field in the draft-proposed registry
protocol certainly appears to be a text string (although
the draft does not clearly state whether it is a text
string or a protocol element).

> > > >     implications (ISO 8601 date format parsing).
> > >
> > > As mentioned above, this really is a non-issue.
> > 
> > It's an issue (esp. in light of the finger pointing regarding
> > accessibility to ISO 639/3166).
> 
> As has been pointed out, there is no such finger-pointing in the draft.

The finger-pointing accompanied the new last call and was
used as justification for replacement of RFC 3066 with the
proposed scheme.  If indeed accessibility is a non-issue,
then the justification for the proposed scheme, in whole
or in part, rests solely on other considerations, such as
they might be.

> > Again, it is an issue that imposes requirements on language
> > tag parsers.  What you've shown is that the ABNF is not
> > consistent with what was desired to be expressed, and
> > that makes it an issue that needs to be addressed.
> 
> Again, I believe the bigger issue is not getting the ABNF to express what was desired

No, it is a concern because of a loophole left for future
revisions to incorporate syntax which is not currently
permitted by 3066, but which is inexplicably permitted by
the draft's proposed ABNF.

> > > The maximal length issue exists just as much
> > > in RFC 3066 due to private-use tags; it is a technical concern that
> > > might be worth reviewing in RFC 3066bis, however; but it is not
> > > insurmountable, and not a new problem.
> > 
> > Private-use carries its own considerable baggage; aside from
> > that, the draft proposal increases the length of non-private
> > tags that affect both protocol design and implementations
> > from a worst case maximum of 11 octets under RFC 3066...
> 
> Worst case at present; a month from now it could be unlimitedly larger.

No, the current registration review process, in conjunction
with the requirement that complete tags be registered, would
prevent that.  The draft proposal
decouples use from registration, which directly leads to
unlimited length, and changes the review process (in unclear
ways which I have not yet had time to review fully).

> But I've accepted that it would be an improvement to add constraints on overall length. 

That's a start.