New Last Call: 'Tags for Identifying Languages' to BCP

Sat Dec 11 17:53:26 CET 2004

> From: ietf-languages-bounces at alvestrand.no
> [mailto:ietf-languages-bounces at alvestrand.no] On Behalf Of Bruce Lilly

> > The "grandfathered" production in the current draft is
> >
> > grandfathered   = ALPHA *(alphanum / "-")
> >
> > which does permit the sequences claimed by Bruce (except for
> > not-purely-alphabetic primary sub-tags),
> 
> No exception.  "alphanum" is ALPHA / DIGIT.

My mistake; again, I had in mind constraints beyond the ABNF.

> > syntactically; but the set of
> > tags available for use is constrained by more than the ABNF syntax
> > alone: the acceptable productions for each sub-tag must either be taken
> > from one of the source standards or be registered.
> 
> So what? The ABNF is an expression of the grammar that
> describes the set of all valid tags.

It is *part* of the expression of the grammar. Even in RFC 3066 this is the case: you know that t-abc is not valid under RFC 3066, but not because that is constrained by the ABNF of RFC 3066.

I will accept that the ABNF of the draft should be changed to better reflect what form grandfathered productions can take. As I stated in my previous message, that would be the equivalent of the ABNF of RFC 3066:

grandfathered = 1*8ALPHA *("-" 1*8alphanum)

I think that's an improvement, though technically I don't think it changes anything.
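
To make the difference between the two productions concrete, here is a minimal sketch in Python that renders each one as a regular expression. The names DRAFT_GRANDFATHERED, RFC3066_STYLE and matches are made up for illustration; neither the draft nor RFC 3066 supplies a regex, so treat the translation as an assumption.

```python
import re

# Draft production:    grandfathered = ALPHA *(alphanum / "-")
DRAFT_GRANDFATHERED = re.compile(r"[A-Za-z][A-Za-z0-9-]*\Z")

# Proposed production: grandfathered = 1*8ALPHA *("-" 1*8alphanum)
RFC3066_STYLE = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*\Z")


def matches(pattern, tag):
    """Return True if the whole tag matches the given production."""
    return pattern.match(tag) is not None


# Strings the looser production accepts but the tighter one rejects.
for candidate in ("en--", "abcdefghij", "a1"):
    print(candidate, matches(DRAFT_GRANDFATHERED, candidate),
          matches(RFC3066_STYLE, candidate))
```

The looser pattern accepts doubled hyphens, over-long subtags and digits in the primary subtag; the tighter one rejects them. Either way, whether a given string is actually a grandfathered tag still depends on registration, which no regex captures.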

> If
> one doesn't intend to impose such requirements, the
> ABNF specifying the grammar should be changed
> accordingly.
> 
> > This is no different
> > from RFC 3066, so it is no more of a problem in this specification than
> > it was in RFC 3066.
> 
> It is a very different grammar from RFC 3066, imposing
> very different requirements on parsers.

Our disagreement amounts to a basic question of whether parsers should be written based on the ABNF alone, or based on the ABNF plus other constraints provided in the spec. Clearly, I think anyone writing a parser should consider other constraints as well.
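
The t-abc case mentioned earlier illustrates the point. The following is a rough sketch, assuming a two-stage check: the RFC 3066 ABNF first, then one of the prose rules of RFC 3066 (single-letter primary subtags are reserved, with only "i" and "x" assigned). The function names are hypothetical, and a real validator would also need ISO 639 and registry lookups, which are omitted here.

```python
import re

# RFC 3066 ABNF: Language-Tag = Primary-subtag *( "-" Subtag )
RFC3066_SYNTAX = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*\Z")


def is_well_formed(tag):
    """Stage 1: does the tag satisfy the ABNF alone?"""
    return RFC3066_SYNTAX.match(tag) is not None


def passes_prose_rules(tag):
    """Stage 2 (partial): single-letter primary subtags other than
    'i' and 'x' are not assigned. Other prose rules are not modeled."""
    primary = tag.split("-")[0].lower()
    if len(primary) == 1:
        return primary in ("i", "x")
    return True


def is_acceptable(tag):
    return is_well_formed(tag) and passes_prose_rules(tag)


print(is_well_formed("t-abc"), is_acceptable("t-abc"))
```

A parser built from stage 1 alone would accept t-abc; one that also applies the prose rule would not.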

> > > In particular, tags other than private-use tags with more than
> > > two subtags require registration under RFC 3066 rules, and it
> > > is a trivial matter to determine the longest registered tag.
> > > The draft, however, encourages use of more subtags as well as
> > > removal of the subtag length upper bound; moreover, it permits
> > > infinite numbers of subtags without requiring registration of
> > > the resulting complete tag.
> >
> > Bruce states incorrectly that there is no upper bound on the length of
> > sub-tags.
> 
> Look again at the draft definition of "grandfathered" -- now
> show me where there's a limit in that production on subtag
> length.

As mentioned, the limit is imposed by other tight constraints on 'grandfathered'; you have already identified that the longest registered tag under RFC 3066 is 11 octets in length, therefore a 'grandfathered' tag can be at most 11 octets in length.
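
As a sketch of why registration bounds the length: if a grandfathered tag is valid only when it appears in a finite list of registered tags, then the longest entry in that list is the upper limit, whatever the ABNF permits. The sample set below is an illustrative subset chosen for this example, not the full registry, and the function names are hypothetical.

```python
# Illustrative subset only; the real list is the set of tags
# registered under RFC 1766 / RFC 3066.
REGISTERED_SAMPLE = {
    "i-klingon",
    "zh-min-nan",
    "cel-gaulish",
    "en-GB-oed",
    "art-lojban",
}


def is_grandfathered(tag, registry=REGISTERED_SAMPLE):
    """Membership test; tags are compared case-insensitively."""
    return tag.lower() in {entry.lower() for entry in registry}


def longest_registered(registry=REGISTERED_SAMPLE):
    """Length in octets of the longest registered tag (tags are ASCII)."""
    return max(len(entry) for entry in registry)


print(longest_registered())
```

With this sample the longest entry is cel-gaulish, which is consistent with the 11-octet figure above.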

> > There are three open doors for infinite-length productions in the ABNF
> > of the current draft:
> >
> > - unlimited extlang sub-tags
> > - unlimited variant sub-tags
> > - the number of possible extensions is limited to 25
...
> > , but the length of
> > extensions is unlimited
> 
> You have missed several others:
> 
> 1. "privateuse" length is unlimited (either tacked on
>     after "lang" etc., or directly as an alternative in
>     "Language-Tag")

I disregarded this since it is identical to the case for RFC 3066, and you were, after all, charging that the draft creates problems that were worse than for RFC 3066.

> 2. "grandfathered", which as already discussed
>     permits unlimited length.

But, as already stated, it is very tightly constrained, with a de facto upper limit of 11 octets (subject to change if new tags are registered before the proposed spec is accepted).

> > We could impose some upper limits on these things...

> That leaves the extension portions' length at up to
> 25 * (1 + 1 + 8 * 9) = 1850 octets, not taking any other parts
> of a tag into account!   That's way too long (the RFC 2047
> limit for an encoded-word is 75 octets, including charset tag,
> some text, and some syntactic glue in addition to the language
> tag).

The problem already exists in RFC 3066. Even apart from private-use tags, tomorrow someone could request a registration for a tag that's 87 octets long, and there's nothing in RFC 3066 that would prohibit acceptance.
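
The 1850-octet figure quoted above can be reproduced directly. The meaning given to each factor in the comments below is an assumed reading of the quoted expression 25 * (1 + 1 + 8 * 9); the quoted message does not spell it out, and the constant names are invented for this sketch.

```python
MAX_EXTENSIONS = 25          # from the quoted ABNF discussion
SEPARATOR = 1                # assumed: the hyphen before each extension
SINGLETON = 1                # assumed: the one-character extension prefix
SUBTAGS_PER_EXTENSION = 8    # assumed: proposed cap on subtags
OCTETS_PER_SUBTAG = 9        # assumed: hyphen plus up to 8 characters

RFC2047_ENCODED_WORD_LIMIT = 75  # limit cited in the quoted message


def max_extension_octets():
    """Worst-case length of the extension portion under these caps."""
    per_extension = SEPARATOR + SINGLETON + SUBTAGS_PER_EXTENSION * OCTETS_PER_SUBTAG
    return MAX_EXTENSIONS * per_extension


print(max_extension_octets(), max_extension_octets() > RFC2047_ENCODED_WORD_LIMIT)
```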

> > So, I think Bruce has identified a valid issue here. I personally would
> > not have characterized it as greatly exacerbating, though,
> 
> IMO, an increase from 11 octets worst-case, which is tolerable
> for constructing RFC 2047/2231 encoded-words, to >> 1850
> octets, which exceeds by a large margin what can be handled
> in a Content-Language or Accept-Language message header
> field, constitutes "greatly exacerbated".

Repeating my previous point, RFC 3066 doesn't stop a registered tag from being 10^100 octets in length. Of course, all of us know that such a tag wouldn't be useful. At some point, we have to engage common sense, even for RFC 3066. The draft would allow a tag 

en-boont-boont-boont-boont-boont-boont-boont-boont-boont-boont-boont-boont-boont

(over 75 octets), but common sense tells us it doesn't make sense (and that anyone who uses such a thing deserves whatever they get). 

Now, we could try to revise the ABNF to constrain for such things, just as the ABNF of RFC 3066 could have been constrained further. It's not easy to express common-sense constraints in ABNF, however.

I suggest that wording be added to the draft giving a strong recommendation to users that they not use tags whose complete length exceeds 75 characters.
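
A minimal sketch of how an implementation might apply such a recommendation, assuming the limit is counted in octets of the ASCII tag. The names RECOMMENDED_MAX and within_recommended_length are hypothetical, and the long tag is built to have the same shape as the en-boont example above.

```python
RECOMMENDED_MAX = 75  # the suggested limit; not part of any ABNF


def within_recommended_length(tag, limit=RECOMMENDED_MAX):
    """True if the complete tag is no longer than the limit.

    Encoding as ASCII raises UnicodeEncodeError for non-ASCII input,
    which a language tag should never contain.
    """
    return len(tag.encode("ascii")) <= limit


# "en" followed by thirteen "-boont" subtags.
long_tag = "en" + "-boont" * 13

print(len(long_tag), within_recommended_length(long_tag))
print(within_recommended_length("en-boont"))
```

The check is deliberately separate from syntax validation: a tag can be well-formed and still be one that users are advised not to emit.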

> > > I am absolutely shocked that a draft dealing with language
> > > lacks an "Internationalization considerations" section as
> > > recommended by RFC 2277 (a.k.a. BCP 18).
> >
> > No more or less shocking than for RFC 3066, regarding which I'm not
> > aware of any complaints.
> 
> By deferring to the bilingual ISO lists for language and country
> tags, 3066 at least provided a minimal degree of internationalization.
> By explicitly limiting description fields to English and restricting
> the charset to US-ASCII, the draft proposal takes a giant leap
> backwards.

The US-ASCII limitation existed in RFC 3066, so is not new. 

On the more general point, I believe you are confusing i18n concerns with localization concerns: you are looking for strings to be used in UI for different local markets. Apart from charset, RFC 1766, RFC 3066 and RFC 3066bis do not have *internationalization* concerns.

> > I don't quite understand what the critique is here: what is there to
> > internationalize about language tags?
> 
> There should probably be a reference (at least informative)
> pointing to BCP 18 and mentioning that the language tags
> defined provide a means of labeling the language of text,

Have you not read the abstract in the draft?

<quote>
   This document describes the structure, content, construction, and
   semantics of language tags for use in cases where it is desirable to
   indicate the language used in an information object.
</quote>

Or the introduction?
<quote>
   One means of indicating the language used is by labeling the
   information content with a language identifier...

   This document specifies an identifier mechanism...
</quote>

How much clearer does it need to be?

> The draft (if/when approved) should also indicate that
> it updates BCP 18, which refers to RFC 1766.

Is this right? This draft is not a replacement for RFC 2277, or an addendum to it. RFC 2277 also refers to RFC 1958, which was updated by RFC 3439, but surely RFC 3439 doesn't state that it updates BCP 18? (RFC 2277 does have a section with significant overlap in topic, though, so perhaps this makes sense. I'm not well enough versed in IETF document process to know.)

> Given the divergence noted above from RFC 3066's use
> of multilingual reference lists, the Internationalization
> considerations section should include a synopsis of the
> approach chosen (viz. to restrict description to English) and
> the rationale for that choice (see BCP 18 section 6).

Again, this is a localization issue, not an internationalization issue. I do not consider this necessary or even appropriate.

> > It's
> > true that ALPHA and DIGIT are not defined
> 
> Non-sequitur aside, those terms are defined in RFC 2234.

Of course I meant "not defined *within this document*".

> > >     implications (ISO 8601 date format parsing).
> >
> > As mentioned above, this really is a non-issue.
> 
> It's an issue (esp. in light of the finger pointing regarding
> accessibility to ISO 639/3166).

As has been pointed out, there is no such finger-pointing in the draft.

> Admittedly it can be
> resolved without much difficulty (but then that could
> have been done earlier, couldn't it?).

I think the authors and those of us who have been reviewing thought that the intent was quite clearly YYYY-MM-DD, so didn't see a concern. That's why last calls are announced to a much wider audience.
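
For what it is worth, a parser that takes the intent to be YYYY-MM-DD is short. This is only a sketch under that assumption; the function name is hypothetical, and the explicit pattern check is there because Python's strptime on its own also accepts unpadded months and days.

```python
import re
from datetime import date, datetime

# Exactly four digits, two digits, two digits, separated by hyphens.
YYYY_MM_DD = re.compile(r"\d{4}-\d{2}-\d{2}")


def parse_yyyy_mm_dd(text):
    """Parse a date written as YYYY-MM-DD; raise ValueError otherwise."""
    if YYYY_MM_DD.fullmatch(text) is None:
        raise ValueError("expected YYYY-MM-DD, got %r" % text)
    # strptime rejects impossible dates such as month 13.
    return datetime.strptime(text, "%Y-%m-%d").date()


print(parse_yyyy_mm_dd("2004-12-11"))
```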

> > > 2. the clear contradiction between the claims about
> > >     ABNF compatibility with RFC 3066 and the factual
> > >     incompatibility of certain provisions in the grammar.
> >
> > The main concern was with the "grandfathered" production, but I've shown
> > that that is a non-issue.
> 
> Again, it is an issue that imposes requirements on language
> tag parsers.  What you've shown is that the ABNF is not
> consistent with what was desired to be expressed, and
> that makes it an issue that needs to be addressed.

Again, I believe the bigger issue is not getting the ABNF to express what was desired, but rather whether parsers are written to consider only the ABNF or the ABNF plus other specified constraints as well.

> > The maximal length issue exists just as much
> > in RFC 3066 due to private-use tags; it is a technical concern that
> > might worth reviewing in RFC 3066bis, however; but it is not
> > insurmountable, and not a new problem.
> 
> Private-use carries its own considerable baggage; aside from
> that, the draft proposal increases the length of non-private
> tags that affect both protocol design and implementations
> from a worst case maximum of 11 octets under RFC 3066...

Worst case at present; a month from now it could be arbitrarily larger. But I've accepted that it would be an improvement to add constraints on overall length.

Peter Constable
Microsoft Corporation