New Last Call: 'Tags for Identifying Languages' to BCP

Sat Dec 11 05:39:18 CET 2004

> RE: New Last Call: 'Tags for Identifying Languages' to BCP
>  Date: 2004-12-10 20:03
>  From: "Peter Constable" <petercon at microsoft.com>
>  To: ietf at ietf.org
>  CC: ietf-languages at alvestrand.no
>  
> Resuming my comments:

> > Specifically, the draft allows, and RFC 3066 disallows:
> >    subtags more than 8 octets in length
> >    hyphens which do not separate subtags
> >    zero-length subtags
> >    primary tags which are not purely alphabetic
> > Curiously, all of those are permitted by the draft ABNF
> > production "grandfathered"...
> 
> The "grandfathered" production in the current draft is 
> 
> grandfathered   = ALPHA *(alphanum / "-")
> 
> which does permit the sequences claimed by Bruce (except for
> not-purely-alphabetic primary sub-tags),

No exception.  "alphanum" is ALPHA / DIGIT.  In plain
English, "grandfathered" as defined in the draft is a letter
followed by any number of letters, digits, and/or hyphens, in
any order.  And that includes "a123-xyz" as I initially stated,
and clearly 1, 2, and 3 are digits.

> syntactically; but the set of 
> tags available for use is constrained by more than the ABNF syntax
> alone: the acceptable productions for each sub-tag must either be taken
> from one of the source standards or be registered.

So what? The ABNF is an expression of the grammar that
describes the set of all valid tags.  If the grammar permits
"y-----", "a123-xyz", etc. (and it does) then a parser
claiming to parse language tags as defined by that ABNF
must be able to parse such tags.  That is, the ABNF-
specified grammar imposes requirements on parsers.  If
one doesn't intend to impose such requirements, the
ABNF specifying the grammar should be changed
accordingly.

> This is no different 
> from RFC 3066, so it is no more of a problem in this specification than
> it was in RFC 3066.

It is a very different grammar from RFC 3066, imposing
very different requirements on parsers.
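
To make the difference concrete, here is a minimal sketch in Python
that transcribes both productions into regular expressions.  The
patterns and the helper name accepts() are illustrative only, not
text from either document; the RFC 3066 pattern assumes its
Language-Tag ABNF (a 1*8ALPHA primary subtag followed by "-"
separated 1*8 alphanumeric subtags).

```python
import re

# Draft production: grandfathered = ALPHA *(alphanum / "-")
DRAFT_GRANDFATHERED = re.compile(r"[A-Za-z][A-Za-z0-9-]*")

# RFC 3066: Language-Tag = 1*8ALPHA *("-" 1*8(ALPHA / DIGIT))
RFC3066_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def accepts(pattern, tag):
    """Return True if the whole tag matches the given production."""
    return pattern.fullmatch(tag) is not None


# One registered tag plus one sample for each kind of divergence:
# digits in the primary subtag, zero-length subtags, stray hyphens,
# and a subtag longer than 8 octets.
samples = ["cel-gaulish", "a123-xyz", "y-----", "x--y", "abcdefghijklmnop"]
for tag in samples:
    print(f"{tag!r:20} draft={accepts(DRAFT_GRANDFATHERED, tag)!s:5} "
          f"rfc3066={accepts(RFC3066_TAG, tag)}")
```

Under these assumptions only "cel-gaulish" is accepted by both
patterns; the other four samples pass the draft production and fail
the RFC 3066 one.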

> It might be that the wording in 2.2 could be tightened up to eliminate
> any possible question regarding the source for "grandfathered"
> productions.

It's not a matter of wording; the problem is with the ABNF.

> Alternately, there's no reason why the "grandfathered" production
> shouldn't be composed exactly to match what was used in RFC 3066:
> 
> grandfathered = 1*8ALPHA *("-" 1*8alphanum)

I believe I said as much (though one then needs to look
at reduce/reduce conflicts implied by the revised grammar):

> > I see no reason for the ABNF to permit such content as is
> > forbidden by RFC 3066; the actual ABNF for what RFC 3066
> > permits is contained within 3066, and could have been directly
> > incorporated rather than producing a "grandfathered"
> > production which opens up several cans of worms.
> 
> This vastly overstates the problem. There is no can of worms unless it
> exists in tags currently available under RFC 3066.

I referred to the additional requirements imposed on
parsers, as well as the unlimited tag length permitted.

> > One defect related to tag length in RFC 3066 is not remedied
> > by the draft; indeed the problem is greatly exacerbated...
> 
> > Unfortunately, a language-tag's length is unlimited by
> > the ABNF in RFC 3066 (due to an unlimited number of subtags)
> > and in the draft...
> 
> > In particular, tags other than private-use tags with more than
> > two subtags require registration under RFC 3066 rules, and it
> > is a trivial matter to determine the longest registered tag.
> > The draft, however, encourages use of more subtags as well as
> > removal of the subtag length upper bound; moreover, it permits
> > infinite numbers of subtags without requiring registration of
> > the resulting complete tag.
> 
> Bruce states incorrectly that there is no upper bound on the length of
> sub-tags.

Look again at the draft definition of "grandfathered" -- now
show me where there's a limit in that production on subtag
length.

> His other concern, on the overall length of complete tags, is 
> valid, however: in terms of the ABNF syntax for both RFC 3066 and RFC
> 3066bis, infinite-length productions are possible, but RFC 3066 would
> require registration of complete non-private-use tags while RFC 3066bis
> does not.

Yes, and a quick look at the registry reveals that the longest
tag is 11 octets ("cel-gaulish").

> There are three open doors for infinite-length productions in the ABNF
> of the current draft:
> 
> - unlimited extlang sub-tags
> - unlimited variant sub-tags
> - the number of possible extensions is limited to 25

The ABNF indicates no such limit.

> , but the length of 
> extensions is unlimited

You have missed several others:

1. "privateuse" length is unlimited (either tacked on
    after "lang" etc., or directly as an alternative in
    "Language-Tag")

2. "grandfathered", which as already discussed
    permits unlimited length.

> 
> We could impose some upper limits on these things; e.g.
> 
> Language-Tag = ... *8("-" extlang) ... *8("-" variant) ... 1*25("-"
> extension)

I think you mean *25("-" extension), not 1*25...
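
The distinction matters because ABNF 1*25 means "at least one, at
most 25", which would make an extension mandatory in every tag.  A
rough Python sketch of the difference, using a deliberately
simplified tag (a 2*3ALPHA language followed only by extensions) and
assuming the singleton is any lowercase letter other than "x"; the
names LANG and EXT are hypothetical:

```python
import re

LANG = r"[A-Za-z]{2,3}"                       # simplified primary language
EXT = r"-[a-wy-z](?:-[A-Za-z0-9]{2,8}){1,8}"  # "-" singleton 1*8("-" 2*8alphanum)

# 1*25("-" extension): at least one extension is required
at_least_one = re.compile(LANG + f"(?:{EXT}){{1,25}}")
# *25("-" extension): zero to 25 extensions
optional = re.compile(LANG + f"(?:{EXT}){{0,25}}")

for tag in ["en", "en-a-abc"]:
    print(tag,
          "1*25:", at_least_one.fullmatch(tag) is not None,
          "*25:", optional.fullmatch(tag) is not None)
```

With the 1*25 form a plain "en" is rejected, which is presumably not
what was intended.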

> extension = singleton 1*8("-" 2*8alphanum)

That leaves the extension portions' length at up to
25 * (1 + 1 + 8 * 9) = 1850 octets, not taking any other parts
of a tag into account!   That's way too long (the RFC 2047
limit for an encoded-word is 75 octets, including charset tag,
some text, and some syntactic glue in addition to the language
tag).  Heck, 1850 octets won't even fit into a maximum length
RFC [2]821/[2]822 message line (998 octets).
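
The arithmetic can be checked with a few lines of Python.  This is
only a sketch of the calculation above; the constant names are
illustrative, and the limits are the 75-octet encoded-word and
998-octet line figures already cited.

```python
MAX_EXTENSIONS = 25      # proposed cap on extensions per tag
SEPARATOR = 1            # the "-" in front of each piece
SINGLETON = 1            # one-character extension identifier
SUBTAGS_PER_EXTENSION = 8
MAX_SUBTAG_LEN = 8

# "-" singleton, then up to 8 repetitions of ("-" + 8 alphanumerics)
per_extension = SEPARATOR + SINGLETON + SUBTAGS_PER_EXTENSION * (SEPARATOR + MAX_SUBTAG_LEN)
total = MAX_EXTENSIONS * per_extension

ENCODED_WORD_LIMIT = 75  # RFC 2047 encoded-word
LINE_LIMIT = 998         # RFC 2821/2822 message line

print(per_extension, total)                             # 74 1850
print(total > ENCODED_WORD_LIMIT, total > LINE_LIMIT)   # True True
```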

> If we also imposed limits on the length of private-use tags and defined
> the grandfathered production in a way that made clear there was an upper
> limit for those, then we could end up eliminating an issue that had
> existed in RFC 3066.

Perhaps; but you have a long way to go to get from 1850+
down to <64 octets.  Even farther to get to something
as reasonable as the current worst-case of 11 octets.

> So, I think Bruce has identified a valid issue here. I personally would
> not have characterized it as greatly exacerbating, though,

IMO, an increase from 11 octets worst-case, which is tolerable
for constructing RFC 2047/2231 encoded-words, to >> 1850
octets, which exceeds by a large margin what can be handled
in a Content-Language or Accept-Language message header
field, constitutes "greatly exacerbated".  YMMV. [N.B. that
">>1850" takes into account your proposed restrictions which
are not present in the draft]
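
To see why the tag length matters for encoded-words, here is a
hedged sketch of the octet budget, assuming the RFC 2231 form
"=?charset*language?encoding?encoded-text?=" with a one-character
encoding and the RFC 2047 limit of 75 octets.  The function name
encoded_text_budget() and the charset chosen are illustrative.

```python
ENCODED_WORD_LIMIT = 75  # RFC 2047 maximum length of one encoded-word


def encoded_text_budget(charset, language):
    """Octets left for encoded-text once charset and language are placed."""
    # "=?" charset "*" language "?" encoding "?" encoded-text "?="
    overhead = 2 + len(charset) + 1 + len(language) + 1 + 1 + 1 + 2
    return ENCODED_WORD_LIMIT - overhead


# Longest currently registered tag (11 octets)
print(encoded_text_budget("ISO-8859-1", "cel-gaulish"))  # 46
# Placeholder standing in for a 63-octet tag, just under 64
print(encoded_text_budget("ISO-8859-1", "x" * 63))       # -6
```

A negative result means that no encoded-word using that charset can
carry the tag at all, let alone any text.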

> as the issue 
> was present in RFC 3066: private-use tags did not need to be registered
> in RFC 3066, so there was no way an implementation could be written with
> certain knowledge that tags beyond some given length would not be
> encountered.

True, but:
A. implementation is only one issue; protocol design (encoded-
    words and message header fields, for example) is a more
    important issue
B. private-use tags require end-to-end cooperation as a
    prerequisite; given such cooperation, agreement can be
    reached on tag length
C. Per some readings of BCP 82, not only are implementations
    not required to support experimental/private-use values,
    they are expected to erect barriers to their use, requiring
    users to specifically enable use of experimental/private-use
    functionality.

> > I am absolutely shocked that a draft dealing with language
> > lacks an "Internationalization considerations" section as
> > recommended by RFC 2277 (a.k.a. BCP 18).
> 
> No more or less shocking than for RFC 3066, regarding which I'm not
> aware of any complaints.

By deferring to the bilingual ISO lists for language and country
tags, 3066 at least provided a minimal degree of internationalization.
By explicitly limiting description fields to English and restricting
the charset to US-ASCII, the draft proposal takes a giant leap
backwards.

> I don't quite understand what the critique is here: what is there to
> internationalize about language tags?

There should probably be a reference (at least informative)
pointing to BCP 18 and mentioning that the language tags
defined provide a means of labeling the language of text,
when combined with other mechanisms (RFC 2047/2231
encoded-words, Content-Language fields, etc.), to
implement the BCP 18 requirement for language tagging.

The draft (if/when approved) should also indicate that
it updates BCP 18, which refers to RFC 1766.

Given the divergence noted above from RFC 3066's use
of multilingual reference lists, the Internationalization
considerations section should include a synopsis of the
approach chosen (viz. to restrict description to English) and
the rationale for that choice (see BCP 18 section 6).
[Conversely the difficulty in writing a convincing rationale
might prompt some effort into producing a less
Anglo-centric design.]

> It's 
> true that ALPHA and DIGIT are not defined

Non-sequitur aside, those terms are defined in RFC 2234.

> > Perhaps even more disturbing is the content of the "IANA
> > Considerations" section; the draft predicts that certain things
> > will happen ("IANA will"[...]), but doesn't actually direct
> > (e.g. "IANA shall") IANA to do anything.  The placement of that
> > section does not correspond to current RFC-Editor guidelines
> > (it should appear after Security Considerations); also on that
> > point, Appendices should precede References.
> 
> There is a process issue here, but I have assumed that the authors have
> dealt with IANA on that. Otherwise, these are editorial issues -- "even
> more disturbing" seems to me to be somewhat overstated.

The words "will" and "shall" have very distinct meanings.  If
one expects IANA to take specific action, it would be advisable
to clearly specify that IANA shall do so, rather than merely
expressing the hope that IANA will do so.

> > Many of the references are obsolete (e.g. RFCs 1327,
> > 1521)... and at least one reference ([19])
> > gives a bracketed URI rather than the correctly formatted
> > RFC reference. 

The RFC-Editor provides an "rfc-ref.txt" file containing the
preferred citations.  That file contains an "Obsoleted By"
column that points authors to the current RFC.  This isn't
rocket science...

> In fairness to the authors, page-oriented plain text is not exactly
> conducive to authoring and revising a long document,

There's no requirement to author in final publication
form. In fact the original RFC Editor has provided
guidelines and suggestions in the form of RFC 2223,
discussing methods that have been used successfully in
publishing quite long documents (textbooks!).  The current
RFC-Editor staff has a draft update. 

> >     implications (ISO 8601 date format parsing).
> 
> As mentioned above, this really is a non-issue.

It's an issue (esp. in light of the finger pointing regarding
accessibility to ISO 639/3166). Admittedly it can be
resolved without much difficulty (but then that could
have been done earlier, couldn't it?).

> > 2. the clear contradiction between the claims about
> >     ABNF compatibility with RFC 3066 and the factual
> >     incompatibility of certain provisions in the grammar.
> 
> The main concern was with the "grandfathered" production, but I've shown
> that that is a non-issue.

Again, it is an issue that imposes requirements on language
tag parsers.  What you've shown is that the ABNF is not
consistent with what was desired to be expressed, and
that makes it an issue that needs to be addressed.

> The maximal length issue exists just as much 
> in RFC 3066 due to private-use tags; it is a technical concern that
> might be worth reviewing in RFC 3066bis, however; but it is not
> insurmountable, and not a new problem.

Private-use carries its own considerable baggage; aside from
that, the draft proposal increases the length of non-private
tags that affect both protocol design and implementations
from a worst case maximum of 11 octets under RFC 3066
registered tags to an infinite length, which is unworkable
for existing Standards Track protocols (RFC 2822 at
Proposed, RFC 2047 at Draft, and RFC 822 at Full Standard,
to name a few).