draft-phillips-langtags-08, process, specifications, "stability", and extensions

Bruce Lilly blilly at erols.com
Sat Jan 1 18:50:50 CET 2005


>  Date: 2004-12-30 07:26
>  From: "Peter Constable" <petercon at microsoft.com>
>  To: ietf-languages at alvestrand.no, ietf at ietf.org
>  
> > From: ietf-languages-bounces at alvestrand.no [mailto:ietf-languages-
> > bounces at alvestrand.no] On Behalf Of Bruce Lilly
> 
> 
> > So why not then also throw in the closely linked specification of
> > the Content-Language field, which has historically been in the same
> > document (RFC 1766)?
> 
> It was removed in the development of RFC 3066, which was appropriate because it was a particular application involving language tags;

We may be in danger of confusing terminology; the Content-Language
field is a means of specifying the language of (part of) a
document, using a language-tag in the context of the MIME protocols.
"The World Wide Web" is an application.  Separating the specification
of language via a field from registration procedure was entirely
appropriate, as BCP documents are used for procedures and policies
and not for technical specifications.

> other applications exist, and other applications may use different approaches for how matching should be done. 

So I take it that you agree that the technical specification of
matching algorithms should also be separated from the tag registration
procedure?

> > > > Harald Alvestrand pointed out some time ago, that (inappropriately)
> > > > shifts implementation effort from the tag generator (no registration
> > > > required) to the recipient (what the heck does this mysterious tag
> > > > actually *mean*).
> 
> The *meaning* of any given language tag would be no more or less a problem under the proposed revision than it was for RFC 3066 or RFC 1766. For instance, there is a concurrent thread that has been discussing when country distinctions are appropriate or recommended ("ca" or "ca-ES"?); this discussion pertains to RFC 3066, and part of the issue is that meanings of tags are implied rather than specified -- and always have been even under RFC 1766 (I pointed this out five years ago when we were working on preparing RFC 3066).
> 
> So, for instance, when an author uses "de-CH", what does he intend recipients to understand to be the difference between that and "de-DE" or even "de"? Neither RFC 1766 nor RFC 3066 sheds any light on this, and ultimately only the author knows for sure.

That's a somewhat different take on the issue; certainly the ability
to use a generative mechanism (i.e. w/o review/registration of an
entire tag) can lead to a proliferation of incompatible uses by
independent generators (and possibly loss of interoperability as a
result). The draft under discussion would expand use of generative
mechanisms to encompass all but private-use tags, and thereby expands
the potential for such incompatibilities and loss of interoperability.
 
> > Under the proposed draft, anybody may legally generate
> > a tag such as
> >   sr-Latn-CS-gaulish-boont-guoyu-i-enochian
> > or
> >   sr-Latn-CS-gaulish-boont-guoyu-i-enochian-x-foo
> > with *no* specific registration requirements (i.e. all components
> > are either registered or require no registration). In the latter
> > case, a parser can only determine that it contains a private-use
> > subtag after wading through the other subtags.  In either case,
> > it is difficult (to say the least) for the recipient or his
> > software to determine what the generator of that tag intended to
> > convey.
> 
> I've shown that this is no different in general than what already exists for RFC 3066 or RFC 1766.

It is certainly different; under RFC 3066 rules such a tag (as a whole)
would be subject to review and registration.

> And I think we can all agree that there's much less likelihood of someone generating sr-Latn-CS-gaulish-boont-guoyu-i-enochian than there is of someone generating something like pl-AZ. So, I suggest that we not dwell on pathological cases that we aren't really likely to encounter.

Please don't confuse a specific example with the general principle.
Also, in technical specifications (of language tag syntax or anything
else), "likelihood" is largely irrelevant; the quality of the
specification is dependent on how well it handles all cases, including
edge cases.
 
> > A recipient using software that interprets RFC 3066
> > tags isn't going to be able to do anything useful with any
> > hypothetical tag which contains a script subtag that would be
> > produced under the draft rules (if the script subtag were to appear
> > *after* the region subtag, one could at least match "sr-CS-Latn"[...]
> > to "sr-CS", which an RFC 3066 parser could handle.
> 
> This would be no more or less true of registered tags like "az-Latn-AZ", for which registration requests were submitted but those were postponed (by the submitter withdrawing the request) until details for RFC3066bis were worked out. Again, the concerns you are raising in relation to the proposed replacement of RFC 3066 apply equally to RFC 3066 itself.

No, you seem to have missed the point; there exist RFC 3066
implementations. Such implementations, using the RFC 3066 rules,
could match something like "sr-CS-Latn" to "sr-CS", but could
not match "sr-Latn-CS" to "sr-CS".  By changing the definition of
the interpretation of the second subtag, the proposed draft fails
to be compatible with existing deployed implementations (which is
what is meant by "backwards compatibility", which is a prime
consideration for Internet protocols).
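
To make the compatibility point concrete, here is a minimal Python sketch
of the prefix-matching rule from RFC 3066 section 2.5: a language-range
matches a tag if it equals the tag, or equals a prefix of the tag that is
immediately followed by "-". The function name is made up for
illustration; the sketch is not taken from any deployed implementation.

```python
def rfc3066_range_matches(language_range: str, language_tag: str) -> bool:
    """Return True if language_range matches language_tag by prefix matching.

    Illustrative sketch of the rule in RFC 3066 section 2.5; tags are
    compared case-insensitively, and the special range "*" matches any tag.
    """
    rng = language_range.lower()
    tag = language_tag.lower()
    if rng == "*":
        return True
    # Exact match, or the range is a prefix ending at a subtag boundary.
    return tag == rng or tag.startswith(rng + "-")


# Script subtag placed after the region: the old rule still finds "sr-CS".
print(rfc3066_range_matches("sr-CS", "sr-CS-Latn"))  # True
# Script subtag interposed before the region: the old rule no longer matches.
print(rfc3066_range_matches("sr-CS", "sr-Latn-CS"))  # False
```

Under this rule a request for "sr-CS" is only satisfied when the country
code immediately follows the language code.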

> > > > It's not entirely clear if some of those items (e.g. script) should
> > > > be expressed by an orthogonal mechanism rather than embedded in a
> > > > *language* tag (for that matter, in retrospect, country codes was
> > > > probably a bad idea).
> 
> Of course it would not be clear if you don't have a conceptual model of what "language" tags are identifiers *of*. When RFC 3066 was being developed, there was a suggestion that script IDs be incorporated, but some were reluctant, raising the same question you have here. I was one of those. But I didn't remain obstructionist over the issue; instead, I gave a fair amount of thought to the ontology that underlies "language" tags, and subsequently published a white paper and presented on the topic at two conferences in the spring and fall of 2002. (Paper is available online at http://www.sil.org/silewp/abstract.asp?ref=2002-003 -- my thinking has evolved since then, but some key results remain valid, I think.) 

It's an issue of what is essential vs. what might be an orthogonal
issue applicable to specific cases.  That should (in an IETF
specification) take core Internet protocols into consideration.

> At this point, I feel confident that it is not a problem to combine script IDs into "language" tags, and this is the consensus of the domain experts that have been discussing this proposed revision for the past year and more.

Evidently w/o considering the implications of and for core Internet
protocols.  If script *can be* specified in a language tag *between*
the language code and country code, then a parser must be able to 
recognize that case and deal with it appropriately (which, as noted
above, existing RFC 3066 implementations in deployed use do not and
cannot do) at *any* time and in any context (context may not be
available when a Content-Language field is parsed).  I don't have an
issue with provision for specification of script where appropriate,
but for crying out loud, at least do so in a compatible manner (e.g.
a Content-Script field) rather than a) breaking compatibility with
deployed protocols and b) burdening applications which need not be
concerned with script with having to parse script information.
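
As a rough sketch of what the separate-field approach could look like to
a recipient, the following Python fragment reads language and script from
two independent header fields. Note the assumptions: "Content-Script" is
purely hypothetical (no such field is defined anywhere), and the sample
message and the helper name are invented for illustration.

```python
from email import message_from_string

# Invented sample headers; "Content-Script" is a hypothetical field.
RAW = (
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Language: sr-CS\n"
    "Content-Script: Latn\n"
    "\n"
    "body\n"
)


def language_and_script(raw_message: str):
    """Return (language tag, script) from two independent header fields."""
    msg = message_from_string(raw_message)
    language = msg.get("Content-Language")  # unchanged RFC 3066-style tag
    script = msg.get("Content-Script")      # hypothetical; None if absent
    return language, script


print(language_and_script(RAW))
```

An application that does not care about script would simply never look at
the second field, and the language tag itself stays in the form that
existing parsers already handle.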

> > > There is a clear need for script codes...
> 
> > But none of that applies to an audio file of spoken material,
> > where script would be superfluous...
> 
> Not a problem: the proposed revision *allows* for the use of script IDs but does not require them.

Yes, it's a problem. Having allowed them, each parser must be able
to handle them.

> In the case of audio content, one simply would never include a script ID. 

But a Content-Language field parser needs to be able to parse *any*
Content-Language field, without knowledge of whether the content
that is referred to by that field is audio, video, image, model,
application, or text.  Generation is easy; printf("%s", whatever); --
the problem is in parsing, particularly considering the deployed base
of RFC 3066-compliant parsers.

> > and, as noted above, would
> > lead to loss of backwards compatibility.
> 
> But, as noted above, this is not an issue that is peculiar to the proposed revision -- it already existed in RFC 3066.

No, given a primary subtag which is a language code (per RFC 1766,
any 2-character primary subtag; per RFC 3066, any primary subtag of
2 or 3 characters), the second subtag -- in either RFC 1766 or
RFC 3066 language tags -- is always a country code and never a script
code.  The proposed draft pulls the rug out from under existing
parsers by changing that.
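
A minimal Python sketch of how an existing parser might pull the country
code out of a tag, assuming it relies on the RFC 3066 rule that a
2-letter second subtag is an ISO 3166 country code. The function name and
return shape are assumptions made for illustration, not a description of
any particular deployed implementation.

```python
def split_rfc3066_style(tag: str):
    """Return (primary subtag, country code or None) for a language tag.

    Sketch only: the country is taken from the second subtag when that
    subtag is exactly two letters, and is otherwise reported as None.
    """
    subtags = tag.split("-")
    primary = subtags[0].lower()
    country = None
    if len(subtags) > 1 and len(subtags[1]) == 2 and subtags[1].isalpha():
        country = subtags[1].upper()
    return primary, country


print(split_rfc3066_style("sr-CS-Latn"))  # ('sr', 'CS')
print(split_rfc3066_style("sr-Latn-CS"))  # ('sr', None)
```

A parser written along these lines never sees the country in
"sr-Latn-CS", because the second subtag is no longer where it looks.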

> The bigger problem you're pointing out is the limitations of using suffix-truncation alone as a matching algorithm. In the discussion following the registration request for de-1996, etc., there was some discussion as to whether de-1996-DE format or de-DE-1996 format was preferable, and in the course of that discussion it was mentioned that sometimes the 1901 vs 1996 spelling differences would be more important than the regional dialect differences, but in other situations the regional differences would be more important than the spelling. But the problem with prefix matching used e.g. for Accept-Language is that only one of these two can be supported. That is a shortcoming in that application.

Again you seem to be conflating established Internet Standards Track
protocols with "applications" and ignoring the critical importance of
backwards compatibility.  Regardless, you again seem to be supporting
separation of matching algorithms from registration.

> Note that there is nothing that prevents other applications from using other matching algorithms, including perhaps something that is able to recognize in "az-AZ" and "az-Latn-AZ" that both involve Azeri as used in Azerbaijan.

The issue at hand is the existing deployed base of RFC 3066
implementations that depend on the matching algorithm specified
therein (which doesn't work with a script subtag interposed between
language code and country code).

> > Surely some types
> > of script are indicated by the charset; in situations where that
> > is not the case, a separate mechanism could be used for that
> > orthogonal parameter without breaking compatibility with
> > existing parsers of language tags.
> 
> This is all a discussion we on the IETF-languages list went through five years ago, and in the intervening five years I think we have reached a consensus on these issues, that consensus being reflected in the proposed revision to RFC 3066. (Note that we made the relevant decisions over a year and a half ago when we reached a consensus to register az-Latn etc. The precedent was established then; the proposed revision adds nothing new in this regard.)

As previously noted, that is a danger recognized by RFC 2026 in
activity that does not conform to IETF procedures; it is
possible to reach good consensus on the wrong approach.

> > Does the ISO not set ground rules for the 3166/MA?  Could it not
> > specify that codes are not to be reused?
> 
> No, ISO does not. The ground rules for the ISO 3166/MA are established in ISO 3166. I don't have the current version immediately at hand, but I believe the ground rules it specified were simply that something not be re-used for at least five years after it has been withdrawn. The re-assignment of CS made several parties very upset, and I note that the CD for the revision to ISO 3166-1 which is in progress has upped this to 50 years, and added a clause saying, "Before reallocating... the ISO 3166/MA shall consult, as appropriate, the authority or agency on whose behalf the code element was
> reserved and consideration shall be given to difficulties which might arise from the reallocation" -- nothing about consulting other users. 

I would think that that's covered by the "difficulties which might arise..."
part.  In any event, as the ISO seems to be in the process of tightening
the rules, it would be a more productive and mutually beneficial process
to convince the ISO to add specific language addressing specific issues
than to go off in a hissy fit saying (in effect) "we're setting up a
registry in competition with the ISO lists specifically to second-guess
the ISO and its MA". [By a process which demonstrably doesn't abide by
its own rules, I might add.]

> > > Matching hasn't actually changed...
> 
> > Do you not see the contradiction between "one should not expect to
> > receive anything less specific" vs. "may receive less specific
> > content"?
> 
> There is no substantive change from RFC 3066. RFC 3066 happened to mention one particular matching approach used in one application (HTTP), in relation to which it defined "language range"; but there is no question that there are different approaches to matching used in different applications, some of which may well involve receiving content the linguistic properties of which are not within the specific properties requested; and besides, the proposed revision retains the exact same definition for "language range" (for the sake of whatever applications may use that notion).

The problem is that the change to the language tag format is incompatible
with that algorithm.  Incidentally, RFC 3066 mentions HTTP w.r.t. the syntax
for language-range, but it does not restrict use of the matching algorithm
or of the Accept-Language field to HTTP or any other specific protocol
or set of protocols.
 
> > Please see RFC 2026 sections 7.1, 7.1.1, 7.1.3, and 10.1.
> > Note that RFC 3066 strictly complies with those sections, while
> > the draft under discussion, by cherry-picking from ISO lists
> > for which change control has not been transferred to the IESG,
> > does not.
> 
> 7.1 says,
> 
> <quote>
> To avoid conflict between competing versions of a specification, the
>    Internet community will not standardize a specification that is
>    simply an "Internet version" of an existing external specification
>    unless an explicit cooperative arrangement to do so has been made.
>    However, there are several ways in which an external specification
>    that is important for the operation and/or evolution of the Internet
>    may be adopted for Internet use.
> </quote>
> 
> The proposed revision does not create Internet-specific versions of ISO standards; it uses IDs drawn from ISO standards with semantics defined in those source standards at the time they were adopted for use in language tags -- the source for the IDs, the symbols and their meanings all reside in the ISO standards. The fact that not all are used, or that some are used as they were specified in a dated version of the ISO standard is not in contradiction with 7.1 -- it's just one of "several ways in which an external specification... may be adopted."

By cherry-picking, it effectively seeks to establish such a version.
The "several ways" refers not to some random procedure, but to specific
provisions in RFC 2026; moreover, ISO documents are specifically
covered by provisions regarding open external standards (as opposed
to proprietary specifications).

> 7.1.1 simply says that an open external standard may be incorporated merely by reference. There is no requirement here that is not met by the proposed revision.

It does not give leave to cherry-pick bits and pieces of an external
specification.  RFC 3066 does not do so. The draft under discussion
does.

> 7.1.3 simply says that an Internet specification may be an adaptation of an external specification provided certain conditions are met. Neither RFC 3066 nor the proposed revision is an adaptation of any existing external specification, so this is not applicable.

See above. Has ISO transferred change control to the IETF so that it
can declare some codes invalid?

> 10.1 states a general policy regarding IP: 
> 
> <quote>
> In all matters of intellectual property rights and procedures, the
>    intention is to benefit the Internet community and the public at
>    large, while respecting the legitimate rights of others.
> </quote>
> 
> Again, there is no requirement stated here that is not met by the proposed revision. Clearly, the intent of the proposed draft is to benefit the Internet community and the public at large. There are no rights of others that are in any way violated by the proposed revision.

The ISO, as developers of ISO 639 and 3166, have rights. In particular,
they have the right to determine what those standards specify -- in
whole -- and they have the right to revise and amend those standards,
and are the sole arbiters of what is (and what is not) "valid".

> > Agreed.  But the activity on the ietf-languages list regarding the
> > draft under discussion isn't an IETF process -- there is no WG or
> > Chair, no charter, etc.  Like the fictional Topsy, it jes' growed up.
> 
> RFC 3066 was developed in exactly the same manner as this proposed revision has been developed -- as an internet draft prepared by a member of the IETF-languages list and processed among members of that list until it was submitted for last call and subsequent IESG action.

There is a time limit within which objections may be raised. That limit
has passed. Moreover, RFC 3066 had fairly minor backwards compatibility
issues and corrected some defects by splitting off an independent
specification. The draft under discussion has many serious compatibility
issues, and it raises procedural concerns as well (e.g. cherry-picking
open external standard content, ignoring core Internet protocols).
In short, the benefit of the Internet Community is probably best
served by establishing an IETF Working Group, with corresponding
procedures, a charter, etc.

