RE: draft-phillips-langtags-08, process, specifications, "stability", and extensions

Thu Dec 30 13:26:56 CET 2004

> From: ietf-languages-bounces at alvestrand.no [mailto:ietf-languages-
> bounces at alvestrand.no] On Behalf Of Bruce Lilly

> So why not then also throw in the closely linked specification of
> the Content-Language field, which has historically been in the same
> document (RFC 1766)?

It was removed in the development of RFC 3066, which was appropriate because it was a particular application involving language tags; other applications exist, and other applications may use different approaches for how matching should be done.

> > > > No, the revision clearly expands the scope of language
> > > distinctions that can be represented with a language tag--quite
> > > significantly in some cases.
> > >
> > > Indeed, and without registration of the tags and the review process
> > > associated with that (existing RFC 3066) registration procedure. As
> > > Harald Alvestrand pointed out some time ago, that (inappropriately)
> > > shifts implementation effort from the tag generator (no registration
> > > required) to the recipient (what the heck does this mysterious tag
> > > actually *mean*).

The *meaning* of any given language tag would be no more or less a problem under the proposed revision than it was for RFC 3066 or RFC 1766. For instance, there is a concurrent thread that has been discussing when country distinctions are appropriate or recommended ("ca" or "ca-ES"?); this discussion pertains to RFC 3066, and part of the issue is that meanings of tags are implied rather than specified -- and always have been even under RFC 1766 (I pointed this out five years ago when we were working on preparing RFC 3066).

So, for instance, when an author uses "de-CH", what does he intend recipients to understand to be the difference between that and "de-DE" or even "de"? Neither RFC 1766 nor RFC 3066 sheds any light on this, and ultimately only the author knows for sure.

Under RFC 3066, it was the *exceptional* case that a complete tag was registered, allowing some indication of the meaning of the whole (though even in that regard nothing really required that the documentation provide a clear indication of the meaning). The other 98% of cases were those like "de-CH" in which it was assumed that everyone would understand what the intended meaning is.

> Under the proposed draft, anybody may legally generate
> a tag such as
>   sr-Latn-CS-gaulish-boont-guoyu-i-enochian
> or
>   sr-Latn-CS-gaulish-boont-guoyu-i-enochian-x-foo
> with *no* specific registration requirements (i.e. all components
> are either registered or require no registration). In the latter
> case, a parser can only determine that it contains a private-use
> subtag after wading through the other subtags.  In either case,
> it is difficult (to say the least) for the recipient or his
> software to determine what the generator of that tag intended to
> convey.

I've shown that this is no different in general than what already exists for RFC 3066 or RFC 1766. And I think we can all agree that there's much less likelihood of someone generating sr-Latn-CS-gaulish-boont-guoyu-i-enochian than there is of someone generating something like pl-AZ. So, I suggest that we not dwell on pathological cases that we aren't really likely to encounter.

> A recipient using software that interprets RFC 3066
> tags isn't going to be able to do anything useful with any
> hypothetical tag which contains a script subtag that would be
> produced under the draft rules (if the script subtag were to appear
> *after* the region subtag, one could at least match "sr-CS-Latn"[...]
> to "sr-CS", which an RFC 3066 parser could handle.

This would be no more or less true of registered tags like "az-Latn-AZ", for which registration requests were submitted but then postponed (by the submitter withdrawing the request) until details for RFC3066bis were worked out. Again, the concerns you are raising in relation to the proposed replacement of RFC 3066 apply equally to RFC 3066 itself.

> > > It's not entirely clear if some of those items (e.g. script) should
> > > be expressed by an orthogonal mechanism rather than embedded in a
> > > *language* tag (for that matter, in retrospect, country codes was
> > > probably a bad idea).

Of course it would not be clear if you don't have a conceptual model of what "language" tags are identifiers *of*. When RFC 3066 was being developed, there was a suggestion that script IDs be incorporated, but some were reluctant, raising the same question you have here. I was one of those. But I didn't remain obstructionist over the issue; instead, I gave a fair amount of thought to the ontology that underlies "language" tags, and subsequently published a white paper and presented on the topic at two conferences in the spring and fall of 2002. (Paper is available online at http://www.sil.org/silewp/abstract.asp?ref=2002-003 -- my thinking has evolved since then, but some key results remain valid, I think.) 

At this point, I feel confident that it is not a problem to combine script IDs into "language" tags, and this is the consensus of the domain experts that have been discussing this proposed revision for the past year and more.

> > There is a clear need for script codes...

> But none of that applies to an audio file of spoken material,
> where script would be superfluous...

Not a problem: the proposed revision *allows* for the use of script IDs but does not require them. In the case of audio content, one simply would never include a script ID.

> and, as noted above, would
> lead to loss of backwards compatibility.

But, as noted above, this is not an issue that is peculiar to the proposed revision -- it already existed in RFC 3066.

The bigger problem you're pointing out is the limitations of using suffix-truncation alone as a matching algorithm. In the discussion following the registration request for de-1996, etc., there was some discussion as to whether de-1996-DE format or de-DE-1996 format was preferable, and in the course of that discussion it was mentioned that sometimes the 1901 vs 1996 spelling differences would be more important than the regional dialect differences, but in other situations the regional differences would be more important than the spelling. But the problem with prefix matching used e.g. for Accept-Language is that only one of these two can be supported. That is a shortcoming in that application.
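
To make the limitation concrete, here is a minimal sketch of prefix matching as RFC 3066 describes it for language ranges (a range matches a tag if it equals the tag, or equals a prefix of the tag that is followed by "-"). The function name range_matches is only illustrative, and the case-folding is an assumption based on tags being case-insensitive.

```python
def range_matches(language_range: str, tag: str) -> bool:
    """Prefix match in the style of RFC 3066 language ranges.

    The range matches if it equals the tag, or if it is a prefix of the
    tag and the next character in the tag is a hyphen. "*" matches any tag.
    Comparison is done on lowercased strings (assumption: case-insensitive).
    """
    r = language_range.lower()
    t = tag.lower()
    if r == "*":
        return True
    return t == r or t.startswith(r + "-")


# The two subtag orders discussed for the German orthography tags.
for tag in ("de-1996-DE", "de-DE-1996"):
    for rng in ("de", "de-1996", "de-DE"):
        print(f"{rng!r} matches {tag!r}: {range_matches(rng, tag)}")
```

With the de-1996-DE order, the range "de-1996" matches but "de-DE" does not; with the de-DE-1996 order it is the other way around. Whichever order is chosen, prefix matching can honor only one of the two distinctions.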

Note that there is nothing that prevents other applications from using other matching algorithms, including perhaps something that is able to recognize in "az-AZ" and "az-Latn-AZ" that both involve Azeri as used in Azerbaijan.
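
As a rough illustration of what such an alternative might look like, the sketch below splits a tag into language, script, and region and compares components rather than prefixes. It is only a sketch under simplifying assumptions: a four-letter alphabetic subtag is taken to be a script, a two-letter alphabetic subtag after the first is taken to be a region, and everything else (variants, numeric regions, private-use subtags) is ignored. The names Parts, split_tag, and same_language_and_region are hypothetical.

```python
from typing import NamedTuple, Optional


class Parts(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]


def split_tag(tag: str) -> Parts:
    """Split a tag into language, script, and region by subtag shape.

    Assumption (not a full parser): 4 letters = script, 2 letters = region.
    """
    subtags = tag.split("-")
    language = subtags[0].lower()
    script = None
    region = None
    for sub in subtags[1:]:
        if len(sub) == 1:
            break  # singleton such as "x": stop before private-use subtags
        if script is None and region is None and len(sub) == 4 and sub.isalpha():
            script = sub.title()
        elif region is None and len(sub) == 2 and sub.isalpha():
            region = sub.upper()
    return Parts(language, script, region)


def same_language_and_region(a: str, b: str) -> bool:
    """True if both tags share language and region, whatever the script."""
    pa, pb = split_tag(a), split_tag(b)
    return pa.language == pb.language and pa.region == pb.region


print(split_tag("az-Latn-AZ"))
print(same_language_and_region("az-AZ", "az-Latn-AZ"))
```

Here "az-AZ" and "az-Latn-AZ" compare as the same language and region, which plain prefix matching on "az-AZ" cannot see.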

> Surely some types
> of script is indicated by the charset; in situations where that
> is not the case, a separate mechanism could be used for that
> orthogonal parameter without breaking compatibility with
> existing parsers of language tags.

This is all a discussion we on the IETF-languages list went through five years ago, and in the intervening five years I think we have reached a consensus on these issues, that consensus being reflected in the proposed revision to RFC 3066. (Note that we made the relevant decisions over a year and a half ago when we reached a consensus to register az-Latn etc. The precedent was established then; the proposed revision adds nothing new in this regard.)

> Does the ISO not set ground rules for the 3166/MA?  Could it not
> specify that codes are not to be reused?

No, ISO does not. The ground rules for the ISO 3166/MA are established in ISO 3166. I don't have the current version immediately at hand, but I believe the ground rules it specified were simply that something not be re-used for at least five years after it has been withdrawn. The re-assignment of CS made several parties very upset, and I note that the CD for the revision to ISO 3166-1 which is in progress has upped this to 50 years, and added a clause saying, "Before reallocating... the ISO 3166/MA shall consult, as appropriate, the authority or agency on whose behalf the code element was reserved and consideration shall be given to difficulties which might arise from the reallocation" -- nothing about consulting other users.

> > Matching hasn't actually changed...

> Do you not see the contradiction between "one should not expect to
> receive anything less specific" vs. "may receive less specific
> content"?

There is no substantive change from RFC 3066. RFC 3066 happened to mention one particular matching approach used in one application (HTTP), in relation to which it defined "language range"; but there is no question that there are different approaches to matching used in different applications, some of which may well involve receiving content the linguistic properties of which are not within the specific properties requested; and besides, the proposed revision retains the exact same definition for "language range" (for the sake of whatever applications may use that notion).

> Please see RFC 2026 sections 7.1, 7.1.1, 7.1.3, and 10.1.
> Note that RFC 3066 strictly complies with those sections, while
> the draft under discussion, by cherry-picking from ISO lists
> for which change control has not been transferred to the IESG,
> does not.

7.1 says,

<quote>
To avoid conflict between competing versions of a specification, the
   Internet community will not standardize a specification that is
   simply an "Internet version" of an existing external specification
   unless an explicit cooperative arrangement to do so has been made.
   However, there are several ways in which an external specification
   that is important for the operation and/or evolution of the Internet
   may be adopted for Internet use.
</quote>

The proposed revision does not create Internet-specific versions of ISO standards; it uses IDs drawn from ISO standards with semantics defined in those source standards at the time they were adopted for use in language tags -- the source for the IDs, the symbols and their meanings all reside in the ISO standards. The fact that not all are used, or that some are used as they were specified in a dated version of the ISO standard, is not in contradiction with 7.1 -- it's just one of "several ways in which an external specification... may be adopted."

7.1.1 simply says that an open external standard may be incorporated merely by reference. There is no requirement here that is not met by the proposed revision.

7.1.3 simply says that an Internet specification may be an adaptation of an external specification provided certain conditions are met. Neither RFC 3066 nor the proposed revision is an adaptation of any existing external specification, so this is not applicable.

10.1 states a general policy regarding IP: 

<quote>
In all matters of intellectual property rights and procedures, the
   intention is to benefit the Internet community and the public at
   large, while respecting the legitimate rights of others.
</quote>

Again, there is no requirement stated here that is not met by the proposed revision. Clearly, the intent of the proposed draft is to benefit the Internet community and the public at large. There are no rights of others that are in any way violated by the proposed revision.

Thus, I see no difference between RFC 3066 and this proposed revision in relation to compliance with the sections of RFC 2026 you referred to.

> Agreed.  But the activity on the ietf-languages list regarding the
> draft under discussion isn't an IETF process -- there is no WG or
> Chair, no charter, etc.  Like the fictional Topsy, it jes' growed up.

RFC 3066 was developed in exactly the same manner as this proposed revision has been developed -- as an internet draft prepared by a member of the IETF-languages list and processed among members of that list until it was submitted for last call and subsequent IESG action.

Peter Constable