RE: draft-phillips-langtags-08, process, specifications, "stability", and extensions

Peter Constable petercon at
Sun Jan 2 03:27:01 CET 2005

> From: ietf-languages-bounces at [mailto:ietf-languages-
> bounces at] On Behalf Of Bruce Lilly

> > It was removed in the development of RFC 3066, which was appropriate
> because it was a particular application involving language tags;
> We may be in danger of confusing terminology; the Content-Language
> field is a means of specifying the language of a (part of) a
> document, using a language-tag in the context of the MIME protocols.
> "The World Wide Web" is an application.

The *use* of RFC 3066 language tags in a particular field of some protocol is an application of RFC 3066. That is what I meant.

> Separating the specification
> of language via a field from registration procedure was entirely
> appropriate, as BCP documents are used for procedures and policies
> and not for technical specifications.

I'm not as deeply versed as you are in the distinct purposes of the different kinds of documents used for Internet specifications. It's my understanding that there is a general expectation that BCP documents are used for something like a registration procedure but not for a technical specification. Joel Halpern recently reported to me that he provided similar feedback on the draft to Harald Alvestrand some time ago, that Harald responded with reasons why things were mixed in this case, and that the two of them concluded that mixing these in a BCP document was acceptable in this case.

We already anticipate future revisions, and at that point we could consider whether dividing the content into distinct documents of different types would be better. In view of the time already taken, the delays incurred, and the fact that there are products in development that have been assuming the completion of this current round of revision, I think it makes the most sense to allow the mixing that has existed since RFC 1766 to persist in a BCP for this round.

> So I take it that you agree that the technical specification of
> matching algorithms should also be separated from the tag registration
> procedure?

Again, setting aside the question of how content is preferably divided among different types of IETF documents, I don't think there's a problem in describing one type of matching algorithm in this document, provided it is recognized that some applications may require different algorithms. And, again, I don't think it would be particularly helpful to delay completion of this revision any further to address an issue of mixing different kinds of content in a BCP that has existed through two versions already; if it's a serious problem, it can be remedied in the next, already-anticipated revision.

> > The *meaning* of any given language tag would be no more or less a
> problem under the proposed revision than it was for RFC 3066 or RFC 1766...

> That's a somewhat different take on the issue; certainly the ability
> to use a generative mechanism (i.e. w/o review/registration of an
> entire tag) can lead to a proliferation of incompatible uses by
> independent generators (and possibly loss of interoperability as a
> result). The draft under discussion would expand use of generative
> mechanisms to encompass all but private-use tags, and thereby expands
> the potential for such incompatibilities and loss of interoperability.

That was an issue I initially voiced when it was first suggested that the registry be a registry of subtags rather than tags. In practice, I'm not sure at this point that there's really a significantly greater problem with the new level of generativity than there was before. The reason is that the new elements have quite specific semantic effects on the whole, whereas the semantic impact of a region ID on the whole is less certain: it may imply dialectal variants or spelling variants, or it might actually reflect nothing and simply have been inserted because it could be. In contrast, there is little question of what effect a script ID such as "Hans" has on the whole.

> > > Under the proposed draft, anybody may legally generate
> > > a tag such as
> > >   sr-Latn-CS-gaulish-boont-guoyu-i-enochian
> > > or
> > >   sr-Latn-CS-gaulish-boont-guoyu-i-enochian-x-foo
> > > with *no* specific registration requirements (i.e. all components
> > > are either registered or require no registration). In the latter
> > > case, a parser can only determine that it contains a private-use
> > > subtag after wading through the other subtags.  In either case,
> > > it is difficult (to say the least) for the recipient or his
> > > software to determine what the generator of that tag intended to
> > > convey.
> >
> > I've shown that this is no different in general that what already exists
> for RFC 3066 or RFC 1766.
> It is certainly different; under RFC 3066 rules such a tag (as a whole)
> would be subject to review and registration.

You're describing a pathological case that is never going to occur, which I don't think is particularly helpful. If the meanings of "guoyu" and of "boont" are clearly documented, it will be evident that something like "sr-boont-guoyu" is basically as meaningless as "sr-guoyu" or as "pl-Hant-TH".

You might respond that "pl-Hant-TH" is at least conceivable whereas "sr-guoyu" is oxymoronic, but it's just as useless as long as it has no correlation with the real world. I do not think it should be a requirement of a language tag specification to constrain combinations that are not useful, inconsistent with reality, or logically impossible. Likewise, it really isn't necessary IMO to insist on registration of complete tags vs. subtags just to avoid tags that are not useful, inconsistent with reality, or logically impossible.

> No, you seem to have missed the point; there exist RFC 3066
> implementations. Such implementations, using the RFC 3066 rules,
> could match something like "sr-CS-Latn" to "sr-CS", but could
> not match "sr-Latn-CS" to "sr-CS". By changing the definition of
> the interpretation of the second subtag, the proposed draft fails
> to be compatible with existing deployed implementations (which is
> what is meant by "backwards compatibility", which is a prime
> consideration for Internet protocols).

Ah, but RFC 3066 does not sanction use of tags like "sr-CS-Latn" without registration, and no such tags are registered. 

Because of the prevalence of implementations that use a left-prefix matching algorithm, it is more useful to combine elements in the order "sr-Latn-CS" rather than "sr-CS-Latn". If "sr-CS-Latn" were used, these implementations would fail to match "sr-CS-Latn" with "sr-Latn", which is actually a greater problem than failing to match "sr-Latn-CS" with "sr-CS".
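
To make the comparison concrete, here is a minimal sketch of left-prefix matching in Python. It is an illustration only, not code taken from RFC 3066 or from any deployed implementation; the function name prefix_match is hypothetical, and the sketch assumes that a range matches a tag when it equals the tag or is a prefix of it ending at a subtag boundary, compared case-insensitively.

```python
def prefix_match(language_range: str, tag: str) -> bool:
    """Return True if language_range equals tag or is a left prefix of it
    that ends at a subtag boundary ('-'). Comparison ignores case.

    Hypothetical helper for illustration; not from any RFC or library.
    """
    r = language_range.lower()
    t = tag.lower()
    return t == r or t.startswith(r + "-")


if __name__ == "__main__":
    # Compare the two possible orderings of script and country subtags.
    for tag in ("sr-Latn-CS", "sr-CS-Latn"):
        for language_range in ("sr", "sr-Latn", "sr-CS"):
            result = prefix_match(language_range, tag)
            print(f"range {language_range!r} vs tag {tag!r}: {result}")
```

Under these assumptions, "sr-Latn-CS" is matched by the ranges "sr" and "sr-Latn" but not by "sr-CS", while "sr-CS-Latn" is matched by "sr" and "sr-CS" but not by "sr-Latn".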

> > At this point, I feel confident that it is not a problem to combine
> script IDs into "language" tags, and this is the consensus of the domain
> experts that have been discussing this proposed revision for the past year
> and more.
> Evidently w/o considering the implications of and for core Internet
> protocols. 

You assume this is w/o such consideration. I think otherwise. I can't say that consideration has been given to every single individual protocol. But consideration has been given to many different protocols and usage scenarios. I think it's appropriate for the onus to be on someone to identify particular problems they feel would exist with protocols that concern them (which is precisely the kind of thing we have last-call announcements for).

> If script *can be* specified in a language tag *between*
> the language code and country code, then a parser must be able to
> recognize that case and deal with it appropriately (which, as noted
> above, existing RFC 3066 implementations in deployed use do not and
> cannot do) at *any* time and in any context (context may not be
> available when a Content-Language field is parsed). 

As described above, I think this argument is invalid.

> I don't have an
> issue with provision for specification of script where appropriate,
> but for crying out loud, at least do so in a compatible manner (e.g.
> a Content-Script field) rather than a) breaking compatibility with
> deployed protocols and b) burdening applications which need not be
> concerned with script from having to parse script information.

I've stated that the imputed back-compat problem is a non-issue. Lots of consideration was given to this early on. If you want to press this argument, I think you need to show exactly how a problem would result in realistic usage scenarios.

Can you identify for us an Internet protocol that would not be concerned with script distinctions? 

Can you identify an Internet protocol for which matching algorithms imply that "sr-CS-Latn" makes better sense than "sr-Latn-CS"?

> > > > There is a clear need for script codes...
> >
> > > But none of that applies to an audio file of spoken material,
> > > where script would be superfluous...
> >
> > Not a problem: the proposed revision *allows* for the use of script IDs
> but does not require them.
> Yes, it's a problem. Having allowed them, each parser must be able
> to handle them.

Look, they're already there in registered tags. This draft isn't doing anything new in that regard.

> > > and, as noted above, would
> > > lead to loss of backwards compatibility.
> >
> > But, as noted above, this is not an issue that is peculiar to the
> proposed revision -- it already existed in RFC 3066.
> No, given a primary subtag which is a language code (and per RFCs
> 1766 and 3066, that's any primary subtag with 2 or more (RFC 3066
> only, more being limited to 3) characters), the second subtag --
> in either RFC 1766 or RFC 3066 language tags -- is always a country
> code and never a script code.

Go back and read RFC 3066 again. It does not impose that constraint:

   The following rules apply to the second subtag:

   - All 2-letter subtags are interpreted as ISO 3166 alpha-2 country
     codes from [ISO 3166], or subsequently assigned by the ISO 3166
     maintenance agency or governing standardization bodies, denoting
     the area to which this language variant relates.

   - Tags with second subtags of 3 to 8 letters may be registered with
     IANA, according to the rules in chapter 5 of this document.

It must be a country ID *if* it is two letters, but not otherwise.
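
The two quoted rules can be restated as a small Python sketch. This is an illustration only: the function name and the returned labels are hypothetical, and the sketch covers just the two rules quoted above, not the rest of RFC 3066.

```python
def classify_second_subtag(tag: str) -> str:
    """Classify the second subtag of a tag using only the two rules
    quoted above from RFC 3066. Labels are illustrative, not normative."""
    subtags = tag.split("-")
    if len(subtags) < 2:
        return "no second subtag"
    second = subtags[1]
    letters_only = second.isascii() and second.isalpha()
    if letters_only and len(second) == 2:
        # Rule 1: all 2-letter second subtags are ISO 3166 country codes.
        return "country code"
    if letters_only and 3 <= len(second) <= 8:
        # Rule 2: such tags may be registered with IANA.
        return "registrable"
    return "not covered by the quoted rules"


if __name__ == "__main__":
    for example in ("sr-CS", "az-Latn", "sr"):
        print(example, "->", classify_second_subtag(example))
```

A tag such as "az-Latn" falls under the second rule: its second subtag has four letters, so nothing in the quoted text makes it a country code.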

> The proposed draft pulls the rug out
> from under existing parsers by changing that.

You are completely mistaken on this point -- the proposed draft does not change the constraint you assumed as that constraint never existed.

> Again you seem to be conflating established Internet Standards Track
> protocols with "applications"

I apparently am using "applications" in a sense you're not familiar with. I don't think it's that uncommon to refer to a specification A that makes use of another specification B as an application of B.

> and ignoring the critical importance of
> backwards compatibility.

As stated earlier, I quite disagree that back-compat issues have been ignored.

> > Note that there is nothing that prevents other applications from using
> other matching algorithms, including perhaps something that is able to
> recognize in "az-AZ" and "az-Latn-AZ" that both involve Azeri and used in
> Azerbaijan.
> The issue at hand is the existing deployed base of RFC 3066
> implementations that depend on the matching algorithm specified
> therein (which doesn't work with a script tag interposed between
> language code and country code).

You say that these do not work. These implementations will still work; they will simply match "sr-Latn-CS" with "sr-Latn" but not with "sr-CS". If that is a problem, please explain why.
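
As for the other kind of matching mentioned in the quoted passage, one that recognizes that "az-AZ" and "az-Latn-AZ" involve the same language and country, here is a minimal Python sketch. It is not an algorithm from RFC 3066 or from the draft; the function names are hypothetical, and it assumes that the first 2-letter subtag after the primary subtag is the country code and that any subtag before it (such as a 4-letter script ID) can be skipped.

```python
from typing import Optional, Tuple


def language_and_region(tag: str) -> Tuple[str, Optional[str]]:
    """Extract (language, region) from a tag, ignoring other subtags.

    Assumptions for illustration: the primary subtag is the language, and
    the first 2-letter subtag after it is the region. Scanning stops at a
    single-character subtag (e.g. 'x'), since what follows is not a region.
    """
    subtags = tag.lower().split("-")
    language = subtags[0]
    region = None
    for subtag in subtags[1:]:
        if len(subtag) == 1:
            break
        if len(subtag) == 2 and subtag.isalpha():
            region = subtag
            break
    return language, region


def same_language_and_region(tag_a: str, tag_b: str) -> bool:
    """True if both tags name the same language and the same region."""
    return language_and_region(tag_a) == language_and_region(tag_b)


if __name__ == "__main__":
    print(same_language_and_region("az-AZ", "az-Latn-AZ"))
    print(same_language_and_region("sr-CS", "sr-Latn-CS"))
    print(same_language_and_region("az-AZ", "az-Latn"))
```

With these assumptions, "az-AZ" and "az-Latn-AZ" compare as equal, as do "sr-CS" and "sr-Latn-CS", whereas "az-AZ" and "az-Latn" do not, because the latter carries no country code.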

> > This is all a discussion we on the IETF-languages list went through five
> years ago, and in the intervening five years I think we have reached a
> consensus on these issues, that consensus being reflected in the proposed
> revision to RFC 3066. (Note that we made the relevant decisions over a
> year and a half ago when we reached a consensus to register az-Latn etc.
> The precedent was established then; the proposed revision adds nothing new
> in this regard.)
> As previously noted, that is a danger recognized by RFC 2026 in
> activity that does not conform to IETF procedures; it is
> possible to reach good consensus on the wrong approach.

Well, that potential was created when RFC 1766 was first approved. Tags like az-Latn could have been registered under the terms of that RFC just as readily as under RFC 3066.

But you are speaking as though it's a problem that these tags are registered. I have no idea why.

> > 7.1 says...

> > The proposed revision does not create Internet-specific versions of ISO
> standards...

> By cherry-picking, it effectively seeks to establish such a version.

I would not call what is done "cherry-picking". Any identifier defined in the source standard is valid for use, except in the case that the identifier was previously defined with a different meaning in that ISO standard. That isn't cherry-picking; that is a blindly-applied general principle, created with reasoned motivation: to provide stability.

But speaking of selective usage, have you noticed that RFC 3454 identifies specific characters from ISO/IEC 10646 as prohibited? Various space and control characters are not permitted, INVISIBLE TIMES isn't permitted, END OF AYAH isn't permitted, COMBINING GRAVE TONE MARK isn't permitted... How is what is proposed in this draft any more "cherry-picking" than that?

> > 10.1 states a general policy regarding IP...

> The ISO, as developers of ISO 639 and 3166, have rights. In particular,
> they have the right to determine what those standards specify -- in
> whole -- and they have the right to revise and amend those standards,
> and are the sole arbiters of what is (and what is not) "valid".

They certainly have and retain rights over standards for language, script and country identifiers. They do not, however, determine what is valid for use in Internet protocols. Just as it is appropriate for an IETF document such as RFC 3454 to specify for particular reasons that certain encoded entities of ISO/IEC 10646 are not valid for Stringprep output, so also it is appropriate for an IETF document to specify for particular reasons that certain encoded entities of an ISO standard are not valid for use in language tags used on the Internet.

Peter Constable
