New Last Call: 'Tags for Identifying Languages' to BCP

Sun Dec 12 17:46:52 CET 2004

>  Date: 2004-12-10 22:37
>  From: John Cowan <jcowan at reutershealth.com>
>  
> Bruce Lilly scripsit:
> 
> > It's not clear to me that the proposal will provide protection
> > against the whims of politicians.  If the definition of "CS" as
> > a country code changes again under the proposed scheme,
> > how is one to determine specifically what some archived
> > language-tag referred to at some point in time?  I'm not
> > particularly concerned about that problem, as I am resigned
> > to instability associated with anything specified by politicians
> > (and that includes the UN region codes).
> 
> The U.N. Statistics Division are only "politicians" in the sense
> that IETF WG members are.  They are, in fact, statisticians.
> Their track record for stability is considerably longer than the
> IETF's.

I hope that I need not repeat any of the well-known remarks
about "statistics".  Nor that I need point to the many uses
by politicians of statistics (and statisticians) for
political purposes.

Moreover, the point is that countries do change, and that use
of country codes (as provided for in RFC 3066 and in the
proposed draft) carries with it the inherent instability
which is characteristic of politics.  A quest for "stability"
of countries seems Quixotic and oxymoronic.  According to the
principle of stability as that term is used in defense of the
draft, I suppose we're all intended to refer to Malawi as
"Rhodesia" because that's what it (in part) was called 50 years
ago, or that we're supposed to ignore the breakup of the USSR,
Yugoslavia, etc., the reunification of Germany, etc.

A related problem with the use of country codes in language
tags is that there is not necessarily an inherent relationship
between language and country borders.  The borders of Germany
have changed many, many times.  If one is referring to the
German language as spoken by inhabitants of Alsace, using
country codes would imply that that same language spoken by
the same people would have been tagged at various times as
de-DE and de-FR according to where the France-Germany border
happened to have been determined by politicians of the time.
That strikes me as being a rather silly way to tag language,
but that's the precedent set by RFC 1766.  As far as I can tell,
the draft doesn't really deal with the issue of changing borders
or changing country names -- it merely pretends that these
things don't happen by attempting to declare a snapshot of the
status at some point in time as being valid for all time.

> > But if the proposed new registry's description of "CS" says
> > "foo" and the ISO standard code list says "bar", what's
> > an implementor supposed to present to a user as *the*
> > description associated with "CS"?
> 
> The former.  That's the whole point of having a registry.

But the user has indicated that he speaks French, and the
proposed registry contains a description in English only.
Where is the implementor supposed to get the *official*
translation for display?  N.B. under the current (RFC 3066)
situation, the definitive ISO lists provide an official
description in French.

> > One possibility would be two description fields.  
> 
> Why two?

There are now two in the ISO lists (and, as noted, in the
UN list).  I have no objection to more, but I object to
a reduction.  The text accompanying the new last call
states:

"This specification addresses each of these issues with a simple, elegant design
that is compatible with existing language tags and implementations."
and
"One concern that is crucial to acceptance of the new language tag design is how
it works with existing implementations of RFC 3066 and how existing
implementations will interact with implementations of the newer language tags."
and
"It is important to recognize that all language tags that were valid under the
existing RFC 3066 will remain valid, with their meanings intact, under this
specification."

I have an implementation which (in accordance with RFC 3066)
uses the official ISO lists. It has provision for displaying
ISO 639 language tags with their descriptions in either of the
two languages supported by the official 639 lists, and likewise
for the ISO 3166 country codes.  The specification of the
draft is *NOT* compatible with that existing implementation
because it removes the existing functionality of official
descriptions in French of language and country codes. As a
result of that incompatibility, the newly proposed
specification does not work with (at least that one)
existing implementation (but I agree that that is a crucial
concern).
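
To illustrate the kind of lookup at issue, here is a minimal Python
sketch. It is not the implementation mentioned above; the table
contents are a few illustrative sample entries, and the names
BILINGUAL, ENGLISH_ONLY and describe are hypothetical. It only
shows that a request for a French description which succeeds
against a two-language table has nothing to return when the table
carries English alone.

```python
# Illustrative sample entries only; not a copy of the ISO lists
# or of the registry proposed in the draft.
BILINGUAL = {
    "de": {"en": "German", "fr": "allemand"},
    "fr": {"en": "French", "fr": "français"},
}

# The same codes with the French descriptions dropped.
ENGLISH_ONLY = {
    "de": {"en": "German"},
    "fr": {"en": "French"},
}


def describe(table, code, ui_lang):
    """Return the description of `code` in `ui_lang`, or None if absent."""
    entry = table.get(code.lower())
    if entry is None:
        return None
    return entry.get(ui_lang)


if __name__ == "__main__":
    print(describe(BILINGUAL, "de", "fr"))
    print(describe(ENGLISH_ONLY, "de", "fr"))
```

With the second table the caller must either fall back to English
or obtain a translation from somewhere other than the registry.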

Language tags remaining valid, I presume that the tag "sr-CS"
will continue to mean Serbian as used in Serbia and Montenegro
(officially equivalent to Serbe par Serbie et Monténégro) as that
is a valid RFC 3066 language tag and its corresponding meaning...
but I can see no evidence of that in the draft -- indeed it
appears that the draft would change that meaning significantly.

> There are 6000 languages spoken on Earth, of which 
> perhaps 600 have a standard written form.

ISO 639 lists about 650, not precisely 6000.

It might be worthwhile considering the differences in the
way language tags are used, by whom they are used, and for
what purpose.  There may well be a substantial difference
between use of a tag to represent an obscure dialect of a
dead language in a research paper vs. tagging a piece of
text in one of the core Internet protocols such as SMTP.
The draft seems to ignore the needs of the core Internet
protocols (e.g. unbounded tag length which is incompatible
with those protocols).

> What is supposed to 
> be privileged about English and French?

They happen to be the languages in which international
standards (q.v. the ISO and UN lists) are published. If
one is going to use those standards (or a snapshot of them)
as a basis for subtags, then one ought to preserve the
standardized descriptions in the official languages of those
standards rather than discarding all but one of them in a
fit of Anglo-centrism.

> > Eliminating bilingual descriptions for the language,
> > country (and UN region) codes leaves implementors
> > in a quandary.
> 
> Only for those implementers to whom English and French, but
> no other language, is essential.

That includes implementors of RFC 3066, where the relevant
standards provide official bilingual descriptions of the
country and language codes.  The "new last call" text states
that the draft proposal "is compatible with" such
implementations, but that is not evident in the substance of
the proposal.

> > ABNF from the draft:
> 
> You're technically right, but your underlying claim (that RFC 3066 tags are
> bounded in length) is false, as has been shown

One part of my claim is that non-private-use RFC 3066 tags
up to the present time are no longer than 11 octets in length.
As the draft, if/when approved, would close that registration
process, that limit (unless a longer tag is registered in
the interim) would apply for all time.  The other part of my
claim is that under the proposed scheme, non-private-use tags
become unbounded in length, and that is incompatible with
existing Standards Track RFCs (821, 822, 2821, 2822, 2047,
2231, 3282 among them) and the core Internet protocols
which they specify.
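
The difference between a bounded and an unbounded tag can be
sketched as follows (Python). The limit of 11 octets is the figure
claimed in this message, not a value taken from any RFC, and the
names MAX_TAG_OCTETS, within_limit and hypothetical_long_tag are
assumptions made for the illustration. The long tags it builds are
made up and are not registered anywhere.

```python
MAX_TAG_OCTETS = 11  # figure claimed in this message, not from an RFC


def within_limit(tag, limit=MAX_TAG_OCTETS):
    """True if the tag, encoded as US-ASCII, is at most `limit` octets."""
    return len(tag.encode("ascii")) <= limit


def hypothetical_long_tag(extra_subtags):
    """Build a made-up tag with `extra_subtags` extra 8-letter subtags."""
    return "-".join(["sr", "CS"] + ["abcdefgh"] * extra_subtags)


if __name__ == "__main__":
    for tag in ("sr-CS", "de-DE", hypothetical_long_tag(1),
                hypothetical_long_tag(3)):
        print(tag, len(tag), within_limit(tag))
```

Every additional subtag adds up to nine octets, so a consumer that
sized its storage for the existing tags has no fixed bound to rely
on once the number of subtags is unlimited.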

> and the "grandfathered" 
> production is only used to match certain existing registered RFC 3066
> tags as they appear in the registry.

Then the ABNF for that production should match those
"certain existing registered RFC 3066 tags as they appear
in the registry", and not match unbounded-length subtags,
non-alphabetic primary subtags, zero-length subtags, dangling
hyphens, etc.; I don't want that ABNF to be used as an excuse
for a future revision to introduce such constructs officially
on the basis that they are permitted by that ABNF.
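
For comparison, the RFC 3066 syntax itself (a primary subtag of 1
to 8 letters, followed by hyphen-separated subtags of 1 to 8
letters or digits) already excludes those constructs. A minimal
Python sketch of that check follows; the names RFC3066_TAG and
is_rfc3066_syntax are hypothetical, and this tests the RFC 3066
grammar only, not the draft's "grandfathered" production and not
registry membership.

```python
import re

# RFC 3066 syntax: Primary-subtag = 1*8ALPHA,
# Subtag = 1*8(ALPHA / DIGIT), joined by single hyphens.
RFC3066_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def is_rfc3066_syntax(tag):
    """True if the whole string matches the RFC 3066 tag grammar."""
    return RFC3066_TAG.fullmatch(tag) is not None


if __name__ == "__main__":
    samples = [
        "sr-CS",         # well-formed
        "de-",           # dangling hyphen
        "de--DE",        # zero-length subtag
        "1de-DE",        # non-alphabetic primary subtag
        "de-abcdefghi",  # subtag longer than 8 characters
    ]
    for tag in samples:
        print(repr(tag), is_rfc3066_syntax(tag))
```

A production meant to match only the existing registered tags could
be narrower still, for example an explicit list of those tags, but
that list is not reproduced here.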