New Last Call: 'Tags for Identifying Languages' to BCP

Mon Dec 13 08:17:29 CET 2004

I've had a hard time indeed keeping up with the list traffic resulting
from Bruce Lilly's comments and responses thereto.  At last count, there
have been 44 messages (in 11 digests) within the past 58 hours, with
more undoubtedly coming as I write this.

So a lot of what I write here will probably be outdated on arrival.
Some is already intended as clarifications and amplifications of what
others have written.  I apologize for any *needless* redundancy, while
noting at the same time that some points apparently need to be made
again and again.  :-)

Topics will be highlighted ** like this ** with discussion following.

** Registry available in English only, not French **

Neither RFC 1766, RFC 3066, nor the current draft (RFC 3066bis), nor for
that matter the relevant ISO standards, specifies "official" or
"normative" or "definitive" names for languages, scripts, countries, or
anything else.  Although an implementer is free to use these names in
their applications, as I have chosen to do in mine, there is no
requirement to do so.  Many, in fact, will choose to use more familiar
names, such as "Libya" instead of "Libyan Arab Jamahiriya," or "Laos"
instead of "Lao People's Democratic Republic," or "North Korea" instead
of "Korea, Democratic People's Republic of."

Contrary to what has been claimed, the names in the ISO standards are
not "definitive" under RFC 3066.  Only the codes are.  RFC 3066 says the
subtags are to be "interpreted as codes from" or "interpreted according
to assignments found in" the ISO standards, but that does not mean the
*exact wording* of the descriptions is normative.  It means that the
intent of the description must be maintained.  This should not be a hard
concept to grasp.

For example, if you have the region subtag "GB" -- or, for that matter,
the underlying ISO 3166-1 code "GB" -- you are free to associate any of
the following names with it, to the extent you believe these names are
equivalent:

- Great Britain
- (the) United Kingdom
- (the) United Kingdom of Great Britain and Northern Ireland
- UK
- Blighty

What you are *not* free to do is associate GB with Gabon or Gambia or
some other country that is not the UK.  (You should probably also avoid
"England," as doing so will annoy the Scottish and Welsh and a good many
of the Northern Irish.  "Blighty" may have the same problem.)  This is
the intent of not only RFC 3066bis, but also RFC 3066 and 1766.
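
As a rough sketch of this distinction in Python: the code is what the
standard fixes, and the label shown to a user is the application's own
choice.  The table and function names below are hypothetical, and the
display names are just one implementer's preference.

```python
# Hypothetical display-name table.  The keys are ISO 3166-1 codes, which
# are fixed by the standard; the values are whatever equivalent names an
# application prefers, and carry no official status.
DISPLAY_NAMES = {
    "GB": "United Kingdom",
    "LY": "Libya",
    "LA": "Laos",
    "KP": "North Korea",
}

def display_name(region: str) -> str:
    """Return the application's preferred name for a region subtag."""
    code = region.upper()  # language tags are compared case-insensitively
    # Fall back to the bare code rather than guessing a name.
    return DISPLAY_NAMES.get(code, code)
```

Swapping "United Kingdom" for "UK" in that table changes nothing about
what the tag identifies; pointing "GB" at Gabon would.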

Having said this, I actually do support the addition of a
French-language description field, or, alternatively, the co-publication
of a French-language registry that would differ from the
English-language registry *only* in the description field and comments.
I am NOT in favor of this because I believe the draft is irreparably
broken without it, but simply out of an interest in maintaining
something in the registry that is present in the underlying ISO
standards.  But note that the registered tags in RFC 3066 (as opposed to
the ISO-based ones) are already English-only, as many have stated.

Mark Crispin responded to Bruce:

>> SO where are the French definitions?
>
> Ask a person who is bilingual in English and French to provide one.

As a matter of fact, I actually did send the registry to two
Québec-based francophones, who have helped me on projects like this in
the past, and asked them to translate the descriptions of registered
tags and "deprecated" or "withdrawn" ISO-based subtags.  I gave them a
head start by filling in the descriptions of non-withdrawn ISO-based
codes, based on the existing ISO standards.  Unfortunately, that was in
early August, and they still have not responded.  So you can't say it
hasn't been tried!

In summary, I tend to agree with Bruce that French descriptions should
be made available.  It certainly wouldn't hurt anything to do so.
However, I disagree VERY STRONGLY that their absence is some kind of
major, critical flaw that undermines the entire draft and demonstrates
underlying anglocentrism or francophobia.  Half of Bruce's discussion
has been about transforming this molehill into K2.

** ABNF for grandfathered tags **

As many people have stated, the ABNF for the "grandfathered" production
is only intended to represent the *context-free syntax* of tags carried
over from RFC 3066.  It applies only to those 46 existing tags, plus any
others that may be approved before RFC 3066bis goes live.  It is
absolutely subject to that constraint; there is no other type of tag or
subtag that could possibly take advantage of the "grandfathered"
construction.

It would probably be a good idea for the "grandfathered" production in
RFC 3066bis to be the same as RFC 3066 for these tags.  The important
point, however, is that this is *completely moot* because the
"grandfathered" production is, by definition, limited to those tags
carried over from 3066, where the stricter ABNF applied.  "a123-xyz" and
all the others Bruce mentioned are not permitted under RFC 3066, so they
cannot occur in 3066bis tags either.

ABNF is excellent for indicating context-free grammars, but not all
grammars are context-free.  This is a prize example.  It is not at all
"a very different grammar from RFC 3066" and does not impose dramatic
new constraints on parsers (it didn't on mine).  RFC 3066 had many
specific constraints involving the sources of subtags and the need to
register tags not covered by the source standards, just as RFC 3066bis
does.
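
The two-stage check described here (a context-free syntax, then a
constraint the syntax cannot express) might be sketched as follows.
This is only an illustration: the loose pattern is a stand-in and NOT
the draft's actual "grandfathered" production, and the set holds a
handful of tags registered under RFC 3066 rather than the full list.

```python
import re

# Stand-in for a permissive context-free rule (an assumption, not the
# draft's real ABNF): two or more hyphen-separated subtags, each made of
# 1 to 8 letters or digits.
LOOSE_SYNTAX = re.compile(r"[A-Za-z0-9]{1,8}(?:-[A-Za-z0-9]{1,8})+")

# Illustrative subset of tags registered under RFC 3066, stored lowercase.
GRANDFATHERED = {"en-boont", "zh-guoyu", "cel-gaulish", "i-klingon"}

def is_grandfathered(tag: str) -> bool:
    """A tag qualifies only if it fits the syntax AND is on the fixed list."""
    return bool(LOOSE_SYNTAX.fullmatch(tag)) and tag.lower() in GRANDFATHERED

print(is_grandfathered("en-boont"))   # True
print(is_grandfathered("a123-xyz"))   # False: fits the loose syntax, never registered
```

The second test is what makes the looseness of the first one harmless:
nothing gets through unless it was already registered.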

Bruce wrote:

> I don't want that ABNF to be used as an excuse
> for a future revision to introduce such constructs officially
> on the basis that they are permitted by that ABNF.

Nothing in this or any other document can undertake to prevent a
subsequent revision from doing whatever the heck it likes.  If this
group or another group decides, 10 years from now, that they want to
allow language tags with empty subtags, or 100-letter subtags, they
won't need to look to this document for "permission" before making that
change.

Summary:  Peter Constable's suggestion, to convert the ONE line of ABNF
in RFC 3066bis that deals with grandfathered tags, should be adopted.
No other action on this overblown topic should be taken or considered.

** Maximum length of tags under RFC 3066bis **

Bruce makes the claim that excessively long tags under RFC 3066bis will
break other Internet protocols, such as RFC 2047.  But neither RFC 3066
nor RFC 1766 made any mention of being constrained by these protocols,
as to length or any other consideration.  The words 'length' and 'long'
don't even appear in RFC 3066 or 1766.

As stated before, but it seems to bear repeating, RFC 3066 did not limit
any registered tag to 11 octets or any other limit.  Indeed, several
tags of 13 octets ("sl-rozaj-bisk" and brethren) were proposed in
October 2003.  They would have been perfectly legal under 3066, but were
withdrawn by the proposer, largely on the basis that a revision such as
the current one would allow the equivalent tag to be registered one
subtag at a time.

It is not true that RFC 3066 had some upper bound on the length of tags
that 3066bis repeals.  The ABNF for RFC 3066:

Language-Tag = Primary-subtag *( "-" Subtag )
Primary-subtag = 1*8ALPHA
Subtag = 1*8(ALPHA / DIGIT)

very clearly does *not* indicate any limit on the number of subtags.
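
For anyone who wants to check this mechanically, here is a small Python
sketch that transcribes the three ABNF rules quoted above into a regular
expression.  The function name is made up for the example, and the check
covers syntax only; it says nothing about whether a tag is registered.

```python
import re

# Direct transcription of the three rules: ALPHA is A-Z / a-z and DIGIT
# is 0-9, as in standard ABNF.  Note there is no cap on the repetition.
PRIMARY = r"[A-Za-z]{1,8}"
SUBTAG = r"[A-Za-z0-9]{1,8}"
LANGUAGE_TAG = re.compile(rf"{PRIMARY}(?:-{SUBTAG})*")

def is_rfc3066_syntax(tag: str) -> bool:
    """True if tag matches the RFC 3066 ABNF (syntax only)."""
    return LANGUAGE_TAG.fullmatch(tag) is not None

print(len("sl-rozaj-bisk"), is_rfc3066_syntax("sl-rozaj-bisk"))  # 13 True
print(is_rfc3066_syntax("-".join(["en"] + ["boont"] * 20)))      # True
print(is_rfc3066_syntax("a123-xyz"))                             # False
```

The 21-subtag monstrosity passes, which is the point: any length limit
would have to come from somewhere other than this grammar.  "a123-xyz"
fails because the primary subtag must be purely alphabetic.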

Bruce made a curious comment, that RFC 3066bis "encourages use of more
subtags."  The existence of new types of subtags (extended-language,
script, variant, extension, and private-use) does not mean the draft
"encourages" excessive detail in tagging; it merely accommodates this
kind of detail if someone needs it.  There is certainly no
recommendation or suggestion that users engage in silliness like
"en-boont-boont-boont-etc." or "sr-CS-891-boont-gaulish-guoyu-etc."  (In
fact, "891" has not been proposed to be a valid subtag for quite some
time; how old IS the draft he is reading?)  Ironically, despite his
comments about the whims of politicians, it is ordinary users that Bruce
does not seem to trust to Tag Content Wisely.

** Stability and RFC 3066bis **

Bruce asserted multiple times that primary language subtags must be
identical to ISO 639 language codes:

> Of course, it is a more serious defect of the proposal
> that it would fail to reflect internationally-agreed
> codes and would fail to keep pace with changes...

An explicit goal of the draft is to follow the international codes (ISO
639 and others) except when they introduce instability, and to provide a
sensible fallback when they do.  It "keeps pace with changes" by
assessing their impact on a one-off basis and using them or working
around them as appropriate.  It does not blindly accept the changes made
by the "politicians" for whom Bruce holds such contempt.  (I would have
thought he would consider this a good thing!)

Probably the one feature of RFC 3066bis that Bruce will latch onto most
strongly is that it returns the region subtag CS to the meaning
"Czechoslovakia" (note I did not say that name is "official" or
"definitive"; it could also be the Czech and Slovak Federative
Republic).  That means the tag "sr-CS" under RFC 3066bis will mean
Serbian as spoken in Czechoslovakia, *not* Serbia and Montenegro.  Does
this break RFC 3066?  I doubt it.  Because of the furor that surrounded
the reassignment of CS, there's probably at least as much data on the
Internet that uses CS to mean Czechoslovakia as Serbia and Montenegro,
if not more.  Both meanings are undoubtedly in use, which is a source of
confusion that RFC 3066bis attempts to rectify.  It is a matter of
locking the stable doors after *one* horse has escaped, rather than
letting all the others escape while we wring our hands about the first
one.

John Cowan wrote:

> The CS case is particularly gratuitous, as its denotation changed from
> "Czechoslovakia" (a no longer existent country) to "Serbia and
> Montenegro" (a newly created country).

This is only part of the problem with CS; the rest of the problem is
that "CS meaning Serbia and Montenegro" was added only 10 years after
"CS meaning Czechoslovakia" was removed.  10 years may be a long time as
far as computer technology is concerned, but it is not long at all as
far as existing data stores are concerned.  Unlike the other reused ISO
3166 codes, there were actual domain names that ended with ".cs".  Many
people, admittedly not all the brightest lights in the harbor, still
refer to Czechoslovakia as if it were still a single country.
Reallocating CS, despite its apparent mnemonic value, was a critical
blow to stability, and in many ways it served as the poster child for
the movement to update RFC 3066.

Not all protocols and standards follow the ISO codes precisely.  The
following exchange between Peter and Bruce was quite revealing:

>> The usability flaw in treating ISO 639 and ISO 3166 as human-readable
>> is evident in the confusion between ja and JP (or is it jp and JA?),
>> and GB vs UK.
>
> Without looking I can easily tell that jp and uk are country
> codes precisely *because* they are well-known as TLDs.

BZZZT!  Wrong!  Thank you for playing.

UK is **NOT** an ISO 3166 country code; GB is.  That was Peter's entire
point.  The TLD mechanism deviates from ISO 3166 in using ".uk" to
represent the UK instead of ".gb" (although there are reportedly also
some .gb domain names).  Does that make the TLD mechanism horribly
broken, or unresponsive to change?  Is IANA sticking its fingers in its
ears and going NA-NA-NA for not changing the TLD for the United Kingdom
to ".gb"?

** Accessibility of base standards **

Thankfully, the repetitive hammering on this topic seems to have faded.
The issue of "accessibility," as I see it, is really a derived
requirement from the desire to allow "deprecated" or "withdrawn" ISO
codes to remain valid subtags, in the interest of stability.  For
example, "iw" is a valid alias for "he" (both meaning Hebrew), and "NH"
(New Hebrides) is a valid alias for "VU" (Vanuatu), both by design.
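
One way an application might treat such aliases, sketched in Python.
The tables hold only the two pairs mentioned above, and all the names
are hypothetical; a real implementation would presumably load the pairs
from a registry, and whether to rewrite old codes at all is its own
decision.

```python
# Hypothetical alias tables: retired code -> current code.
LANGUAGE_ALIASES = {"iw": "he"}   # Hebrew
REGION_ALIASES = {"NH": "VU"}     # New Hebrides -> Vanuatu

def canonical_language(subtag: str) -> str:
    """Map a retired language code to its replacement, if one is known."""
    s = subtag.lower()
    return LANGUAGE_ALIASES.get(s, s)

def canonical_region(subtag: str) -> str:
    """Map a retired region code to its replacement, if one is known."""
    s = subtag.upper()
    return REGION_ALIASES.get(s, s)

# Old and new codes compare equal after mapping, so old data keeps working.
print(canonical_language("iw") == canonical_language("he"))  # True
print(canonical_region("NH"))                                # VU
```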

The problem is that the job of the ISO maintenance agencies and
registration authorities is to maintain current, up-to-date code lists.
That means retiring the old codes that have been superseded.  But if
these old codes are still valid for language tags, and there is no
registry, where do you go to find them?  For language codes you must
visit the "ISO 639-2/RA Change Notice" page on the Web, while for
country codes you must either procure a copy of ISO 3166-3, "Code for
formerly used names of countries," at a cost of 67 CHF, or resort to an
unofficial listing such as Clive Feather's.

This is the real "accessibility" problem.  It would not exist if we
wanted to sacrifice stability by disallowing "outdated" subtags that
have been retired in the ISO standards.

** Names constrained to ASCII **

I can see a modest benefit in allowing names, even in en-Latn, to
contain non-ASCII characters.  "Côte d'Ivoire" is the best-known
example.  Again, though, since the purpose of the description field is
to *identify* the language/script/country/whatever being coded, NOT to
provide an official name for it, this is hardly a major limitation.  It
was probably intended to ensure readability on all platforms, regardless
of their underlying encoding.

** Reference to ISO 8601 **

It is true that ISO 8601 allows a bewildering variety of date formats,
and that the draft should indicate specifically that the format
yyyy-mm-dd, with which ISO 8601 is most commonly associated, is
intended.  Again, this is easily rectified.  It is really not an
indication of a horribly broken draft, since dates appearing in the
registry are such a small detail of the RFC 3066bis framework -- you can
ignore them completely unless you are creating a validating processor.
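
For those who are creating a validating processor, pinning the format
down is a few lines of Python.  This is a sketch under the assumption
that only the extended calendar form yyyy-mm-dd is wanted; the function
name is invented for the example.

```python
import re
from datetime import datetime

# Exactly four digits, two digits, two digits, separated by hyphens.
DATE_LAYOUT = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")

def is_registry_date(text: str) -> bool:
    """True if text is a real calendar date written exactly as yyyy-mm-dd."""
    # strptime on its own tolerates missing zero padding, so check the
    # layout first and let strptime reject impossible dates.
    if DATE_LAYOUT.fullmatch(text) is None:
        return False
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return False
    return True

print(is_registry_date("2004-12-13"))  # True
print(is_registry_date("20041213"))    # False: ISO 8601 basic format
print(is_registry_date("2004-W51-1"))  # False: ISO 8601 week date
print(is_registry_date("2004-13-01"))  # False: no thirteenth month
```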

** Random quotes and responses **

Here are a few additional noteworthy items I have picked out of 44
messages.

>> - vernacular granularity has nothing to do with geography and
>> countries.
>
> True in general; but can we reverse the precedent
> set by RFC 1766?

This seems like a strange request from someone who is concerned about
breaking existing protocols.

> The draft in question apparently seeks to get IANA into the
> business of defining countries (and languages), usurping
> those roles from ISO (as also noted in RFC 1591).

RFC 3066bis is very explicit (Section 3.2) about the conditions that
control which subtags are assigned to languages, scripts, and countries.
Only a deliberate misreading of the draft could lead one to believe it
has anything to do with defining the languages, scripts, and countries
themselves.

> I hope that I need not repeat any of the well-known remarks
> about "statistics".  Nor that I need point to the many uses
> by politicians of statistics (and statisticians) for
> political purposes.

This venting about politics and politicians and statisticians may be a
good emotional release, but it doesn't really accomplish anything as far
as the draft is concerned, does it?

** Conclusion **

There are serious user requirements RIGHT NOW for some of the key
features of RFC 3066bis, notably the ability to specify script as a part
of a language tag.  There is also confusion over the meaning of the
region subtag CS, which the draft resolves (though perhaps not in the
way Bruce would like).

If deemed necessary, it would seem reasonable to:

(1) add wording to clarify that subtag descriptions are not official or
normative (and never have been)

(2) specify the rationale for making the registry English-only, instead
of making it look like an oversight

(3) fix the "grandfathered" ABNF to make it conform to RFC 3066 (even
though it is not necessary)

(4) specify that the yyyy-mm-dd format (not just "ISO 8601") is used for
dates in the registry

so long as these minor changes do not delay the acceptance of the new
RFC.

RFC 3066bis has been discussed for over a year and has benefited from
the input of many veterans and newcomers to the IETF process.  There are
good and valid reasons for everything it adds and changes: the registry,
the stability safeguards, the new subtag types.  It is the product of
extensive experience, thought, and informed debate, and will enhance the
usability of language tags.  It should be approved with, at most, the
editorial changes listed above.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/