Sample IANA language subtag registry

Thu Jul 8 18:16:48 CEST 2004

Peter Constable <petercon at microsoft dot com> wrote:

> Under the terms of the new RFC the only subtags that should need to be
> registered are the language IDs that are not in ISO 639, or variant
> IDs.
>
> There is no need to create a registry of existing ISO 639 IDs, ISO
> 3166 IDs, or ISO 15924 IDs.

Section 3 says there is:

"The previous registry under RFC 3066 contained only a few registered
tags. The new registry, under this document, contains a comprehensive
list of all of the subtags valid in language tags. This allows
implementers a straightfoward [sic] and reliable way to validate
language tags." (page 20)

Section 3.2 (page 23) shows an example of the new registry format,
including existing ISO 639, ISO 15924, ISO 3166, and UN IDs.

> We should not be providing lists that mirror the code tables for those
> standards. The *only* reason to provide a list of ISO 639 IDs or ISO
> 3166 IDs, etc., would be if we explicitly wanted to limit the accepted
> values from one of those sources (such as I suggested at one point
> that we do for ISO 639-1). If we simply mirror what is published
> elsewhere, then we will inevitably create synchronization problems.

But there IS a desire to limit the accepted values.  Users may not use
"CS" to mean "Serbia and Montenegro," for example, even though that is a
perfectly acceptable code element as far as ISO 3166 is concerned.
There will probably be other examples in the future, as ISO 3166
considers its job to encode "names of countries" and makes changes in
codes to reflect changes in names.

And there is now also this concept of "canonical" and "alias" subtags,
such that the long-deprecated "iw" is acceptable for "Hebrew," but
considered to be an alias for the canonical subtag "he."  Likewise, "ZR"
is the canonical region subtag for Congo-Kinshasa, not "CD" as one would
gather from looking at ISO 3166.  At the very least, to support this
concept, you'd need a list of aliases together with their canonical
equivalents.

Mirroring the code tables is not the worst idea I've ever heard.  It
obviates the need to look in four or five separate places to find out
what subtags are valid, and it solves the critical problem that those
places *still* don't tell the full story:

* Both iw and he are valid for Hebrew, but he (the newer) is preferred.
* Both ZR and CD are valid for Congo-Kinshasa, but ZR (the older) is
preferred.
* CS is valid, but for Czechoslovakia (which is no longer listed in ISO
3166), not for Serbia and Montenegro (which is).
* DD is still valid for East Germany, although that entity dissolved 14
years ago.
* 830 is valid for Channel Islands, but only because there is no ISO
3166 code for them (and confirming that requires an exhaustive search).
* et cetera.
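
To make the cases above concrete, here is a minimal sketch of what a single comprehensive registry buys an implementer. The record layout, the REGISTRY table, and the lookup() helper are all hypothetical, invented for illustration; they are not the registry format given in the draft. The entries only restate the examples listed above.

```python
# Hypothetical excerpt of a subtag registry. The layout (kind, subtag) ->
# record is an assumption for illustration, not the draft's format.
REGISTRY = {
    ("language", "he"): {"description": "Hebrew", "canonical": None},
    ("language", "iw"): {"description": "Hebrew", "canonical": "he"},
    ("region", "ZR"): {"description": "Congo-Kinshasa", "canonical": None},
    ("region", "CD"): {"description": "Congo-Kinshasa", "canonical": "ZR"},
    ("region", "CS"): {"description": "Czechoslovakia", "canonical": None},
    ("region", "DD"): {"description": "East Germany", "canonical": None},
    ("region", "830"): {"description": "Channel Islands", "canonical": None},
}


def normalize(kind, subtag):
    """Apply the usual case conventions: languages lower, regions upper."""
    return subtag.lower() if kind == "language" else subtag.upper()


def lookup(kind, subtag):
    """Return (canonical_subtag, description), or None if not registered."""
    key = normalize(kind, subtag)
    record = REGISTRY.get((kind, key))
    if record is None:
        return None
    # An alias points at its canonical equivalent; a canonical entry
    # (canonical is None) simply resolves to itself.
    canonical = record["canonical"] or key
    return canonical, record["description"]


if __name__ == "__main__":
    print(lookup("language", "iw"))  # alias resolved to its canonical subtag
    print(lookup("region", "CD"))    # alias resolved to its canonical subtag
    print(lookup("region", "CS"))    # valid, but as Czechoslovakia
    print(lookup("region", "YU"))    # not in this excerpt, so None
```

One table answers both questions at once: whether a subtag is valid, and which subtag is the canonical one. Neither answer can be read directly out of the ISO code tables.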

> And there is no reason to provide lists of IDs along with names that
> have been normalized to some constraint, such as using only ASCII
> letters. We are not providing a standard set of names for language,
> countries and scripts.

The names are indeed not standardized, only the codes.  But the draft
does say that the names are to be "transcribed into ASCII," and
transcription can occur in different ways.  For all I know, "Bokmål"
should really be transcribed as "Bokmaal."
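
As a rough illustration of how two reasonable transcription rules diverge, consider the sketch below. Both functions and the DIGRAPHS table are hypothetical, and neither is claimed to be the rule the draft intends.

```python
import unicodedata


def strip_diacritics(name):
    """Transcribe by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical digraph table; only the one letter needed here is listed.
DIGRAPHS = {
    "\u00e5": "aa",  # LATIN SMALL LETTER A WITH RING ABOVE
}


def expand_digraphs(name):
    """Transcribe by replacing each listed letter with an ASCII digraph."""
    return "".join(DIGRAPHS.get(ch, ch) for ch in name)


if __name__ == "__main__":
    original = "Bokm\u00e5l"
    print(strip_diacritics(original))  # Bokmal
    print(expand_digraphs(original))   # Bokmaal
```

Both results are plain ASCII, yet they differ, so "transcribed into ASCII" alone does not determine a single spelling.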

> Sorry, Doug, but this time I think the work you have done is badly
> misguided (a rare occurrence).

(Thanks for the compliment.)

I'm only following what is clearly stated in Section 3 of the draft.  If
this is not intended, then the example in Figure 3 is very misleading,
and the text should be clarified as well.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/