Sample IANA language subtag registry

Sun Jul 11 21:25:58 CEST 2004

Peter Constable <petercon at microsoft dot com> wrote:

> As long as the RFC says that the source for particular sub-tags is ISO
> 639, then there will be people that go to that source to look for IDs
> rather than a data file on the IANA site (and likewise for 3166 and
> 15924).

We're already not using ISO 639 and 3166.  We're using code lists that
are *derived* from ISO 639 and 3166, but with significant additional
rules:

* Most ISO codes that have been removed from their respective standard
may still be used, with their original meaning.  For example, "iw" for
Hebrew.  This CANNOT be derived from the standard alone.

* Exception: those that were reassigned to another meaning far enough in
the past.  For example, "AI" for French Afars and Issas was reassigned
to Anguilla, and we now use the new meaning; "CS" for Czechoslovakia
was reassigned to Serbia and Montenegro, yet we still use the old
meaning.  The difference has to do with a cutoff date, and also
cannot be derived from looking at the ISO standard.

* Some codes are "canonical" and preferred, while others are "aliases"
that are allowed but may be converted to their canonical equivalents.
For example, "ji" and "yi" both are OK and mean Yiddish, and must be
folded together (to "yi") in comparison.  This could ALMOST be inferred
from looking at ISO 639, but certainly not from ISO 3166.

* Again, depending on the cutoff date, sometimes the older code is
canonical, sometimes the newer code is.  For example, "TP" (which
originally meant "Portuguese Timor") is the canonical form of "TL"
("Timor-Leste").  You won't learn this by looking at the ISO standard
either.

* As for the UN M.49 codes, there are two main categories -- country
codes and macro-regional codes -- and in each category, there are some
codes that may be used and some that MUST NOT be used.  The UN code
lists don't say which are which.  This is why I originally put together
a list sorting it all out.
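
To make the alias rules above concrete, here is a minimal sketch of how
a tag processor might fold aliases to their canonical forms before
comparing.  The table and function names are hypothetical, and the
tables hold only the examples given in this message; a real
implementation would load the complete mappings from the registry.

```python
# Hypothetical alias tables, populated only with examples from this message.
# A real implementation would read the full mappings from the registry.
LANGUAGE_ALIASES = {"ji": "yi"}  # both mean Yiddish; "yi" is canonical
REGION_ALIASES = {"TL": "TP"}    # under the cutoff rule, "TP" is canonical


def canonical_language(subtag: str) -> str:
    """Lowercase a language subtag and fold a known alias to its canonical form."""
    s = subtag.lower()
    return LANGUAGE_ALIASES.get(s, s)


def canonical_region(subtag: str) -> str:
    """Uppercase a region subtag and fold a known alias to its canonical form."""
    s = subtag.upper()
    return REGION_ALIASES.get(s, s)


def same_language(a: str, b: str) -> bool:
    """Compare two language subtags after alias folding."""
    return canonical_language(a) == canonical_language(b)
```

With these tables, same_language("ji", "yi") is true, and "iw" passes
through unchanged because this sketch has no alias entry for it.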

The only reason ISO 15924 isn't subject to the same confusion is that it
was designed in "modern" times, by a group (and led by someone) who
understood the problems with changing codes.  If ISO 15924 had been
around for 30 years, it might have had some of the same issues (for
example, "Burm" might have been replaced with "Mymr") and we would be
talking right now about one code being the canonical form of the other.

> So you won't get around the problems of synchronization and
> consistency with others unless you remove those references from the
> RFC and instead say that the source of the subtags is the data file.
> If that's what you want to happen, then you should prepare a draft
> that reflects it.

Peter is correct.  The draft should state that the codes are *derived
from* the ISO and UN standards, but with specific rules governing which
codes are allowable, which are not, and what they mean.  The language
subtag registry should be considered the source of the subtags, and
billed as such.

The registry will need to be updated promptly in response to changes to
the respective ISO standards.  But I think it is simply asking too much
for individual users to reference the ISO standards and then, as Mark
explained, walk through the history to figure out what's going on.

> If you want to see ISO 639 IDs published in a machine-readable data
> file, it is reasonably likely that the ISO 639/RA-JAC would be willing
> and able to accommodate that request.

All three ISO standards are already available as text files:

http://www.loc.gov/standards/iso639-2/ISO-639-2_values_8bits.txt
http://www.iso.org/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/list-en1-semic.txt
http://www.unicode.org/iso15924/iso15924.txt.zip

That's not good enough, though, because these lists don't include the
"withdrawn" and "deprecated" and "changed" codes that are also valid in
language tags.

Michael Everson <everson at evertype dot com> wrote:

>> Mirroring the code tables is not the worst idea I've ever heard.
>
> It is a maintenance nightmare. Don't try it. I have.

I have too -- as Tex pointed out later, anyone who writes software to
create or interpret language tags has to do this anyway -- and it turns
out I've done better at times than the respective RA or MA.  Sometimes
they update one reference without updating the other.  (WatchThatPage is
a wonderful tool when aimed at (a) the main code list and (b) the
change-notice page.)

I shouldn't have used the word "mirroring," though, because that's not
what this is.  The set of codes usable in RFC 3066bis language tags is
nothing less than a derivative work.

To keep this from being a maintenance nightmare, it's going to be more
important than ever to maintain a *single* registry.  Right now there
are three online sources for RFC 3066 registered subtags:

[1] "RFC 3066 Language code assignments"
http://www.evertype.com/standards/iso639/iana-lang-assignments.html

[2] "Directory of language tag applications"
http://www.iana.org/assignments/lang-tag-apps.htm

[3] "LANGUAGE TAGS"
http://www.iana.org/assignments/language-tags

and out of these three references:

* sgn-ZA is listed in [2] and [3] but not [1]
* sl-rozaj is listed in [1] and [3] but not [2]
* uz-Arab is listed in [1] but not [2] or [3] (there is no registration
form for uz-Arab at http://www.iana.org/assignments/lang-tags/uz-Arab)
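
A cross-check like the one above is easy to automate.  The following is
only a sketch: the function name is hypothetical, and each set holds
just the three tags discussed here, not the full contents of the
source, which would have to be fetched and parsed separately.

```python
# Each set lists only the three tags discussed above, not the source's
# full contents; the keys match the reference numbers used in this message.
SOURCES = {
    "[1]": {"sl-rozaj", "uz-Arab"},
    "[2]": {"sgn-ZA"},
    "[3]": {"sgn-ZA", "sl-rozaj"},
}


def missing_from(sources: dict) -> dict:
    """Map each tag to the sorted list of sources that do not list it.

    Tags that appear in every source are omitted from the result.
    """
    all_tags = set().union(*sources.values())
    report = {}
    for tag in sorted(all_tags):
        absent = sorted(name for name, tags in sources.items() if tag not in tags)
        if absent:
            report[tag] = absent
    return report


for tag, absent in missing_from(SOURCES).items():
    print(f"{tag} is missing from {', '.join(absent)}")
```

With a single registry, a report like this would have nothing to
compare and so nothing to get out of sync.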

Peter added:

> This is a pretty major change to be making at such a late stage when I
> thought the authors were looking for stability -- certainly we
> *should* be stabilizing if we're trying to get this wrapped up.
> Apparently Mark and Addison aren't in as much of a hurry to get this
> wrapped up as I thought.

I think it was probably a mistake not to make a prominent public
statement on the list about the registry format, and the desire to
"register" all allowable codes (including those from ISO and UN
standards).  I think it's a great idea, and I hope it can be discussed
actively and issues resolved promptly.  Peter is right; it would be good
to see this RFC published sooner rather than later.

> The issues you have identified pertain almost entirely to ISO 3166. If
> we have problems with that standard, we should address them. What is
> being done here goes rather beyond that, however.

This *is* how the problems are being addressed.  Even if you focus
strictly on ISO 639 language codes, there are rules that go beyond the
standard: do not use an alpha-3 code if a corresponding alpha-2 code
exists; "und" and "mul" SHOULD NOT be used; withdrawn codes are still
allowed but not canonical (except, counterintuitively, "sh").
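
As a rough illustration of the first two rules, a checker might look
like the sketch below.  Everything here is an assumption made for
illustration: the function name, the result strings, and the
two-entry alpha-3 table ("eng" and "fra" are just familiar examples).
The withdrawn-code rule and the "sh" exception are deliberately left
out, since they need the full registry data.

```python
# Illustrative excerpt only; a real table would cover every ISO 639
# alpha-3 code that has a corresponding alpha-2 code.
ALPHA3_TO_ALPHA2 = {"eng": "en", "fra": "fr"}

# Codes the rules say not to use in language tags.
DISCOURAGED = {"und", "mul"}


def check_language_subtag(subtag: str) -> str:
    """Apply two of the extra rules to a language subtag.

    Returns "ok" when neither rule objects; this does not prove the
    subtag is actually registered.
    """
    s = subtag.lower()
    if s in ALPHA3_TO_ALPHA2:
        return f"invalid: use '{ALPHA3_TO_ALPHA2[s]}' instead of '{s}'"
    if s in DISCOURAGED:
        return f"discouraged: '{s}' should not be used"
    return "ok"
```

Neither rule can be read off the ISO 639 code list itself, which is the
point: the checks need a table that the registry would supply.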

Maintaining a comprehensive registry will reduce the confusion and
improve interoperability.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/