Wikimedia language codes

Mon Nov 13 07:44:46 CET 2006

Gerard Meijssen <gerardm at wiktionaryz dot org> wrote:

> The Wikimedia Foundation has at this moment in time exactly 250 
> different Wikipedia projects. Some of them have a code that is 
> incompatible with any ISO-639 code of any version. There are projects 
> that have codes that are squatting on existing ISO-639 codes. There 
> are codes that have been made up that currently do not trespass on 
> the codes of other languages; however, I would not be 
> surprised if this infringes on the terms of use of the ISO-639 
> codes. My understanding is that it is not permitted to use codes that 
> can be mistaken for valid codes.

Any of the maintenance or registration authorities for any ISO standard, 
not to mention the RFC 4646 people, will tell you it's a bad idea to 
create your own code elements in the normal code space, regardless of 
whether they collide with existing, official code elements.

ISO 639 provides a private-use space for people who need additional code 
elements: the code elements "qaa" through "qtz" may be freely used for 
this purpose.  (Of course, the things encoded should be true languages, 
in the spirit of ISO 639.)  This is a far better idea than using a real 
code element like "ltg" and hoping the ISO 639 folks never formally 
assign it.
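
To make the range concrete, here is a minimal Python sketch of a check 
for the ISO 639 private-use block.  The function name is my own 
invention for illustration, not part of any standard library or tool.

```python
def is_iso639_private_use(code: str) -> bool:
    """Return True if `code` lies in the ISO 639 private-use range qaa..qtz.

    Hypothetical helper for illustration; comparison is case-insensitive.
    """
    code = code.lower()
    return (
        len(code) == 3
        and code.isascii()
        and code.isalpha()
        # For three lowercase ASCII letters, string ordering matches
        # the alphabetical ordering of the code elements.
        and "qaa" <= code <= "qtz"
    )
```

With this check, "qaa" and "qtz" are accepted, while "ltg" and "qua" are 
rejected because they fall in the normal code space.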

RFC 4646 provides a secondary private-use mechanism: all or part of the 
language tag may begin with "x-" to indicate that what follows is 
privately defined.  For example, "x-foobar" is a private tag for the 
Foobar language, while "nl-x-foobar" is some Foobar dialect or variant 
of Dutch.
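
As a rough sketch of how software might separate the public part of a 
tag from the private part, assuming only that the first single-letter 
"x" subtag starts the private-use portion (the function name is 
hypothetical):

```python
def split_private_use(tag: str):
    """Split a language tag into (public_part, private_subtags).

    Hypothetical helper: everything after the first "x" singleton is
    treated as privately defined, as RFC 4646 describes.
    """
    subtags = tag.split("-")
    lowered = [s.lower() for s in subtags]
    if "x" not in lowered:
        return tag, []
    i = lowered.index("x")
    return "-".join(subtags[:i]), subtags[i + 1:]
```

Under that assumption, "x-foobar" yields an empty public part with the 
private subtag "foobar", and "nl-x-foobar" yields "nl" plus "foobar".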

Wikimedia is a large enough project, with enough knowledgeable 
participants, that somebody ought to be able to set it on the right 
course with regard to language tags.

> One of the disputes is about the Belarusian Wikipedia that has been 
> squatted by people who insist on using an orthography that is not the 
> official one. There is a vibrant group of Belarusians using the official 
> orthography that wants to claim the same domain. This is one among 
> many; most are largely political.

1.  Don't invent a new language subtag, as if the language in question 
were no longer Belarusian.

2.  Don't invent a new script subtag.  The orthography in question is 
almost certainly either Cyrillic script or Latin script.

3.  Don't invent a new variant subtag, unless you propose it on this 
list and the Language Subtag Reviewer approves it.

You can always use an RFC 4646 private-use subtag within your project to 
identify this orthography unequivocally:  "be-x-alt-ortho" or something 
more mnemonic.  This is what private-use subtags are for.
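
If you want to generate such tags programmatically, something like the 
following Python sketch might do.  It assumes only the RFC 4646 rule 
that each subtag after "x" is 1 to 8 ASCII letters or digits; the 
function name is made up for this example, and the language subtag 
itself is not validated here.

```python
import re

# RFC 4646 private-use subtags: 1 to 8 alphanumeric characters each.
_PRIVATE_SUBTAG = re.compile(r"[A-Za-z0-9]{1,8}")


def make_private_tag(language: str, *private: str) -> str:
    """Build a tag of the form language-x-sub1-sub2 (hypothetical helper)."""
    if not private:
        raise ValueError("at least one private-use subtag is required")
    for subtag in private:
        if not _PRIVATE_SUBTAG.fullmatch(subtag):
            raise ValueError(f"invalid private-use subtag: {subtag!r}")
    return "-".join([language, "x", *private])
```

Calling make_private_tag("be", "alt", "ortho") gives "be-x-alt-ortho", 
while a subtag such as "orthography" is rejected for exceeding eight 
characters.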

> One of our problems is not solved because you do not consider the 
> ISO-639-3 "official". This is the existence of a Wikipedia in 
> Moldovan.  What we do understand is that none of the ISO-639-3 codes 
> will ever be used other than for its defined purpose.

ISO 639-3 *isn't* official.  It is still a draft standard, and because 
of that, it must not be referenced normatively by any other standard or 
RFC.  The ISO 639-3 people would be the very first to tell you that.  I 
know there are a lot of people who don't believe that ISO 639-3 is not 
yet "live," partly because the Ethnologue data which underlies much of 
the 639-3 work has been around so long, but it is the truth.

There is an active project called LTRU, which others have mentioned, 
whose primary activity is to update RFC 4646 to take ISO 639-3 into 
account WHEN that standard is approved and published.

> An often recurring theme in our request for new projects is that 
> people claim that something is a language. It happens regularly that 
> the proponents point to what should be amounts of impressive content 
> either in archives, libraries on the Internet, all stuff that is to 
> most of us gobbledygook. Often it is claimed that they have applied 
> for recognition for their language. It does not make sense to request 
> it from anyone but Ethnologue as the ISO-639-2 is at its end of life.

First, as others have said, the claim that "ISO 639-2 is at its end of 
life" is simply not true.  ISO 639-1 and 639-2 will continue to be 
supported and will continue to find substantial use after 639-3 is 
published.

Second, as others have also said, it can be very slippery trying to 
second-guess ISO 639/JAC as to what is and isn't a language.  It might 
be best in your position to assign private-use tags to cover the 
languages that ISO does not recognize but that your users claim (for 
whatever reason) are languages.  If someone comes along and does a full 
version of WiktionaryZ in Unilingua (Mirad), you will be better off 
tagging it "x-mirad" than doing anything else.

> There was some earlier discussion of the Min-Nan language on this 
> mailing list. For your information both the Min-Nan Wiktionary and 
> Wikipedia are not in either the Hant or the Hans script, it uses 
> Latn. When you start off from zh as the basis you insist on and equally 
> the people who write Min-Nan without exception use Latn, the code 
> zh-nan-Latn is not logical at all. NB these are really active 
> projects.

What is illogical about "zh-nan-Latn"?  Is it that the script subtag 
"Latn" is unnecessary because Min Nan speakers use Latin?  If that is 
the case, could you not simply write "zh-nan"?

> For the Wikimedia Foundation there are a number of options;
>
>    * We use our WMF language codes internally and externally. This is
>      imho from a standards point of view a worst case scenario

Close, if not "the" worst.

>    * We use our codes internally, and externally we advertise the
>      "official" codes.

If you do that, you might cause 1-to-1 correspondence problems for 
yourself.

>    * We sanitise our codes so that there is at least no conflict with
>      the ISO-639 codes. We use them internally and we advertise the
>      "official" codes.

The rule is simple:  *Do not* use unassigned, non-private-use ISO 639 
codes as if they were assigned, for any reason.  There is no 
justification for doing so.

>    * We move away from our current codes and only use "official" codes
>      both internally and externally.

Naturally, I'd recommend that approach.

> There are at least two lists I would like to have that would help:
>
>    * A list with the all the ISO-639 codes (1, 2 and 3) and the codes
>      that these languages have under RFC 4646.
>    * A list with the WMF language codes and the language codes under
>      RFC 4646.

If you choose to follow RFC 4646, you want the Language Subtag Registry:
http://www.iana.org/assignments/language-subtag-registry

This will save you from having to worry about whether to support "eng" 
as well as "en" (you don't) and will answer other questions.

If you choose not to follow RFC 4646, but only the underlying ISO 
standards, then you should use the ISO code lists directly.  The only 
real differences between the ISO 639 code list and the list of language 
subtags in the Registry are that the Registry includes subtags that have 
been removed from ISO (and marks them "deprecated"), and does not 
include alpha-3 subtags for languages that have an ISO 639-1-based 
alpha-2 subtag.
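
For what it's worth, the Registry is a plain-text file of "Name: value" 
records separated by "%%" lines, so it is easy to read by program.  
Below is a minimal Python sketch.  The SAMPLE text is an illustrative 
excerpt in the Registry's format, not a verified copy of the live file, 
and the function names are my own.  Folded continuation lines are 
ignored and only the first value of a repeated field is kept, so treat 
this as a starting point rather than a complete parser.

```python
# Illustrative excerpt in the Registry's record format (an assumption,
# not a verified copy of the live file).
SAMPLE = """\
Type: language
Subtag: en
Description: English
Added: 2005-10-16
Suppress-Script: Latn
%%
Type: language
Subtag: iw
Description: Hebrew
Added: 2005-10-16
Deprecated: 1989-01-01
Preferred-Value: he
%%
Type: script
Subtag: Latn
Description: Latin
Added: 2005-10-16
"""


def parse_registry(text):
    """Parse "%%"-separated records into a list of dicts (simplified)."""
    records, current = [], {}
    for line in text.splitlines():
        if line.strip() == "%%":
            if current:
                records.append(current)
            current = {}
        elif line[:1].isspace() or ":" not in line:
            continue  # folded continuation or blank line: skipped in this sketch
        else:
            name, _, value = line.partition(":")
            # Keep the first value if a field name repeats.
            current.setdefault(name.strip(), value.strip())
    if current:
        records.append(current)
    return records


def language_subtags(records, include_deprecated=False):
    """List language subtags, skipping deprecated ones unless asked."""
    return [
        r["Subtag"]
        for r in records
        if r.get("Type") == "language"
        and (include_deprecated or "Deprecated" not in r)
    ]
```

On the sample above, language_subtags(parse_registry(SAMPLE)) returns 
only "en"; passing include_deprecated=True also returns "iw", which is 
how you would detect subtags that have been removed from ISO.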

--
Doug Ewell  *  Fullerton, California, USA  *  RFC 4645  *  UTN #14
http://users.adelphia.net/~dewell/
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages