Wikimedia language codes
Doug Ewell
dewell at adelphia.net
Mon Nov 13 07:44:46 CET 2006
Gerard Meijssen <gerardm at wiktionaryz dot org> wrote:
> The Wikimedia Foundation has at this moment in time exactly 250
> different Wikipedia projects. Some of them have a code that is
> incompatible with any ISO-639 code of any version. There are projects
> that have codes that are squatting on existing ISO-639 codes. There
> are codes that have been made up that currently do not trespass on
> what are the codes of other languages however, I would not be
> surprised when this infringes on the terms of use of the ISO-639
> codes. My understanding is that it is not permitted to use codes that
> can be mistaken for valid codes.
Any of the maintenance or registration authorities for any ISO standard,
not to mention the RFC 4646 people, will tell you it's a bad idea to
create your own code elements in the normal code space, regardless of
whether they collide with existing, official code elements.
ISO 639 provides a private-use space for people who need additional code
elements: the code elements "qaa" through "qtz" may be freely used for
this purpose. (Of course, the things encoded should be true languages,
in the spirit of ISO 639.) This is a far better idea than using a real
code element like "ltg" and hoping the ISO 639 folks never formally
assign it.
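As a minimal sketch of how a project might test whether a three-letter code falls inside that private-use range, the following Python function checks for membership in "qaa" through "qtz". The function name is hypothetical and is not taken from any standard or library.

```python
def is_private_use_639(code: str) -> bool:
    """Return True if `code` lies in the ISO 639 private-use range qaa..qtz.

    Hypothetical helper for illustration only.
    """
    code = code.lower()
    return (
        len(code) == 3
        and code.isascii()
        and code.isalpha()
        # For three lowercase ASCII letters, string comparison matches
        # alphabetical order, so this covers exactly qaa..qtz.
        and "qaa" <= code <= "qtz"
    )


print(is_private_use_639("qaa"))  # True
print(is_private_use_639("ltg"))  # False
```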
RFC 4646 provides a secondary private-use mechanism: all or part of the
language tag may begin with "x-" to indicate that what follows is
privately defined. For example, "x-foobar" is a private tag for the
Foobar language, while "nl-x-foobar" is some Foobar dialect or variant
of Dutch.
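The two examples above can be taken apart mechanically. This is a rough sketch, assuming the tag is already well-formed; it does no validation, and the function name is hypothetical. It relies on the fact that a single-letter "x" subtag can only be the private-use marker.

```python
def split_private_use(tag: str):
    """Split a language tag into (public subtags, private-use subtags).

    Hypothetical helper; assumes a well-formed tag and does not validate it.
    """
    subtags = tag.lower().split("-")
    if "x" in subtags:
        marker = subtags.index("x")
        # Everything after the first "x" singleton is privately defined.
        return subtags[:marker], subtags[marker + 1:]
    return subtags, []


print(split_private_use("x-foobar"))     # ([], ['foobar'])
print(split_private_use("nl-x-foobar"))  # (['nl'], ['foobar'])
```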
Wikimedia is a large enough project, with enough knowledgeable
participants, that somebody ought to be able to set it on the right
course with regard to language tags.
> One of the disputes is about the Belaruse wikipedia that has been
> squatted by people who insist on using an orthography that is not the
> official one. There is a vibrant group of Belaruse using the official
> orthography that wants to claim on the same domain. This is one among
> many, most are largely political.
1. Don't invent a new language subtag, as if the language in question
were no longer Belarusian.
2. Don't invent a new script subtag. The orthography in question is
almost certainly either Cyrillic script or Latin script.
3. Don't invent a new variant subtag, unless you propose it on this
list and the Language Subtag Reviewer approves it.
You can always use an RFC 4646 private-use subtag within your project to
identify this orthography unequivocally: "be-x-alt-ortho" or something
more mnemonic. This is what private-use subtags are for.
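A project could build such tags with a small helper. The sketch below assumes the RFC 4646 rule that each private-use subtag is one to eight ASCII letters or digits; the function name is hypothetical, and the base tag is not validated.

```python
import re

# Each private-use subtag: 1 to 8 ASCII alphanumerics.
_PRIVATE_SUBTAG = re.compile(r"[A-Za-z0-9]{1,8}")


def with_private_use(base: str, *private: str) -> str:
    """Append private-use subtags to `base` (pass "" for a fully private tag).

    Hypothetical helper; `base` itself is not checked.
    """
    if not private:
        raise ValueError("at least one private-use subtag is required")
    for subtag in private:
        if not _PRIVATE_SUBTAG.fullmatch(subtag):
            raise ValueError(f"invalid private-use subtag: {subtag!r}")
    parts = ([base] if base else []) + ["x", *private]
    return "-".join(parts)


print(with_private_use("be", "alt", "ortho"))  # be-x-alt-ortho
```

Rejecting over-long subtags early keeps a mnemonic name such as "orthography" (11 characters) from producing a malformed tag.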
> One of our problems is not solved because you do not consider the
> ISO-639-3 "official". This is the existence of a Wikipedia in
> Maldovan. What we do understand is that none of the ISO-639-3 codes
> will ever be used other then for its defined purpose.
ISO 639-3 *isn't* official. It is still a draft standard, and because
of that, it must not be referenced normatively by any other standard or
RFC. The ISO 639-3 people would be the very first to tell you that. I
know there are a lot of people who don't believe that ISO 639-3 is not
yet "live," partly because the Ethnologue data which underlies much of
the 639-3 work has been around so long, but it is the truth.
There is an active project called LTRU, which others have mentioned,
whose primary activity is to update RFC 4646 to take ISO 639-3 into
account WHEN that standard is approved and published.
> An often recurring theme in our request for new projects is that
> people claim that something is a language. It happens regularly that
> the proponents point to what should be amounts of impressive content
> either in archives, libraries on the Internet, all stuff that is to
> most of us goobledegook. Often it is claimed that they have applied
> for recognition for their language. It does not make sense to request
> it from anyone but Ethnologue as the ISO-639-2 is at its end of life.
First, as others have said, the claim that "ISO 639-2 is at its end of
life" is simply not true. ISO 639-1 and 639-2 will continue to be
supported and will continue to find substantial use after 639-3 is
published.
Second, as others have also said, it can be very slippery trying to
second-guess ISO 639/JAC as to what is and isn't a language. It might
be best in your position to assign private-use tags to cover the
languages that ISO does not recognize but that your users claim (for
whatever reason) are languages. If someone comes along and does a full
version of WiktionaryZ in Unilingua (Mirad), you will be better off
tagging it "x-mirad" than doing anything else.
> There was some earlier discussion of the Min-Nan language on this
> mailing list. For your information both the Min-Nan Wiktionary and
> Wikipedia are not in either the Hant or the Hans script, it uses
> Latn. When you start off from zh as the basis you insist on and equally
> the people who write Min-Nan without exception use Latn, the code
> zh-nan-Latn is not logical at all. NB these are really active
> projects.
What is illogical about "zh-nan-Latn"? Is it that the script subtag
"Latn" is unnecessary because Min Nan speakers use Latin? If that is
the case, could you not simply write "zh-nan"?
> For the Wikimedia Foundation there are a number of options;
>
> * We use our WMF language codes internally and externally. This is
> imho from a standards point of view a worst case scenario
Close, if not "the" worst.
> * We use our codes internally and externally we advertise the
> "official" codes.
If you do that, you might cause 1-to-1 correspondence problems for
yourself.
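One way to spot such correspondence problems is to look for advertised tags that more than one internal code maps to. The sketch below assumes a project keeps a dictionary from internal codes to advertised tags; the function name and the project codes in the example are hypothetical.

```python
from collections import defaultdict


def find_collisions(internal_to_official):
    """Return advertised tags that more than one internal code maps to.

    Hypothetical helper for illustration only.
    """
    by_official = defaultdict(list)
    for internal, official in internal_to_official.items():
        by_official[official.lower()].append(internal)
    return {
        official: sorted(internals)
        for official, internals in by_official.items()
        if len(internals) > 1
    }


# Hypothetical internal project codes, not real WMF codes.
mapping = {"proj-a": "be", "proj-b": "be", "proj-c": "nl"}
print(find_collisions(mapping))  # {'be': ['proj-a', 'proj-b']}
```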
> * We sanitise our codes so that there is at least no conflict with
> the ISO-639 codes. We use them internally and we advertise the
> "official" codes.
The rule is simple: *Do not* use unassigned, non-private-use ISO 639
codes as if they were assigned, for any reason. There is no
justification for doing so.
> * We move away from our current codes and only use "official" codes
> both internally and externally.
Naturally, I'd recommend that approach.
> There are at least two lists I would like to have that would help:
>
> * A list with the all the ISO-639 codes (1, 2 and 3) and the codes
> that these languages have under RFC 4646.
> * A list with the WMF language codes and the language codes under
> RFC 4646.
If you choose to follow RFC 4646, you want the Language Subtag Registry:
http://www.iana.org/assignments/language-subtag-registry
This will save you from having to worry about whether to support "eng"
as well as "en" (you don't) and will answer other questions.
If you choose not to follow RFC 4646, but only the underlying ISO
standards, then you should use the ISO code lists directly. The only
real differences between the ISO 639 code list and the list of language
subtags in the Registry are that the Registry includes subtags that have
been removed from ISO (and marks them "deprecated"), and does not
include alpha-3 subtags for languages that have an ISO 639-1-based
alpha-2 subtag.
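For anyone who wants to process the Registry programmatically, here is a rough sketch of a reader for its record format: records separated by "%%" lines, "Field: value" lines, and continuation lines that begin with whitespace. The function names are hypothetical, and SAMPLE is a hand-written excerpt for illustration only; its field values and dates should not be relied on, so consult the Registry itself for the real data.

```python
# Hand-written illustrative excerpt, NOT authoritative Registry data.
SAMPLE = """\
File-Date: 2006-01-01
%%
Type: language
Subtag: en
Description: English
Added: 2005-10-16
%%
Type: language
Subtag: in
Description: Indonesian
Added: 2005-10-16
Deprecated: 1989-01-01
Preferred-Value: id
"""


def parse_registry(text):
    """Parse registry text into a list of {field: [values]} dictionaries."""
    records, record, key = [], {}, None
    for line in text.splitlines():
        if line.strip() == "%%":
            # Record separator: store the record collected so far.
            if record:
                records.append(record)
            record, key = {}, None
        elif line[:1] in (" ", "\t") and key is not None:
            # Folded line: continue the previous field's last value.
            record[key][-1] += " " + line.strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            key = key.strip()
            record.setdefault(key, []).append(value.strip())
    if record:
        records.append(record)
    return records


def deprecated_languages(records):
    """Map each deprecated language subtag to its preferred value, if any."""
    return {
        r["Subtag"][0]: r.get("Preferred-Value", [None])[0]
        for r in records
        if r.get("Type") == ["language"] and "Deprecated" in r
    }


print(deprecated_languages(parse_registry(SAMPLE)))  # {'in': 'id'}
```

Field values are kept as lists because a record may repeat a field, such as Description.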
--
Doug Ewell * Fullerton, California, USA * RFC 4645 * UTN #14
http://users.adelphia.net/~dewell/
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages