Wikimedia language codes
dzo at bisharat.net
Sun Nov 12 22:27:55 CET 2006
Hi Debbie, Gerard, all,
Gerard's set of questions, against the backdrop of various other subtag
discussions and of course being aware of the larger scheme for ISO-639, has
me thinking that while this specifies ever better, the logic of it sometimes
seems to go too strongly in the direction of specifications (with some
countervailing exceptions such as the es- example and in principle the
ISO-639-5). There is at the same time a risk I think in adopting a set of
language codes for a broad purpose without (1) a critical look at what it
limits as well as what it facilitates, complemented (2) by a flexible
There are cases where I think the ISO-639-3 codes would definitely not be
ideal for localization or for Wikipedia editions, for instance. Maybe
ISO-639-1/-2, or -5 would be a more appropriate grouping. ISO-639-6 lets us
be even more specific than -3, and to group subunits in different ways (as I
understand it) but for developing an interactive space on the web, such fine
distinctions could be problematic - or might require some pretty
sophisticated interdialect MT (I'm trying to imagine what is involved here).
This is not to argue against this or that but to inject a cautionary note.
What may be required is a more artisanal approach to language definitions at
least to the extent of vetting the possible issues. Take for example four
related languages in western Uganda - Nyoro, Chiga, Nyankore, and Tooro.
Each has an ISO-639-3 code. In addition, Nyoro and Nyankore have ISO-639-2
codes which correspond to the -3 codes (neither of which has a -1 code). It
might seem pretty straightforward that you have 4 categories. However they
are so closely related, and mutually intelligible to varying but high degrees,
that they might more appropriately be treated as a single unit. In fact, it
turns out that since 1990 a standardized version for all 4 has been
developed called Runyakitara. It is not yet coded in -2 or -3 (and actually
might be considered a "macrolanguage" and thus a logical candidate for
ISO-639-2). This information is not apparent from any of the available
codes. The only way to know that is to research the language(s), and the
result of the research should be a question mark and probably support for a
new code (BTW, we're aware of the need and discussing the application).
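To make the Runyakitara point concrete, here is a minimal sketch of a hand-curated grouping layered over the ISO-639-3 codes. The four codes are the ISO-639-3 identifiers for the languages named above; the "runyakitara" label and the function name are my own placeholders for illustration, not registered codes or anyone's actual system.

```python
# ISO 639-3 codes for the four languages named above.
RUNYAKITARA_MEMBERS = {
    "nyo": "Nyoro",
    "cgg": "Chiga",
    "nyn": "Nyankore",
    "ttj": "Tooro",
}

# Hypothetical curated table: "runyakitara" is a plain label, NOT a registered
# ISO 639 code. The grouping comes from research, not from the codes themselves.
CURATED_GROUPS = {"runyakitara": frozenset(RUNYAKITARA_MEMBERS)}


def groups_for(code):
    """Return the curated group labels that contain the given ISO 639-3 code."""
    return sorted(
        label for label, members in CURATED_GROUPS.items() if code in members
    )


print(groups_for("nyn"))
```

The table has to be maintained by people who know the languages, which is the "artisanal" part.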
While that situation is a bit unusual in its particulars, it is not so far
off from the kind of thing one finds elsewhere. There are lots of instances
where the ISO-639-3 codes, which make sense by the research criteria they
were based on, are not the ideal level of categorization for all usages.
I am most familiar personally with Fula (I learned ffm for 2 years and then
transitioned to fuf for 2 years [an interesting process] and in those days
had occasion to speak with fuc speakers; later I interacted with fuh and fuq
speakers and various others along the way). None of which accords me any
special authority, but it definitely leads me to see that there is an
ongoing validity and utility to the ff/ful tags from ISO-639-1&2. At the
same time I don't know what the right level of specificity should be with
regard to Fula and its variants for all purposes. In the case of localized
software it would be crazy to have 8 separate versions by locales based on
ISO-639-3; but it might be equally unworkable to have a single localization
that includes all varieties including both outliers (my characterization) -
fuf and fub. Typically the varieties of Fula are divided into western and
eastern, which may or may not be of help. I don't think anyone now can say
what the ideal solutions are for Fula in ICT and cyberspace.
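One way to keep the choice of granularity open for Fula is to make it a parameter rather than a fixed decision. The sketch below is only an illustration under my own assumptions: the set lists just the ISO-639-3 codes mentioned in this message (not the full set of Fula varieties), and the function and parameter names are hypothetical.

```python
# Only the Fula variety codes mentioned in this message; not a complete list.
FULA_VARIETIES = frozenset({"ffm", "fuf", "fuc", "fuh", "fuq", "fub"})

FULA_MACRO_TAG = "ff"  # ISO 639-1 code for Fula (ISO 639-2: ful)


def locale_tag(iso639_3, granularity="macro"):
    """Pick a tag for a Fula variety at the requested level of specificity.

    granularity="macro"   -> the shared ff tag
    granularity="variety" -> the ISO 639-3 code itself
    """
    if iso639_3 not in FULA_VARIETIES:
        raise ValueError(f"not a Fula variety known to this sketch: {iso639_3}")
    if granularity == "macro":
        return FULA_MACRO_TAG
    if granularity == "variety":
        return iso639_3
    raise ValueError(f"unknown granularity: {granularity}")


print(locale_tag("fuf"))             # uses the shared tag
print(locale_tag("fuf", "variety"))  # uses the variety's own code
```

An intermediate level (say, western versus eastern) could be added later without changing callers, which is the kind of flexibility I have in mind.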
In the case of Wikipedia, one might imagine a solution for Fula being either
to let it evolve as is on ff.wikipedia.org and see, or perhaps to prescribe
something along the lines of a gateway at ff.wikipedia.org and some
flexibility for dialect-specific text under that. Just a couple of ideas,
which probably ought to be considered along with other aspects of
localization, language policies, linguistics, etc. And whatever that
solution is, it might not be an appropriate model in the case of another
language like Runyakitara. Or Manding.
Since I also speak Bambara (dooni) let me suggest that the Manding tongues
also present another somewhat particular and complicated picture not
addressed for all uses by any of the ISO-639 codes. There is one ISO-639-1
code (bm for Bambara), 4 ISO-639-2 codes (in addition to bam for Bambara,
there is dyu for Jula which some treat as almost the same as Bambara, man
for "Mandingo" which ISO-639-3 treats as a "macrolanguage" grouping of
Maninka and Mandinka tongues, and the new nqo for Nko which is a script, a
social movement, and an effort to standardize Manding language). Obviously
none of this was thought through beforehand except for the ISO-639-3 coding
of about a dozen languages. Where does an effort like Wikipedia turn for
answers? It may again be a matter of working with diverse expert opinions
and consulting available categories to find a solution that responds to the
Debbie brought up ISO-639-6 which will also be well thought through but it
is even more specific. There is also ISO-639-5 on the other end of the scale
and I'm curious about the anticipated relationship between it and -1&2. But
these are a ways down the road. In the end maybe the overall system will
accommodate flexible responses, and if so that's all the more reason not to
limit future options.
This has rambled on rather more than I intended. Let me use a different
analogy. When putting a piece of equipment together one learns early on not
to tighten the screws or bolts all the way before you get the whole thing
assembled. This situation seems somewhat analogous. I don't think it is
entirely clear how all these coding systems will actually line up with each
other or (more problematic) with the complex realities they are intended to
portray. So any application of them needs to have some flexibility, or
perhaps to be machined a bit here and there to work.
(I may be way off but that's my inflated 2 cents worth)
From: ietf-languages-bounces at alvestrand.no
[mailto:ietf-languages-bounces at alvestrand.no] On Behalf Of Debbie Garside
Sent: Sunday, November 12, 2006 1:17 PM
To: gerardm at wiktionaryz.org; ietf-languages at iana.org
Subject: RE: Wikimedia language codes
Your need is for ISO 639-6. This part of the ISO 639 family is due for
publication in January 2008.
Editor ISO DIS 639-6
> -----Original Message-----
> From: ietf-languages-bounces at alvestrand.no
> [mailto:ietf-languages-bounces at alvestrand.no] On Behalf Of Gerard
> Sent: 12 November 2006 18:08
> To: ietf-languages at iana.org
> Subject: Wikimedia language codes
> I have said a few things in another mail thread and I think it is
> helpful when I explain what I am looking for and what my current
> issues are. In this mail I will only address needs that we have in the
> Wikimedia Foundation.
> *==Wikimedia Foundation==*
> The Wikimedia Foundation has at this moment in time exactly 250
> different Wikipedia projects. Some of them have a code that is
> incompatible with any ISO-639 code of any version.
> There are projects that have codes that are squatting on existing
> ISO-639 codes. There are codes that have been made up that currently
> do not trespass on what are the codes of other languages however, I
> ISO-639 codes. My understanding is that it is not permitted to use
> codes that can be mistaken for valid codes.
> As there is now a "language sub-committee" in the Wikimedia
> Foundation, and as it is our brief to come up with recommendations for
> the creation of new projects and as the CTO of the Wikimedia
> Foundation is not pleased with this situation, one of the tasks in
> front of us is to come up with the appropriate codes for the existing
> projects. This is not simple and it is certainly not straightforward.
> One of the disputes is about the Belarusian Wikipedia that has been
> squatted by people who insist on using an orthography that is not the
> official one. There is a vibrant group of Belarusians using the official
> orthography that wants to claim the same domain. This is one among
> many; most are largely political.
> One of our problems is not solved because you do not consider the
> ISO-639-3 "official". This is the existence of a Wikipedia in
> What we do understand is that none of the ISO-639-3 codes will ever
> be used other than for its defined purpose.
> An often recurring theme in our request for new projects is that
> people claim that something is a language. It happens regularly that
> the proponents point to what should be amounts of impressive content
> either in archives, libraries on the Internet, all stuff that is to
> most of us gobbledygook.
> Often it is claimed that they have applied for recognition for their
> language. It does not make sense to request it from anyone but
> Ethnologue as the ISO-639-2 is at its end of life.
> There was some earlier discussion of the Min-Nan language on this
> mailing list. For your information both the Min-Nan Wiktionary and
> Wikipedia are not in either the Hant or the Hans script; they use
> Latn. When you start off from zh as the basis you insist on and equally
> the people who write Min-Nan without exception use Latn, the code
> zh-nan-Latn is not logical at all. NB these are really active
> For the Wikimedia Foundation there are a number of options;
> * We use our WMF language codes internally and externally. This is,
> imho, from a standards point of view a worst-case scenario.
> * We use our codes internally and externally we advertise the
> "official" codes.
> * We sanitise our codes so that there is at least no conflict with
> the ISO-639 codes. We use them internally and we advertise the
> "official" codes.
> * We move away from our current codes and only use "official" codes
> both internally and externally.
> It is as difficult to make the Wikimedia Foundation move as it is to
> get movement about Standards I suspect. I think we need a plan for how
> this can be solved. There are at least two lists I would like to have
> that would help:
> * A list with all the ISO-639 codes (1, 2 and 3) and the codes
> that these languages have under RFC 4646.
> * A list with the WMF language codes and the language codes under
> RFC 4646.
> I am sure that the first list exists. With this list it is possible to
> compile the second list. For some WMF language codes we may need to
> ask for tags to identify them properly by their dialect, orthography
> or whatever makes them special.
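The comparison of the two lists described above could be sketched roughly as follows. This is an assumption-laden illustration, not an existing tool: the function name, the category labels, and the tiny sample registry are made up for the example, and comparing language names by exact string match is a simplification.

```python
import re

# Two or three lowercase letters: the shape of an ISO 639-1 or 639-2/-3 code.
ISO_SHAPE = re.compile(r"[a-z]{2,3}")


def classify(project_code, project_language, iso_names):
    """Classify a project code against a registry of ISO 639 codes.

    iso_names maps ISO 639 code -> language name (supplied by the caller).
    """
    if project_code in iso_names:
        if iso_names[project_code] == project_language:
            return "matches"
        return "squatting"  # a valid code, but assigned to another language
    if ISO_SHAPE.fullmatch(project_code):
        return "lookalike"  # made up, yet could be mistaken for an ISO code
    return "made-up"  # made up and not shaped like an ISO code


# Illustrative sample only; a real run would load the full code tables.
SAMPLE_ISO_NAMES = {"ff": "Fula", "bm": "Bambara", "als": "Tosk Albanian"}

for code, language in [
    ("ff", "Fula"),
    ("als", "Alemannic"),
    ("zh-min-nan", "Min-Nan"),
]:
    print(code, classify(code, language, SAMPLE_ISO_NAMES))
```

The "squatting" and "lookalike" cases are the ones that would need sanitising under the third option above.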
> Gerard Meijssen aka GerardM
> Ietf-languages mailing list
> Ietf-languages at alvestrand.no