Wikimedia language codes
dzo at bisharat.net
Mon Nov 27 03:17:06 CET 2006
Hi Peter, hope you're feeling better. Replies in text.
> -----Original Message-----
> From: Peter Constable [mailto:petercon at microsoft.com]
> Sent: Tuesday, November 21, 2006 6:24 PM
> To: Don Osborn; John Cowan
> Cc: ietf-languages at iana.org
> Subject: RE: Wikimedia language codes
> I've been sick, so a little slow in responding.
> > -----Original Message-----
> > From: ietf-languages-bounces at alvestrand.no [mailto:ietf-languages-
> > bounces at alvestrand.no] On Behalf Of Don Osborn
> > Sent: Monday, November 13, 2006 9:40 AM
> > What concerns me first of all is that bm/bam is so close to dyu that
> > (1) for instance a localization effort in Burkina Faso is talking
> > about Bambara (bm) rather than Jula (dyu), and that (2) there is no
> > code to cover that.
> This touches on one of the challenges in language coding: Languages are
> not static objects with hard edges. They have fuzzy boundaries, and
> they evolve over time. When language *usage* is well established,
> things are a bit easier; e.g. English is a heavily used language and
> has been so for a long time, and so there is a conventional notion of
> 'English' and almost always agreement on when a document can be
> considered to be in 'English'. In many parts of the world where
> development has happened at a slower pace, the language usage situation
> may still very much be unfolding. That can mean that the varieties in
> use are still evolving with a common / 'language of wider
> communication' still coming into being, and it can also mean that
> perceptions as to where the divisions lie are still evolving.
Well put. Without getting into a long discussion, one implication I see from
this fuzziness (and the different ways one might categorize "language") is a
need for some flexibility in language coding. A hard and fast "this is the
language, and this is its code" approach may not be ideal in some contexts.
But some common reference is needed.
> Another factor in all this is knowledge about a network of language
> varieties and the usage thereof: at any point, if we have an inventory
> of language identifiers, that set reflects a certain amount of
> knowledge that may well be incomplete. With increased knowledge,
> there's every possibility that our thoughts on the inventory that is
> needed will change.
True. In some cases, though, there is a good degree of knowledge, but it is
either not accessed (given limits on time & resources for those analyzing the
network, as you put it, and limited involvement of other kinds of experts in
the language) or is interpreted in particular ways.
> All that to say, I don't know enough to comment on languages of Burkina
> Faso and its neighbors or about the history that led to the inventory
> that we have -- one that includes bm and dyu but nothing that
> encompasses the two of them. Perhaps the language varieties or their
> usage have been evolving, or perhaps knowledge has been incomplete, or
> perhaps both.
My guess is that there is a little of both, but more so that Bambara and
Jula/Dyula are close (Jula just means trader or merchant in Manding, though
Julakan is much more than a trade language now). The case of Khassonke and
Bambara, from what I've read, might be the reverse (that is, not so close
inherently, but with a lot more dialect leveling, or whatever it should be
called). In any event, I think that the experts among the native speakers
(trained linguists, researchers) should be more integrally involved in the
process of figuring out what to do with regard to issues like coding,
strategies for localization etc. Part of the reason they are not already has
to do with distance and communication issues (and some of the latter relate
to the prominence of French as the working language there while English is
the main working language for standards discussions).
This gets a bit off on a tangent, but let me mention that one of the
projects I would like to facilitate (or see done) would be a workshop in the
region to address such questions (under the guise of multidialect/multistate
languages in the region with perhaps two sections such as Manding and Fula).
> These are part of the reality that we face in language coding, and the
> challenge for us is how to go about coding in a way that makes
> allowance for that reality.
Can special projects such as what I just mentioned figure in such a process?
> Perhaps a macrolanguage entity would be useful in this particular case
> (again, I don't know enough to know if that is true or not). As John
> indicated, nothing prevents new macrolanguage entities from being
If it is a possibility, then it could be discussed by relevant experts,
> > Another concern - on a higher level - is that there is no code to
> > cover the macro-macrolanguage (if you will) that would include the
> > ("Mandingo") macrolanguage and the bm + dyu
> > macrolanguage-without-a-tag. This is not pure theory - the Manding
> > languages are close. Linguists will point out differences, speakers
> > will recognize them, but in some ways the ensemble is like Fula ff
> > but in a more concentrated geographic area of West Africa.
> The semantics of the identifier 'man' is an open issue that the JAC
> still needs to finalize (and one I'll be raising very soon). Currently,
> the draft code tables for 639-3 show man as a macrolanguage that
> encompasses seven individual languages, but that is just one possible
> proposal. Would it make sense to have this as a macrolanguage that
> encompasses other individual-language entries as well, including bm and
It might. In the absence of a workshop (!) I can pass this question on to
the Mande Studies list to sound out some people much more expert than I. It
will require that I explain things quite clearly (some won't need it; others
will benefit from a clearer context), but it will be good practice for any
kind of workshop if we can do it.
> > Akan/Twi/Fanti, all already in 639-1, has pretty much been settled.
> You say it has been settled: by whom and with what conclusion? (This is
> one of the open issues I need to get the JAC to make a decision on very
My presumption is based on my understanding from some communications over
recent years, on ISO 639-3 calling Akan a macrolanguage of the other two
(as memory serves), and on reading things like the following:
"Another example is the way differences in dialects which have been
magnified in the past almost to the point of language status are now played
down considerably (e.g. Twi and Fante which are now seen simply as dialects
of Akan; these two speech forms were listed as separate languages in the
list of nine languages approved for education in post-Independence Ghana)."
Ayo Bamgbose, "Pride and Prejudice in Multilingualism and Development," in
Fardon & Furniss 1994 African Languages, Development and the State (London:
> > But the extent to which new
> > "macrolanguages" - a category that is already by accident of history
> > (so to speak) under 639-2 - would be added to 639-3 but not the latter
> > creates more confusion (at least for this particular human).
> It is perhaps reasonable to consider what is in 639-2 an accident of
> history, but I wouldn't get hung up on that. Think of things this way:
What I meant was that the way things played out was an accident - no one
began with the idea of having something called "macrolanguage" AFAIK, but it
worked out to be necessary in the collision of -2 (which had its reasons for
being, if not any systematic logic [hope I don't get into trouble for
putting it that way]) and -3 (which has its own logic).
> - there is a single alpha-3 codespace
> - entities coded in that codespace include individual languages,
> macrolanguages and language collections
> - 639-2 happens to be a subset of those entities that is deemed to be
> of interest to some particular user communities
> Given that, appropriate questions to be considering are:
> - Are there individual languages missing from 639-3?
> - Are there cases in which a macrolanguage not currently in 639-3 would
> be useful for certain applications?
> - Are there entities in 639-3 or 639-5 that are not in 639-2 but that
> would be useful to the user communities to which 639-2 is targeted?
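The model described in the quoted points above (one alpha-3 codespace, three
kinds of entity, and 639-2 as a subset of interest) can be sketched as a small
data structure. This is only an illustration under stated assumptions: the
class and function names are invented here, the entries are a handful of codes
mentioned in this thread rather than an authoritative table, the Akan
membership follows the recollection given in this message, and "qaa" is used
purely as a hypothetical placeholder for an entity outside 639-2.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    """The three kinds of entity said to share the alpha-3 codespace."""
    INDIVIDUAL = "individual language"
    MACROLANGUAGE = "macrolanguage"
    COLLECTION = "language collection"


@dataclass(frozen=True)
class Entity:
    code: str                 # alpha-3 identifier
    name: str
    scope: Scope
    in_639_2: bool            # illustrative flag: is the entity in the 639-2 subset?
    members: tuple = ()       # member codes, only meaningful for macrolanguages


# Illustrative sample data only, not an authoritative code table.
CODESPACE = {e.code: e for e in [
    Entity("bam", "Bambara", Scope.INDIVIDUAL, True),
    Entity("dyu", "Dyula", Scope.INDIVIDUAL, True),
    Entity("twi", "Twi", Scope.INDIVIDUAL, True),
    Entity("fat", "Fanti", Scope.INDIVIDUAL, True),
    Entity("aka", "Akan", Scope.MACROLANGUAGE, True, ("twi", "fat")),
    # Hypothetical placeholder for an entity that is not part of 639-2.
    Entity("qaa", "hypothetical entry", Scope.INDIVIDUAL, False),
]}


def subset_639_2(codespace):
    """Codes that belong to the 639-2 subset of the single codespace."""
    return sorted(c for c, e in codespace.items() if e.in_639_2)


def macrolanguages_covering(code, codespace):
    """Macrolanguage codes whose member list includes the given code."""
    return sorted(
        m.code for m in codespace.values()
        if m.scope is Scope.MACROLANGUAGE and code in m.members
    )
```

With this sample data, macrolanguages_covering("twi", CODESPACE) finds "aka",
while the same query for "bam" or "dyu" comes back empty, which is the gap
raised earlier in the thread: no entity covers the two of them together.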
Good points and approach. In effect, alpha-3 is only that to the computer
(as it were) regardless of what section of ISO-639 we organize them under.
As such, definitions and codes from 2, 3, and 5 may be appropriate for