Wikimedia language codes

Wed Nov 22 00:24:09 CET 2006

I've been sick, so a little slow in responding.

> -----Original Message-----
> From: ietf-languages-bounces at alvestrand.no [mailto:ietf-languages-
> bounces at alvestrand.no] On Behalf Of Don Osborn
> Sent: Monday, November 13, 2006 9:40 AM

> What concerns me first of all is that bm/bam is so close
> to dyu that (1) for instance a localization effort in Burkina Faso is
> talking about Bambara (bm) rather than Jula (dyu), and that (2) there is no
> code to cover that.

This touches on one of the challenges in language coding: languages are not static objects with hard edges. They have fuzzy boundaries, and they evolve over time. When language *usage* is well established, things are a bit easier. English, for example, has been heavily used for a long time, so there is a conventional notion of 'English' and almost always agreement on whether a document can be considered to be in 'English'. In many parts of the world where development has happened at a slower pace, the language usage situation may still very much be unfolding. That can mean that the varieties in use are still evolving, with a common 'language of wider communication' still coming into being, and it can also mean that perceptions of where the divisions lie are still evolving.

Another factor in all this is how much is known about a network of language varieties and their usage: at any point, an inventory of language identifiers reflects a certain amount of knowledge, and that knowledge may well be incomplete. With increased knowledge, there's every possibility that our thoughts on the inventory that is needed will change.

All that to say, I don't know enough to comment on languages of Burkina Faso and its neighbors or about the history that led to the inventory that we have -- one that includes bm and dyu but nothing that encompasses the two of them. Perhaps the language varieties or their usage have been evolving, or perhaps knowledge has been incomplete, or perhaps both.

These are part of the reality that we face in language coding, and the challenge for us is how to go about coding in a way that makes allowance for that reality.

Perhaps a macrolanguage entity would be useful in this particular case (again, I don't know enough to say whether that is true). As John indicated, nothing prevents new macrolanguage entities from being encoded.

> Another concern - on a higher level - is that there is no code to cover the
> macro-macrolanguage (if you will) that would include the man ("Mandingo")
> macrolanguage and the bm + dyu macrolanguage-without-a-tag. This is not pure
> theory - the Manding languages are close. Linguists will point out
> differences, speakers will recognize them, but in some ways the ensemble is
> like Fula ff but in a more concentrated geographic area of West Africa.

The semantics of the identifier 'man' are an open issue that the JAC still needs to finalize (and one I'll be raising very soon). Currently, the draft code tables for 639-3 show man as a macrolanguage that encompasses seven individual languages, but that is just one possible proposal. Would it make sense to have this as a macrolanguage that encompasses other individual-language entries as well, including bm and dyu?
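
To make the question concrete, here is a minimal sketch of how an application might check whether two individual-language codes share a covering macrolanguage. Everything in it is an assumption for illustration: the table layout, the function name `common_macrolanguages`, and the member list, which is a deliberately partial sample (mnk and emk only) rather than the seven entries in the draft tables. The "widened" table models the hypothetical proposal of adding bam and dyu under man; it is not a decision anyone has made.

```python
# Hypothetical, partial macrolanguage table: macrolanguage code -> member codes.
# Only two of the members of 'man' are listed; this is NOT the full draft table.
DRAFT = {
    "man": {"mnk", "emk"},
}


def common_macrolanguages(table, a, b):
    """Return the macrolanguage codes whose member sets contain both a and b."""
    return {macro for macro, members in table.items()
            if a in members and b in members}


# Hypothetical proposal: widen 'man' to also encompass bam and dyu.
WIDENED = {"man": DRAFT["man"] | {"bam", "dyu"}}

if __name__ == "__main__":
    print(common_macrolanguages(DRAFT, "bam", "dyu"))    # nothing covers both
    print(common_macrolanguages(WIDENED, "bam", "dyu"))  # 'man' covers both
```

Under the sample draft table, a Bambara/Jula localization effort has no shared identifier to fall back on; under the widened table it would have one.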

> Akan/Twi/Fanti, all already in 639-1, has
> pretty much been settled. 

You say it has been settled: by whom and with what conclusion? (This is one of the open issues I need to get the JAC to make a decision on very soon.)

> But the extent to which new
> "macrolanguages" - a category that is already by accident of history (so to
> speak) under 639-2 - would be added to 639-3 but not the latter creates more
> confusion (at least for this particular human). 

It is perhaps reasonable to consider what is in 639-2 an accident of history, but I wouldn't get hung up on that. Think of things this way:

- there is a single alpha-3 codespace
- entities coded in that codespace include individual languages, macrolanguages and language collections
- 639-2 happens to be a subset of those entities that is deemed to be of interest to some particular user communities
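
The three points above can be sketched as a small data model: one table keyed by alpha-3 code, a scope on each entry, and 639-2 expressed as a filter over that table rather than as a separate list. The class and field names (`Scope`, `Entry`, `in_639_2`) are hypothetical, and the five sample rows and their flags are illustrative assumptions that should be checked against the published code tables before being relied on.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    INDIVIDUAL = "individual language"
    MACROLANGUAGE = "macrolanguage"
    COLLECTION = "language collection"


@dataclass(frozen=True)
class Entry:
    code: str        # alpha-3 identifier, unique across the whole codespace
    name: str
    scope: Scope
    in_639_2: bool   # assumed flag: is this entity also part of 639-2?


# Illustrative sample only; not a complete or authoritative table.
SAMPLE = [
    Entry("bam", "Bambara", Scope.INDIVIDUAL, True),
    Entry("dyu", "Dyula", Scope.INDIVIDUAL, True),
    Entry("man", "Mandingo", Scope.MACROLANGUAGE, True),
    Entry("mnk", "Mandinka", Scope.INDIVIDUAL, False),
    Entry("nic", "Niger-Kordofanian languages", Scope.COLLECTION, True),
]

# A single codespace: each alpha-3 code maps to exactly one entity.
CODESPACE = {entry.code: entry for entry in SAMPLE}


def part2_subset(codespace):
    """Return the codes of the entities flagged as belonging to 639-2."""
    return {code for code, entry in codespace.items() if entry.in_639_2}


if __name__ == "__main__":
    print(sorted(part2_subset(CODESPACE)))
```

Seen this way, asking whether something "belongs in 639-2" is a question about which flag to set on an entity, not about which codespace it lives in.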

Given that, appropriate questions to be considering are:

- Are there individual languages missing from 639-3?

- Are there cases in which a macrolanguage not currently in 639-3 would be useful for certain applications?

- Are there entities in 639-3 or 639-5 that are not in 639-2 but that would be useful to the user communities to which 639-2 is targeted?

Peter