Wikimedia language codes

Mon Nov 13 18:40:00 CET 2006

Hi John, thanks for the feedback. A few follow-ups inline; the last one is
a bit long-winded...

> -----Original Message-----
> From: John Cowan [mailto:cowan at ccil.org]
> Sent: Monday, November 13, 2006 1:47 AM
> To: Don Osborn
> Cc: ietf-languages at iana.org
> Subject: Re: Wikimedia language codes
> 
> Don Osborn scripsit:
> 
> > There are cases where I think the ISO-639-3 codes would definitely not
> > be ideal for localization or for Wikipedia editions, for instance.
> > Maybe ISO-639-1/-2, or -5 would be a more appropriate grouping.
> 
> You will always be free to use broader 639-1/2 codes (1 is always
> preferred to 2 when both are available) rather than narrower 639-3
> codes if you want.  The question of 639-5 has not yet arisen, and as
> far as I know no one is proposing to add it to RFC 4646bis or any
> successor.  If you'd care to make the case for it, I'd like to see it.

I'm trying to find out more about what has already been proposed for 639-5.
Any references would be appreciated (Googling gets articles that mention it,
but not much more).

...
> 
> > In fact, it turns out that since 1990 a standardized version for all 4
> > has been developed called Runyakitara. It is not yet coded in -2 or -3
> > (and actually might be considered a "macrolanguage" and thus a logical
> > candidate for ISO-639-2). This information is not apparent from any of
> > the available codes.
> 
> Just to clarify, although all existing macrolanguages have 639-2 codes,
> there's no requirement that future macrolanguages be encoded in 639-2
> as well; Runyakitara might be added as a macrolanguage in a future
> version of 639-3.

This is interesting. As a relative newcomer to all this my impression was
that there was a sort of hierarchical logic inherited in part from the
pre-existence of 639-1&2 and reflected in the brief descriptions of the 639
ensemble. Your clarification of the nesting of tags in the case of Fula and
a similar mention with regard to Arabic seem to show that logic in action.

On the other hand I suppose from the computer system's perspective it
doesn't matter under what ISO-639 list the humans organize the alpha-3's but
that raises the question about the overall schema envisioned. (More below)

...
> 
> > Since I also speak Bambara (dooni) let me suggest that the Manding
> > tongues also present another somewhat particular and complicated
> > picture not addressed for all uses by any of the ISO-639 codes. There
> > is one ISO-639-1 code (bm for Bambara), 4 ISO-639-2 codes (in addition
> > to bam for Bambara, there is [...]
> 
> To clarify again:  bm means the same as bam, and therefore 'bam' is not
> a valid RFC 4646 language subtag.

Sorry if my excess verbiage obscured my point (not a first, that). Let me
try with more verbiage, hopefully better organized...

I am aware that bm & bam are the same and that the alpha-2 from ISO-639-1 is
used in preference. What concerns me first of all is that bm/bam is so close
to dyu that (1) for instance a localization effort in Burkina Faso is
talking about Bambara (bm) rather than Jula (dyu), and that (2) there is no
code to cover that. 
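
As a rough illustration of the preference rule mentioned above (use the
alpha-2 code when one exists, otherwise keep the alpha-3 code), here is a
minimal sketch in Python. The table is a small hand-written subset and the
function name preferred_subtag is hypothetical; a real tool would read the
IANA Language Subtag Registry rather than hard-code anything.

```python
# Illustrative subset only: ISO 639-2 alpha-3 codes that also have an
# ISO 639-1 alpha-2 equivalent. This table is an assumption made for the
# example, not a complete or authoritative list.
ALPHA3_TO_ALPHA2 = {
    "bam": "bm",  # Bambara
    "ful": "ff",  # Fula
    "aka": "ak",  # Akan
    "kin": "rw",  # Kinyarwanda
    "run": "rn",  # Kirundi
}


def preferred_subtag(code: str) -> str:
    """Return the alpha-2 code when one exists, else the code unchanged."""
    code = code.lower()
    return ALPHA3_TO_ALPHA2.get(code, code)


print(preferred_subtag("bam"))  # bm
print(preferred_subtag("dyu"))  # dyu (Jula has no alpha-2 code)
```

Note that nothing in such a table relates bm to dyu: the lookup works on
each code in isolation, which is the gap described in point (2).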

Another concern - on a higher level - is that there is no code to cover the
macro-macrolanguage (if you will) that would include the "man" (Mandingo)
macrolanguage and the bm + dyu macrolanguage-without-a-tag. This is not pure
theory - the Manding languages are close. Linguists will point out
differences, speakers will recognize them, but in some ways the ensemble is
like Fula (ff), only in a more concentrated geographic area of West Africa.
Moreover, N'ko (nqo) proposes in part a role as a sort of standard Manding
tongue, underscoring that historic and linguistic unity.

The practical implications are thus: 
1. One might envision some limited localizations in a pan-Manding version:
OpenOffice or MSOffice (it may be that the Manding variations are not as
daunting as those encountered with Inuktitut which is the target of an MS
localization, for instance)? A Wikipedia gateway to content in the more
specific Manding languages? The only code that might apply - and in present
use only in connection with a particular script - is nqo in 639-2. There is
no other code for Manding (see also #2, next).
2. There already seems to be a need for something to cover bm+dyu. This
would be on the same level as the man macrolanguage tag. These might be
appropriate for localization work in (very roughly) the western and eastern
parts of the Mandingophone range.
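
To make the two points above concrete, here is a small sketch in Python of
how a tool might look for a code that covers a set of languages. The table
is deliberately partial and hypothetical (only two members of "man" are
listed, for illustration), and covering_code is an invented helper, not
part of any standard or library.

```python
from typing import Dict, Optional, Set

# Hypothetical, partial table: macrolanguage code -> some of its members.
# Only the "man" (Mandingo) grouping is shown; the member list is
# incomplete and included purely for illustration.
MACROLANGUAGES: Dict[str, Set[str]] = {
    "man": {"mnk", "emk"},  # Mandinka, Eastern Maninkakan (partial list)
}


def covering_code(languages: Set[str]) -> Optional[str]:
    """Return a macrolanguage code whose members include every given
    language, or None when no such grouping is registered."""
    for macro, members in MACROLANGUAGES.items():
        if languages and languages <= members:
            return macro
    return None


print(covering_code({"mnk", "emk"}))  # man
print(covering_code({"bm", "dyu"}))   # None
```

The second call returns None because no grouping contains both bm and dyu,
and no entry at all sits above "man" and bm + dyu together.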

There are other such cases in Africa that vary mainly on the specifics.
Runyakitara has been mentioned. Akan/Twi/Fanti, all already in 639-1, has
pretty much been settled. Kwanyama/Ndonga are very close and have full sets
of ISO-639 codes - there is not currently to my knowledge an interest in
treating them as a unit in any IT applications, but if there were to be, it
is a logical grouping that would merit appropriate recognition. Kirundi and
Kinyarwanda are virtually the same, so I understand, but for reasons of
state retain entirely separate codings (with no overarching reference that
would include them and closely related varieties in southern Uganda). And so
on...

Some of the IT uses are currently hypothetical I admit, but the linguistic
realities are there now (some research by the South African based NGO,
CASAS, focuses on understanding those better), and the IT ones are on the
way. How the language coding accommodates these realities is something I'd
hope to answer sooner rather than later - an ad hoc approach to adding codes
when requested may work, but arguably works better with a clear system.

It may be that 639-5 is a non-issue (loose paraphrase of an offline message
from another group member and noting your mention that as far as you know
"no one is proposing to add it to RFC 4646bis or any successor") but it
would have seemed to be a useful element in the ISO-639 schema given the
kinds of situations I've outlined. In some ways there seems to be an overlap
with 639-2 - which is not news I realize. But the prospect that new
"macrolanguages" - a category that is already by accident of history (so to
speak) under 639-2 - would be added to 639-3 but not to 639-2 creates more
confusion (at least for this particular human). So codes for groupings of
languages ("languages" by the definition used for 639-3) might exist in
639-2 or be newly registered in 639-3 while on paper they are the subject of
639-5?

Above I've cited examples only from Africa - though with ~1/3 of the world's
languages, "only" is of course not meant to minimize its importance. Are there
points of reference in other regions that might help clarify the issues
raised?

Clearly I need to do more reading. I appreciate your patience in responding
to these kinds of questions.

(To group members): Send your suggested reading lists with URLs to me
offline. TIA.

All the best.

Don