ISO 639 and other language identifiers

Caoimhin O Donnaile
Tue, 7 May 2002 19:36:07 +0100 (BST)

John Cowan said:

> I think it is Tai-Kadai where there is pretty good agreement which
> languages belong to the family and which don't, but there is a major
> faction fight about how those languages are to be grouped.

In such cases I suppose the database could be "conservative" and
point straight from the language to "Tai-Kadai" as the parent node,
until such time as there were better agreement and intermediate
nodes could be added.

> Everyone (or almost everyone) agrees that English and Frisian form a
> sister group (there are no other languages at the same level, unless
> you count Scots as distinct), but what conceivable audience either
> machine or human would be indifferent as between English and Frisian
> but absolutely exclude Dutch?

English and Frisian is an unusual grouping to want to use, I agree.
(And actually the current version of the Ethnologue does not group
them - the "North Sea" node which was in the previous version has gone -
although this does illustrate the point that nodes are subject to
change.)

However, English plus Scots is a label which people might very well
want to use.  (Scots already has a separate code in both the
Ethnologue and ISO 639-2.)  Someone searching for dialect words might
want to search over both.  Or someone might have a collection of
text or speech samples from Scotland in which it was unclear which
samples could best be described as "Scots" and which as "Scottish
English".  He might be happiest labelling them with the nearest node
which embraced both (node 765 in the current Ethnologue).
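
As a rough sketch of how "the nearest node which embraced both" could
be looked up, assuming a simple child-to-parent table: only node 765
and the codes "eng", "sco" and "fry" are real, and the node names
"germanic" and "indo-european" and the flattened tree shape are made
up for illustration.

```python
# Hypothetical child -> parent table.  "765" is the node number quoted
# from the Ethnologue; "germanic" and "indo-european" are invented labels
# and the tree is deliberately flattened.
PARENT = {
    "eng": "765",
    "sco": "765",
    "765": "germanic",
    "fry": "germanic",
    "germanic": "indo-european",
}

def ancestors(code):
    """Return the chain from code up to the root, starting with code itself."""
    chain = [code]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def nearest_common_node(a, b):
    """Return the lowest node that embraces both a and b, or None."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    return None

print(nearest_common_node("eng", "sco"))  # 765
print(nearest_common_node("eng", "fry"))  # germanic
```

A collection of mixed Scots and Scottish English samples could then be
labelled with whatever nearest_common_node("eng", "sco") returns.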

Languages too, like nodes, are subject to disagreement and change,
but that does not stop us from trying to assign codes to them.

According to the current (and previous) version of the Ethnologue,
Frisian consists of three languages: Western Frisian, Northern
Frisian and Eastern Frisian.  ISO 639-2 currently has only one code,
"fry", even though ISO 639-2 makes finer language distinctions than
this at times, as between French and Walloon, and between Croatian,
Bosnian and Serbian.  What happens when the requests come in for
separate language codes?  Does "fry" then become deprecated?  Or does
it remain valid as a useful grouping?

Even more dramatic is the case of Nedersaksisch, which has only one
code, "nds", in ISO 639-2, but 13 language codes in the current
Ethnologue.  It would be very tedious to have to specify all 13 codes
in a search.
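
To illustrate how a single node code could stand in for a long list
in a search, here is a minimal sketch that expands a code into the
member codes beneath it.  The member identifiers are placeholders,
not real Ethnologue codes, and the table layout is only an assumption.

```python
# Hypothetical node -> members table.  "fry" and "nds" are the ISO 639-2
# codes mentioned above; every member identifier is a placeholder.
CHILDREN = {
    "fry": ["frisian-west", "frisian-north", "frisian-east"],
    "nds": [f"nds-member-{i:02d}" for i in range(1, 14)],
}

def expand(code):
    """Return the leaf codes under code; a code with no members is its own leaf."""
    kids = CHILDREN.get(code)
    if not kids:
        return [code]
    leaves = []
    for kid in kids:
        leaves.extend(expand(kid))
    return leaves

print(len(expand("nds")))  # 13
print(expand("fry"))
```

A search engine given "nds" could call expand("nds") and match any of
the 13 members, without the user having to list them.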

What I am saying is that codes for nodes are already in use to some
extent and that it would be better to recognise this and to formalise
it.  I'm also suggesting that an online database with a stable address
and search syntax might be a good way to help browsers and search
engines and other software make sense of all this.

I suppose the database could be very conservative and merely record
that "Western Frisian" is a subset of "Frisian", which is a subset of
"Germanic", which is a subset of "Indo-European".  However, I think
that intermediate node information could be very useful and it would
be a shame if it were not included too, even though the nodes would
probably have to carry various degrees of "speculative" warnings.
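
One way the "speculative" warnings might work, sketched under the
assumption that each intermediate node simply carries a flag: a
conservative client skips flagged nodes, while others see the full
chain.  The table is illustrative only; "North Sea" is placed directly
under "Germanic" purely to keep the example short.

```python
# Hypothetical child -> parent table, plus a set of nodes flagged as
# speculative.  The placement of "North Sea" is a simplification.
PARENT = {
    "Western Frisian": "Frisian",
    "Frisian": "North Sea",
    "English": "North Sea",
    "North Sea": "Germanic",
    "Germanic": "Indo-European",
}
SPECULATIVE = {"North Sea"}

def lineage(name, include_speculative=True):
    """Return name followed by its ancestors, optionally skipping flagged nodes."""
    chain = [name]
    node = name
    while node in PARENT:
        node = PARENT[node]
        if include_speculative or node not in SPECULATIVE:
            chain.append(node)
    return chain

print(lineage("Western Frisian"))
print(lineage("Western Frisian", include_speculative=False))
```

With include_speculative=False the result is the conservative chain
Western Frisian, Frisian, Germanic, Indo-European; with the default it
also includes the flagged "North Sea" node.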