ISO 639 and other language identifiers

Wed, 8 May 2002 00:59:00 +0100 (BST)

John Cowan wrote:

> However, there again is no reason, given a sample of text or sound or
> video, to classify it as merely Tai-Kadai, nor is there any reasonable
> human or computer process that can accept any and all Tai-Kadai texts
> etc.  So the node, while unimpeachably correct, is not useful.

Say a library acquires a collection of texts or sound samples from
south-east Asia.  It might not have the time or resources to determine
precisely which language they belong to, in which case it would probably
be better to label them as "Tai-Kadai" if this is known, rather than
to leave the language completely unspecified.

A database carrying hierarchic and node information and retaining
basic information on obsoleted codes would also mean that the
language tag information on documents would not suddenly become
completely useless if the language code became obsolete (a common
occurrence).  You would at least be able to determine the nearest
existing subfamily which included the obsoleted code.  A library
might use this to automatically recode documents with a subfamily
code if it did not have the resources to work out the appropriate
language code which now applied.
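
A minimal sketch of the kind of lookup this would allow, in Python.
The table layout, the function name and all of the codes are invented
for illustration; none of them come from ISO 639 or the Ethnologue.

```python
# Hypothetical code table: each code (language or subfamily node) records
# its parent node and whether the code is obsolete. All codes are made up.
CODES = {
    "x-family": {"parent": None, "obsolete": False},
    "x-subfam": {"parent": "x-family", "obsolete": False},
    "x-oldlang": {"parent": "x-subfam", "obsolete": True},
    "x-newlang": {"parent": "x-subfam", "obsolete": False},
}


def nearest_valid(code):
    """Return the code itself if it is still current, otherwise the nearest
    ancestor node that is current, or None if nothing suitable is found."""
    seen = set()  # guards against accidental cycles in the table
    while code is not None and code not in seen:
        seen.add(code)
        entry = CODES.get(code)
        if entry is None:
            return None  # unknown code: nothing to fall back on
        if not entry["obsolete"]:
            return code
        code = entry["parent"]
    return None
```

With this table, a document tagged with the obsoleted "x-oldlang" could
be recoded automatically as "x-subfam", without anyone having to decide
which of the newer language codes applies.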

If a language code becomes obsolete because this language is
split up into separate languages, under the present scheme of things
all documents labelled with that code suddenly have no valid
language tag.  If nodes had codes which were part of the same
namespace as languages, then sometimes if a language got split
up, it would simply become a subfamily, the same code could be
retained, and the documents would still have a valid tag.
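
To illustrate, here is a rough Python sketch of a search over a tree in
which a former language code has been kept as a subfamily node.  The
codes, the PARENT table and the function names are all hypothetical.

```python
# Hypothetical tree: "x-lang" used to be a language code and has been kept
# as a subfamily node after being split into "x-lang-a" and "x-lang-b".
PARENT = {
    "x-fam": None,
    "x-lang": "x-fam",
    "x-lang-a": "x-lang",
    "x-lang-b": "x-lang",
    "x-other": "x-fam",
}

# Documents as (name, language tag) pairs; "old-text" was tagged before
# the split and has never been recoded.
DOCS = [
    ("old-text", "x-lang"),
    ("new-text", "x-lang-a"),
    ("unrelated", "x-other"),
]


def is_within(tag, query):
    """True if tag equals query or lies anywhere below it in the tree."""
    seen = set()  # guards against accidental cycles in the table
    while tag is not None and tag not in seen:
        if tag == query:
            return True
        seen.add(tag)
        tag = PARENT.get(tag)
    return False


def search(documents, query):
    """Names of all documents tagged with query or one of its descendants."""
    return [name for name, tag in documents if is_within(tag, query)]
```

Here search(DOCS, "x-lang") finds both the document tagged before the
split and the one tagged with the newer "x-lang-a", so the old tag stays
valid and there is no need to list every descendant code in the query.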

I would find it very useful indeed for my European Minority Languages
page: http://www.smo.uhi.ac.uk/saoghal/mion-chanain/en/
if there were standard codes for subfamilies.  I want to be able to
hang information onto subfamily nodes to get it as near as possible
to where it will be useful.  For example, I provide a link to Orbis Latinus
under "Romance".  I link from the subfamily nodes to the Ethnologue,
and find these links very useful in my own work.  At present I use
the Ethnologue's numeric node numbers to link to the Ethnologue,
and abbreviated names for internal links in the page.  As Peter
Constable pointed out, the Ethnologue's node numbers are totally
transient and unsupported and I'll have to change all the links when
the next edition of the Ethnologue comes out.  And I have to be very
careful with my abbreviated names, because the Ethnologue's subfamily
names may be identical to language names, and there are even two
different subfamilies both called "Allemannic".  I would much rather
have unique subfamily codes which I could use, even though I realise
that these too will of course be subject to change and obsolescence.

> > Even more dramatic is the case of Nedersaksisch, which has only one
> > code, "nds", in ISO 639-2, but 13 language codes in the current
> > Ethnologue.  It would be very tedious to have to specify all 13 codes
> > in a search.

> I'm confused.  ISO "nds" is precisely SIL's SAX, which has many
> names.  Why do you think it is 13 languages?

Sorry, you are right - although the Ethnologue code is SXN, not SAX.
What confused me is that the Ethnologue has a subfamily called
"Low Saxon", which contains 13 languages, one of which is
"Saxon, Low", with a code of SXN.  (Others are Westphälisch,
Gronings, Plautdietsch, etc.)  I see that the analysis at:

  http://www.ethnologue.com/iso639/analysis.asp

maps nds to SXN - although I am not sure (maybe there is some way
of checking?) whether this narrow definition is what was intended
when nds was registered in ISO 639-2 two years ago.

Caoimhín