ISO 639 and other language identifiers

Tue, 7 May 2002 00:32:12 +0100 (BST)

I have no expertise other than that gained from maintaining a
"European minority languages" page on a voluntary basis, and am not
involved in any standards work, but here are my naive impressions
of what needs to be done:-

1. There needs to be an immediate mass registration of languages.
   I would just register all the languages in the Ethnologue.

2. Not only the "languages", but also the nodes in the language
   family trees need to be registered.  This is both to avoid
   the thorny political questions of what is and what isn't a language,
   and for convenience.  Someone with a good knowledge of Nedersaksisch
   (Low Saxon) should be able to specify that they are happy
   to accept web-pages in Nedersaksisch without having to specify
   all 13 languages which the Ethnologue currently divides it into.

3. For future flexibility and to avoid questions of what is and
   what isn't a language, the nodes and leaves should all be part
   of the same system - the same address space.  This would mean
   that nodes which contain only a single language would not be
   needed and would disappear - e.g. node 1267 and code JPN for
   Japanese in the current Ethnologue would merge.  Such nodes seem
   to exist only to maintain "nodes" and "languages" as separate
   systems.

4. The system should also include extinct and historical languages -
   e.g. "Middle English", "Middle Irish", "Old Irish", "Classical
   Latin", "Mediaeval Latin".

5. A hierarchical naming system is not possible.  Any hierarchy is
   too unstable and subject to change.  So the entities need to have
   distinct independent identifiers.

6. There are not enough three-letter combinations to cover all the
   above requirements.  I think a new set of four-letter codes
   should be devised on the same mnemonic principles as the
   current Ethnologue.  At the same time, the opportunity should
   be taken to use codes which are mnemonic in the language itself
   rather than in English - e.g. "DEUT" for German rather than "GER".
   (I believe that this is mostly only an issue for European
   languages, which were known to English speakers for long enough
   for separate English names to have developed, different from the
   indigenous names.)

7. The system of language codes needs to be backed up by an
   online database, like the present Ethnologue but better.
   It should be revised on a continuous basis.  The URL addresses
   should be guaranteed to be available "for all time".  They
   should be a bit simpler than the present Ethnologue addresses
   (".../show_family.asp?subid=1267", ".../show_language.asp?code=JPN").

8. As well as the textual "web-page" type information contained in
   the present Ethnologue, the online database must make available
   standardised "relational database" type information via SQL queries.
   If a web page is labeled as "Twents" and a user has declared that
   they are happy to accept "Nedersaksisch" then a browser should be
   able to determine from the database that "Twents" is a subset of
   "Nedersaksisch".  Because there is now so much port-blocking for
   security reasons, such SQL queries would probably have to be
   transmitted over HTTP.

9. As knowledge about languages increases, there are bound to be
   lots and lots of changes.  Old nodes and language codes will
   disappear and new ones be created.  Old codes must never be
   reused.  Although most of their information in the database may
   no longer be maintained, the code must still exist in the
   database, and one piece of information must be maintained -
   namely the closest existing node which the old code was a member of.
   E.g. even if the Low Saxon languages are reorganised and the
   code for Twents becomes deprecated, there must be a record in
   the database showing that it existed and is a member of the
   existing category "Nedersaksisch".
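
To make points 8 and 9 a bit more concrete, here is a rough Python
sketch of the two lookups a browser would need: "is this page's
language inside something the user accepts?" and "what is the closest
existing node for a deprecated code?".  It is only an illustration
under assumptions of my own: the in-memory table stands in for the
online database, and the four-letter codes "GMAN", "NSAK" and "TWNT",
the field names and the function names are all invented for the
example, not taken from the Ethnologue or any standard.

```python
# Hypothetical registry: languages and family nodes share one code space.
# Codes, field names and structure are invented for illustration only.
REGISTRY = {
    "GMAN": {"name": "Germanic", "parent": None, "deprecated": False},
    "NSAK": {"name": "Nedersaksisch", "parent": "GMAN", "deprecated": False},
    "TWNT": {"name": "Twents", "parent": "NSAK", "deprecated": False},
}


def lineage(code, registry):
    """Return the code followed by each enclosing node, up to the root."""
    chain = []
    while code is not None:
        if code in chain:
            raise ValueError("cycle in registry at " + code)
        chain.append(code)
        code = registry[code]["parent"]
    return chain


def accepts(page_code, accepted_codes, registry):
    """True if the page's code, or any node containing it, is accepted."""
    return any(c in accepted_codes for c in lineage(page_code, registry))


def closest_existing(code, registry):
    """Return the nearest non-deprecated code at or above the given code."""
    for c in lineage(code, registry):
        if not registry[c]["deprecated"]:
            return c
    return None


if __name__ == "__main__":
    # A page labelled Twents is acceptable to a user who accepts Nedersaksisch.
    print(accepts("TWNT", {"NSAK"}, REGISTRY))
    # The reverse does not hold: Nedersaksisch is not a subset of Twents.
    print(accepts("NSAK", {"TWNT"}, REGISTRY))
```

Because the deprecated record keeps its parent link, the same upward
walk serves both purposes; nothing in the sketch depends on the shape
of the hierarchy staying fixed, only on each record knowing its
current parent.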

Any comments?

Caoimhín Ó Donnaíle