ISO 639 and other language identifiers
Caoimhin O Donnaile
Tue, 7 May 2002 00:32:12 +0100 (BST)
I have no expertise other than that gained from maintaining a
"European minority languages" page on a voluntary basis, and am not
involved in any standards work, but here are my naive impressions
of what needs to be done:-
1. There needs to be an immediate mass registration of languages.
I would just register all the languages in the Ethnologue.
2. Not only the "languages", but also the nodes in the language
family trees need to be registered. This is both to avoid
the thorny political questions of what is and what isn't a language,
and for convenience. Someone with a good knowledge of Nedersaksisch
(Low Saxon) should be able to specify that they are happy
to accept web-pages in Nedersaksisch without having to specify
all 13 languages which the Ethnologue currently divides it into.
3. For future flexibility and to avoid questions of what is and
what isn't a language, the nodes and leaves should all be part
of the same system - the same address space. This would mean
that nodes which contain only a single language would not be
needed and would disappear - e.g. node 1267 and code JPN for
Japanese in the current Ethnologue would merge. Such nodes seem
to exist only to maintain "nodes" and "languages" as separate
4. The system should also include extinct and historical languages -
e.g. "Middle English", "Middle Irish", "Old Irish", "Classical
Latin", "Mediaeval Latin".
5. A hierarchical naming system is not possible. Any hierarchy is
too unstable and subject to change. So the entities need to have
distinct independent identifiers.
6. There are not enough three-letter combinations to cover all the
above requirements. I think a new set of four-letter codes
should be devised on the same mnemonic principles as the
current Ethnologue. At the same time, the opportunity should
be taken to use codes which are mnemonic in the language itself
rather than in English - e.g. "DEUT" for German rather than "GER".
(I believe that this is mostly only an issue for European
languages, which were known to English speakers for long enough
for separate English names to have developed, different from the
7. The system of language codes needs to be backed up by an
online database, like the present Ethnologue but better.
It should be revised on a continuous basis. The URL addresses
should be guaranteed to be available "for all time". They
should be a bit simpler than the present Ethnologue addresses
8. As well as the textual "web-page" type information contained in
the present Ethnologue, the online database must make available
standardised "relational database" type information via SQL queries.
If a web page is labeled as "Twents" and a user has declared that
they are happy to accept "Nedersaksisch" then a browser should be
able to determine from the database that "Twents" is a subset of
"Nedersaksisch". Because there is now so much port-blocking for
security reasons, such SQL queries would probably have to be
transmitted over HTTP.
9. As knowledge about languages increases, there are bound to be
lots and lots of changes. Old nodes and language codes will
disappear and new ones be created. Old codes must never be
reused. Although most of their information in the database may
no longer be maintained, the code must still exist in the
database, and one piece of information must be maintained -
namely the closest existing node which the old code was a member of.
E.g. Even if the Lower Saxon languages are reorganised and the
code for Twents becomes deprecated, there must be a record in
the database showing that it existed and is a member of the
existing category "Nedersaksisch".
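
To make points 8 and 9 a little more concrete, here is a rough sketch of
the kind of lookup a browser might do. It is only an illustration under
stated assumptions: the table layout, the function names and the codes
"GMNC", "NDSK" and "TWNT" are all invented for the example ("DEUT" is the
code suggested in point 6), the tree is heavily simplified, and an
in-memory SQLite database stands in for the online database.

```python
import sqlite3

# Hypothetical schema: nodes and languages share one table (point 3),
# each with an independent code and a pointer to its parent node.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entity ("
    " code TEXT PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " parent TEXT REFERENCES entity(code),"
    " deprecated INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO entity (code, name, parent, deprecated) VALUES (?, ?, ?, ?)",
    [
        # Invented codes and a simplified tree, for illustration only.
        ("GMNC", "Germanic", None, 0),
        ("NDSK", "Nedersaksisch", "GMNC", 0),
        ("TWNT", "Twents", "NDSK", 0),
        ("DEUT", "Deutsch", "GMNC", 0),
    ],
)


def is_within(conn, code, ancestor):
    """Return True if `code` is `ancestor` itself or lies anywhere below it."""
    row = conn.execute(
        """
        WITH RECURSIVE chain(code) AS (
            SELECT ?
            UNION
            SELECT e.parent
            FROM entity e JOIN chain c ON e.code = c.code
            WHERE e.parent IS NOT NULL
        )
        SELECT 1 FROM chain WHERE code = ?
        """,
        (code, ancestor),
    ).fetchone()
    return row is not None


def closest_existing(conn, code):
    """Walk upwards from `code` to the nearest entry that is not deprecated."""
    while code is not None:
        row = conn.execute(
            "SELECT parent, deprecated FROM entity WHERE code = ?", (code,)
        ).fetchone()
        if row is None:
            return None  # code was never registered
        parent, deprecated = row
        if not deprecated:
            return code
        code = parent
    return None


def deprecate(conn, code):
    """Mark a code as deprecated; the row is kept so the code is never reused."""
    conn.execute("UPDATE entity SET deprecated = 1 WHERE code = ?", (code,))
```

With this layout, is_within(conn, "TWNT", "NDSK") answers the question in
point 8, and after deprecate(conn, "TWNT") the call
closest_existing(conn, "TWNT") still leads back to "NDSK", which is the
one piece of information that point 9 asks to be kept.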
Caoimhín Ó Donnaíle