Script codes in RFC 3066

Wed Apr 9 22:21:21 CEST 2003

On Wed, 9 Apr 2003, John Cowan wrote:

> > a system of atomic (unstructured) codes together with an online
> > database giving:
> >
> >   - hierarchic information for all extant codes
>
> What kind of hierarchic information are you referring to: below the language
> level (like en-us vs. en-ie) or above it?  If the latter, I agree; there is
> no need to tag that.  But it is useful to see at once that en-us and en-ie have
> some degree of interoperability (if we met, I'd probably understand your
> English) which is appropriately expressed by a tag hierarchy.

I was thinking mostly of hierarchic information above the language
level - e.g. recording the fact that Scottish Gaelic is one of the
Goidelic languages, which in turn are a branch of the Celtic languages,
which in turn are a branch of the Indo-European family.

However, my inclination would be to extend the "database" mechanism
slightly below "language" level to encompass major dialects, partly
because I think it will be a useful mechanism, partly because opinions
may change as to what is and is not a separate language - e.g.
"Norwegian/Bokmål/Nynorsk", "English/Scots/Ulster-Scots",
"Serbo-Croat/Serbian/Croatian/Bosnian".
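
A minimal sketch of how such a hierarchy could be stored and queried,
assuming each atomic code simply records the code of the node above it.
The table and function names are invented for illustration. "gla", "gle",
"cel", "ine", "nob", "nno", "nor" and "gem" are existing ISO 639-2 codes;
"goid" is a made-up code for the Goidelic node, and the table is
deliberately coarse (it skips intermediate branches).

```python
# Hypothetical parent table: atomic code -> code of the node above it.
# "goid" is an invented code; the others are ISO 639-2 codes.
PARENT = {
    "gla": "goid",   # Scottish Gaelic -> Goidelic
    "gle": "goid",   # Irish -> Goidelic
    "goid": "cel",   # Goidelic -> Celtic
    "cel": "ine",    # Celtic -> Indo-European
    "nob": "nor",    # Bokmal -> Norwegian
    "nno": "nor",    # Nynorsk -> Norwegian
    "nor": "gem",    # Norwegian -> Germanic
    "gem": "ine",    # Germanic -> Indo-European
}


def ancestors(code):
    """Return the chain of nodes above `code`, nearest first."""
    chain = []
    seen = {code}
    while code in PARENT:
        code = PARENT[code]
        if code in seen:
            raise ValueError("cycle in hierarchy at " + code)
        seen.add(code)
        chain.append(code)
    return chain


def nearest_common_node(a, b):
    """Return the lowest node shared by `a` and `b`, or None."""
    above_b = set([b] + ancestors(b))
    for node in [a] + ancestors(a):
        if node in above_b:
            return node
    return None


print(ancestors("gla"))                    # chain from Goidelic upwards
print(nearest_common_node("nob", "nno"))   # the shared Norwegian node
```

Because nodes and languages share one table, reclassifying something from
"dialect" to "language" (or back) only changes entries in the table; the
codes themselves stay the same.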

The main question is whether it is sensible to assume that all
language-aware software will in the future work with such a
"languages database".  The database would be quite small - tens
of kilobytes, and even less if compressed - and software which did
not try to be all-encompassing could cache a small part of it
and generate an Internet query for anything else.
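
The "cache a small part and query for the rest" idea might look roughly
like this. It is only a sketch: the class name, the record layout and the
`fetch_remote` callable are all assumptions, and no real lookup service or
protocol is implied. The remote query is passed in as a function so the
example runs without a network.

```python
class CachingLanguageDB:
    """Hypothetical client holding a small local subset of the database."""

    def __init__(self, local_records, fetch_remote):
        # local_records: dict of code -> record shipped with the software
        # fetch_remote: callable(code) -> record, standing in for an
        #               Internet query to the full database
        self._cache = dict(local_records)
        self._fetch_remote = fetch_remote

    def lookup(self, code):
        """Return the record for `code`, querying remotely only on a miss."""
        if code not in self._cache:
            self._cache[code] = self._fetch_remote(code)
        return self._cache[code]


def demo_fetch(code):
    # Stand-in for the network query; a real client would contact a server.
    return {"name": "(fetched record for %s)" % code}


db = CachingLanguageDB({"eng": {"name": "English"}}, demo_fetch)
print(db.lookup("eng"))   # served from the local subset
print(db.lookup("gla"))   # fetched once, then cached
```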

The database could also store all sorts of other useful information
apart from language family hierarchies.  e.g.:
   - "is a sign language"
   - "is an artificial language"
   - "is believed to be extinct"
   - "is a South American language"
   - "is usually written in Cyrillic script"
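
Properties like those listed above could be kept as one record per code.
The following is a sketch under assumptions: the field names and the
`codes_where` helper are invented, the language codes are ISO 639-2 codes,
and "Cyrl" is the ISO 15924 script code for Cyrillic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LanguageInfo:
    """Hypothetical per-code record; all field names are illustrative."""
    name: str
    sign_language: bool = False
    artificial: bool = False
    extinct: bool = False
    region: Optional[str] = None
    usual_script: Optional[str] = None   # ISO 15924 script code


INFO = {
    "sgn": LanguageInfo("Sign languages", sign_language=True),
    "epo": LanguageInfo("Esperanto", artificial=True),
    "got": LanguageInfo("Gothic", extinct=True),
    "que": LanguageInfo("Quechua", region="South America"),
    "rus": LanguageInfo("Russian", usual_script="Cyrl"),
}


def codes_where(predicate):
    """Return the sorted codes whose record satisfies `predicate`."""
    return sorted(code for code, info in INFO.items() if predicate(info))


print(codes_where(lambda info: info.usual_script == "Cyrl"))
print(codes_where(lambda info: info.extinct or info.artificial))
```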

I mentioned before that codes for historical languages from different
periods will eat into the codespace, as will codes made obsolete as
scholarship progresses, but I forgot to mention another major
factor.  I think that the codes for nodes in the hierarchy should
be in the same codespace, because at the lower levels opinions
may change over time as to what is a "node" and what is a separate
language (as in the examples above), and because it is as well to keep to
the same system for the higher levels too.

It seems a pity to me that 4-letter semi-mnemonic codes were not
used for the languages codespace and 3-letter codes for scripts,
rather than the other way round.  But I guess it is too late to
change that now.

Caoimhín