Script codes in RFC 3066
Caoimhin O Donnaile
caoimhin at smo.uhi.ac.uk
Wed Apr 9 22:21:21 CEST 2003
On Wed, 9 Apr 2003, John Cowan wrote:
> > a system of atomic (unstructured) codes together with an online
> > database giving:
> >
> > - hierarchic information for all extant codes
>
> What kind of hierarchic information are you referring to: below the language
> level (like en-us vs. en-ie) or above it? If the latter, I agree; there is
> no need to tag that. But it is useful to see at once that en-us and en-ie have
> some degree of interoperability (if we met, I'd probably understand your
> English) which is appropriately expressed by a tag hierarchy.
I was thinking mostly of hierarchic information above the language
level - e.g. recording the fact that Scottish Gaelic is a Goidelic
language, that the Goidelic languages are in turn a branch of the
Celtic languages, and that these in turn are a branch of the
Indo-European family.
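
A minimal sketch in Python of how such hierarchic information could be
held against atomic codes. The four-letter codes and the table layout
are invented for illustration and are not registered anywhere.

```python
# Hypothetical atomic codes mapped to their parent node (None = top level).
PARENT = {
    "gdsc": "goid",   # Scottish Gaelic -> Goidelic
    "gdir": "goid",   # Irish           -> Goidelic
    "cymr": "bryt",   # Welsh           -> Brythonic
    "goid": "celt",   # Goidelic        -> Celtic
    "bryt": "celt",   # Brythonic       -> Celtic
    "celt": "ineu",   # Celtic          -> Indo-European
    "ineu": None,
}


def ancestors(code):
    """Return the nodes above `code`, nearest first."""
    chain = []
    seen = {code}
    parent = PARENT[code]
    while parent is not None:
        if parent in seen:
            raise ValueError("cycle in hierarchy at " + parent)
        chain.append(parent)
        seen.add(parent)
        parent = PARENT[parent]
    return chain


def nearest_common_node(a, b):
    """Return the closest node shared by `a` and `b`, or None if unrelated."""
    line_b = set([b] + ancestors(b))
    for node in [a] + ancestors(a):
        if node in line_b:
            return node
    return None
```

The codes themselves carry no structure. The relationship between two
codes is found only by consulting the table, for instance with
nearest_common_node("gdsc", "cymr").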
However, my inclination would be to extend the "database" mechanism
slightly below "language" level to encompass major dialects, partly
because I think it will be a useful mechanism, partly because opinions
may change as to what is and is not a separate language - e.g.
"Norwegian/Bokmal/Nynorsk", "English/Scots/Ulster-Scots",
"Serbo-Croat/Serbian/Croatian/Bosnian".
The main question is whether it is sensible to assume that all
language-aware software will in the future work with such a
"languages database". The database would be quite small - tens
of kilobytes, and even less if compressed - and software which did
not try to be all-encompassing could cache a small part of it
and generate an Internet query for anything else.
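
The caching arrangement could look roughly like the following Python
sketch. The class name, the record layout and the codes are all
hypothetical, and the "Internet query" is replaced by a local stand-in
function, since no such online service is specified here.

```python
# Hypothetical full database; a real one would live on a remote server.
FULL_DATABASE = {
    "gdsc": {"name": "Scottish Gaelic", "parent": "goid"},
    "goid": {"name": "Goidelic", "parent": "celt"},
    "celt": {"name": "Celtic", "parent": "ineu"},
    "ineu": {"name": "Indo-European", "parent": None},
}


def fake_remote_query(code):
    """Stand-in for the Internet query; raises KeyError for unknown codes."""
    return FULL_DATABASE[code]


class CachedLanguageDB:
    """Holds a small local subset and asks the remote source for the rest."""

    def __init__(self, local_records, remote_query):
        self._cache = dict(local_records)
        self._remote_query = remote_query
        self.remote_queries = 0   # counts how often the fallback was needed

    def lookup(self, code):
        if code not in self._cache:
            self.remote_queries += 1
            # Remember the answer so the same code is not fetched twice.
            self._cache[code] = self._remote_query(code)
        return self._cache[code]


# Software that only cares about Scottish Gaelic ships with one record.
db = CachedLanguageDB({"gdsc": FULL_DATABASE["gdsc"]}, fake_remote_query)
```

Looking up "gdsc" is answered locally. Looking up "goid" triggers one
remote query, after which it too is served from the cache.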
The database could also store all sorts of other useful information
apart from language family hierarchies, e.g.:
- "is a sign language"
- "is an artificial language"
- "is believed to be extinct"
- "is a South American language"
- "is usually written in Cyrillic script"
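
Properties like these could be stored as plain fields on each record.
A small Python sketch, in which the field names, the codes and the
choice of sample entries are assumptions made for illustration:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    name: str
    sign_language: bool = False
    artificial: bool = False
    believed_extinct: bool = False
    region: Optional[str] = None
    usual_script: Optional[str] = None


# Hypothetical codes; only the property values matter here.
ENTRIES = {
    "russ": Entry("Russian", usual_script="Cyrillic"),
    "espe": Entry("Esperanto", artificial=True, usual_script="Latin"),
    "bsln": Entry("British Sign Language", sign_language=True),
    "goth": Entry("Gothic", believed_extinct=True),
    "quec": Entry("Quechua", region="South America", usual_script="Latin"),
}


def select(predicate):
    """Return the sorted codes whose entry satisfies `predicate`."""
    return sorted(code for code, e in ENTRIES.items() if predicate(e))
```

A query such as select(lambda e: e.usual_script == "Cyrillic") then
answers "which codes are usually written in Cyrillic script" without
any of that information being encoded in the codes themselves.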
I mentioned before that codes for historical languages from different
periods would eat into the codespace, as would codes made obsolete
as scholarship progresses, but I forgot to mention another major
factor. I think that the codes for nodes in the hierarchy should
be in the same codespace, because at the lower levels opinions
may change over time as to what is a "node" and what is a separate
language (as in the examples above), and because it is as well to keep to
the same system for the higher levels too.
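
The benefit of a shared codespace can be sketched in Python as follows.
The codes, the "kind" labels and the file names are hypothetical, and
the reclassification shown is only an example of an opinion changing,
not a statement about how these varieties ought to be classified.

```python
# Nodes, languages and dialects all draw their codes from one codespace.
NODES = {
    "norw": {"name": "Norwegian", "kind": "language", "parent": None},
    "nbok": {"name": "Bokmal", "kind": "dialect", "parent": "norw"},
    "nnyn": {"name": "Nynorsk", "kind": "dialect", "parent": "norw"},
}

# Existing content tagged with those codes.
TAGGED = {"avis.txt": "nbok", "dikt.txt": "nnyn", "brev.txt": "norw"}


def reclassify(db, code, kind):
    """Change what `code` is considered to be; the code itself is untouched."""
    if kind not in ("node", "language", "dialect"):
        raise ValueError("unknown kind: " + kind)
    db[code] = dict(db[code], kind=kind)


def split_language(db, code):
    """Treat `code` as a grouping node and its dialects as languages."""
    for child, record in list(db.items()):
        if record["parent"] == code and record["kind"] == "dialect":
            reclassify(db, child, "language")
    reclassify(db, code, "node")
```

After split_language(NODES, "norw") only the "kind" fields in the
database have changed. Every tag in TAGGED still refers to a valid
code, so nothing that was already tagged needs to be touched.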
It seems a pity to me that 4-letter semi-mnemonic codes were not
used for the languages codespace and 3-letter codes for scripts,
rather than the other way round. But I guess it is too late to
change that now.
Caoimhín
More information about the Ietf-languages mailing list