Unilingua

Caoimhin O Donnaile caoimhin at smo.uhi.ac.uk
Sat Sep 17 17:31:14 CEST 2005


Hopefully this is unnecessary, as I expect that Tex is out on a limb
in his views compared to most members of this list.  But just in
case it is necessary, I'll say that I strongly disagree with his last
message.

> I understand that for some very few purposes the ability to
> distinguish between thousands of languages is useful. I just don't see
> that most users, or most applications need it, and most content
> providers are incapable of correctly tagging their content. So I don't
> see why we should burden general applications with it.

I have just labelled nearly all of the thousands of pages on our 
website, www.smo.uhi.ac.uk, with:
    lang=gd - Scottish Gaelic (most of them)
    lang=ga - Irish Gaelic
    lang=gv - Manx Gaelic
    lang=en - English
as appropriate.
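(For anyone unfamiliar with it, this is just the standard HTML lang
attribute.  A sketch, not our actual markup:

    <html lang="gd">
    ...
    <p lang="en">This paragraph is marked as English.</p>
    ...
    </html>

The attribute on the <html> element sets the default language for the
whole page, and an attribute on an inner element such as <p lang="en">
overrides it, which handles mixed-language pages.)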

I look forward to the day, hopefully very soon, when I can use
Google to find pages in Scottish Gaelic.  Scottish Gaelic has only
about 50,000 speakers, but it has not been denied a language code.
There are probably thousands of languages with more speakers than
that, and I don't see why they should be denied language codes either.
Nor do I believe that they should have to go through a process
of individually finding a free code, submitting a request, naming
50 texts in the language and so on.  I don't see why Manx Gaelic,
with only a few hundred fluent speakers and struggling for survival,
should not be given a code with which its speakers can get on with
labelling their documents and finding them with Google, nor should
any other language in a similar position.  It seems to me a disgrace
that there is still no computing standard in place assigning codes
to most of the world's languages, despite the existence for many
years of the Ethnologue database.

> In another sphere we have a small number of character encodings, and
> we can't get software to properly identify the encoding in play. Why
> should we believe that with thousands of language codes available they
> will be used properly?

Character encodings are now on their way out, fast!  Languages are here
to stay, hopefully.  In a couple of years the only character encoding
worth bothering much about will be Unicode, probably UTF-8.  Whatever
might have been desirable in the past, the emphasis now should be on
getting documents in obscure encodings converted to Unicode, rather
than on getting software to recognise obscure encodings.
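
(In case it is useful to anyone: the conversion itself is usually a
one-line job with a tool such as GNU iconv.  A sketch, with a made-up
file name and source encoding:

    iconv -f ISO-8859-1 -t UTF-8 old-page.html > utf8-page.html

plus, of course, updating whatever charset the page itself declares,
e.g. in its Content-Type meta tag, to utf-8.)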

That's how things look to me anyway.

Caoimhín Ó Donnaíle

