Unilingua

John Cowan  jcowan at reutershealth.com
Sat Sep 17 05:33:44 CEST 2005


Tex Texin scripsit:

> In another sphere we have a small number of character encodings, and we
> can't get software to properly identify the encoding in play. Why should we
> believe that with thousands of language codes available they will be used
> properly?

Encodings and languages just aren't comparable, for two reasons.  First,
encodings are below the radar of ordinary people; languages are not.  If you
ask an author "What encoding is your letter/paper/memo/report in?", they
will probably answer "Say what?"  But if you ask, "What language is your
document written in?", you get a sensible answer, like "English", "French",
"Japanese", or "Navajo".

Second, encodings are critical; languages are not.  If the encoding is
wrong, the document contains mojibake of one sort or another.  But if your
Norwegian document is mistaken for French, the worst that happens is that
it appears to be full of spelling errors, or is pronounced wrongly by a
text-to-speech program -- and such programs aren't very reliable anyhow.
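
To make the asymmetry concrete, here is a minimal sketch (in Python, not
from the original message) of what an encoding error does to a document,
next to what a language-label error does:

    # The same bytes decoded under the wrong encoding become mojibake;
    # a wrong *language* label leaves the text itself untouched.
    text = "Blåbærsyltetøy"          # Norwegian for "blueberry jam"
    data = text.encode("utf-8")      # the bytes as the author wrote them

    print(data.decode("utf-8"))      # Blåbærsyltetøy -- correct encoding
    print(data.decode("latin-1"))    # BlÃ¥bÃ¦rsyltetÃ¸y -- mojibake

Mislabel the same document as French, on the other hand, and every byte
survives intact.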

> Even with the small number of codes we have today, I have difficulty
> determining which code properly describes a document.

There are two separate problems here:  (a) What language is the document in?
(b) What is the proper code for that language?

For the first one, there is nothing you can do except get the answer from
the author, per above.  There are no reliable third-party tests for distinguishing
one language from another.

To determine which language tag to apply, one needs to match the author's idea
of the language with the canonical name of the language.  This works well for
widely known languages like English, but not so well for various Caribbean
creoles.  Nevertheless, with the application of good will, one can work out
the answer.
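
As a sketch of that matching step, one might keep a table from common names
to subtags.  The table below is a hypothetical fragment; a real
implementation would consult the Description fields of the RFC 3066bis
registry instead:

    # Hypothetical name-to-subtag table for the matching step; a real
    # implementation would read the registry's Description fields.
    CANONICAL = {
        "english": "en",
        "french": "fr",
        "japanese": "ja",
        "navajo": "nv",
    }

    def tag_for(author_answer):
        """Map the author's answer to a language subtag, or None."""
        return CANONICAL.get(author_answer.strip().lower())

    print(tag_for("Navajo"))   # nv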

Note that these difficulties are entirely social, not technical.

> There are no
> guidelines or rules or ways to determine whether a document is one branch of
> a language versus another, except with the crudest of guesses. Various
> experts make pronouncements about Japanese being ja and not ja-jp, or latn
> not being required for en, since en is not generally represented in another
> script, but only an expert knows all of the possibilities and which
> circumstances never (or nearly never) occur, and which ones require
> additional descriptors or not. Given that is the case, I really don't need a
> more refined set of language choices.

The rules are laid down in more detail in RFC 3066bis, but what they amount
to is: tag sensibly.  Since there are no other national varieties of
Japanese, the tag ja-jp makes little sense (it's probably a garbled version
of the *locale* identifier ja_JP, which is quite a different matter).  The
RFC 3066bis registry will contain explicit guidance on which languages
should not carry script subtags when written in their usual scripts; some
languages, however, don't have a single standard script.
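
As a sketch of how an application might apply that guidance, the following
reads the Suppress-Script fields from a local copy of the registry.  The
file name and the simple record-jar parsing are assumptions, and the real
registry also wraps long lines, which this ignores:

    # True if tagging `language` with `script` adds no information,
    # i.e. the registry records that script as the language's default.
    def redundant_script(language, script, path="language-subtag-registry"):
        for record in open(path, encoding="utf-8").read().split("%%\n"):
            fields = dict(line.partition(": ")[::2]
                          for line in record.splitlines() if ": " in line)
            if fields.get("Type") == "language" and fields.get("Subtag") == language:
                return fields.get("Suppress-Script") == script
        return False

    print(redundant_script("en", "Latn"))  # True: en-Latn says no more than en
    print(redundant_script("sr", "Latn"))  # False: Serbian has no single script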

> If and when someone gives me a way to review a document and determine the
> proper language tag,

If you don't know what language it's in and what the context is, you can't
tag the document at all.

> and we all agree on the right tag, and it doesn't
> require three linguists to do the determination, I'll believe we have a
> system worth all these refinements. Oh, and I also need to believe the
> distinctions are something that my application may utilize.

Documents should be tagged accurately even if a particular application can't
make use of the information.  Accuracy is not the same as precision, however.
It may indeed require a linguist to tag a document in some obscure language
that has never been written down before, but what's the alternative?

> I understand that for some very few purposes the ability to distinguish
> between thousands of languages is useful.

> I just don't see that most users,
> or most applications need it, and most content providers are incapable of
> correctly tagging their content. So I don't see why we should burden general
> applications with it.
> 
> So what good has it done that we have registered Boontling? For all the web
> pages and applications that do something with boontling, was the world
> really much better than if we had left them on their own with x-boontling?

Because en-boont is basically en with strange vocabulary, and that fact
matters.  x-boont could be *anything*.
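
The difference shows up directly in matching.  Here is a minimal sketch of
right-truncation fallback, in the style of the prefix matching RFC 3066
describes:

    # Truncate subtags from the right to find usable fallbacks.
    def fallback_chain(tag):
        parts = tag.split("-")
        return ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]

    print(fallback_chain("en-boont"))  # ['en-boont', 'en'] -- degrades to English
    print(fallback_chain("x-boont"))   # ['x-boont', 'x']   -- degrades to nothing

An application that knows nothing of Boontling can still treat en-boont as
English; x-boont gives it nothing to fall back on.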

> Is the world so much better that we registered boontling and denied or
> delayed es-americas?

Fixed.

> The ISO 639 standards serve their purposes for linguists. The majority of
> software on the internet does not require this level of distinction and does
> not need to be burdened with it and I don't see that 3066bis will be
> deployed the way it has been envisioned. 

Every tag we've approved for some time now has been 3066bis-compliant.
That will continue, and the demand will only accelerate.  The point of
3066bis is to provide ways and means for people to get the tags they
need without bothering us.

-- 
John Cowan  jcowan at reutershealth.com  www.reutershealth.com  www.ccil.org/~cowan
If a traveler were informed that such a man [as Lord John Russell] was
leader of the House of Commons, he may well begin to comprehend how the
Egyptians worshiped an insect.  --Benjamin Disraeli