Unilingua

Mon Sep 19 15:58:48 CEST 2005

On 9/17/05, Tex Texin <tex at xencraft.com> wrote:
> So I suggest this represents an upper bound of sorts
> for how likely it is less motivated, less knowledgeable, people will be to
> use proper language tags for less critical needs. So language tags will be
> more wrong on average than encoding tags.

In my experience with Project Gutenberg's Distributed Proofreaders,
I've spent a lot of time explaining encodings, and I don't think half
the people have any more clue than when I started. On the other hand,
I've once or twice had to discuss the difference between "Greek,
Ancient" and "Greek, Modern", or between Scots, English and Middle
English, but it's not a continuing problem.

It will get a little more hairy when we move to XML and there's use of
language tags that aren't from a drop-down list, but I don't foresee
the problems you do. Sure, the occasional quote is going to be marked
French when it should be marked Italian, but I don't see how any
tagging standard is going to prevent that.

> So yes there are guides in 3066bis, but I need an expert to interpret them,
> especially for languages beyond the few most popular ones.

So why do you keep talking about Japanese? Navaho is one of the
easiest languages in the world to tag; there's one tag, nv, and no
reason to use anything else. Same with dozens of other languages.
Sure, it's hard to clearly identify a few of the minor languages, but
you probably don't need an expert to interpret them; you probably will
never encounter an ambiguous case, because they're all so rare.

English in a broad sense has four tags: ang, enm, sco and en, and a
huge collection of dialects. Tagging English text is subject to a host
of ambiguities; when, if ever, is it appropriate to use en alone? If
Dickens, or worse Chaucer, is published in modern American spelling,
how should they be tagged?  But that's one of the world's most popular
languages. Chinese has its own issues, but that's another of the
world's most popular languages. We could add all the conlangs we
wanted, and it wouldn't add or subtract from those issues one bit.

> The point was we need not have bothered the whole internet
> community with the request nor burdened it with the ongoing weight of
> remembering that once somebody needed a boont tag. 

I don't understand how it's a burden.

> I don't see how this can
> be used by non-linguists to tag content properly. 

Tag with the flat language tag; which one should be obvious 99.9%
of the time. It may be underspecified, but it won't be
incorrect.
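
As a rough illustration of what falling back to the flat tag could look
like in software, here is a minimal sketch. The function name flat_tag
is hypothetical, and the handling of "i-" and "x-" prefixes is my own
assumption about how one might avoid truncating grandfathered or
private-use tags; it is not taken from 3066bis itself.

```python
def flat_tag(tag: str) -> str:
    """Reduce a hyphen-separated language tag to its primary subtag.

    Hypothetical helper: "en-US" becomes "en", "nv" stays "nv".
    Tags starting with "i" or "x" (grandfathered / private use) are
    returned whole, since their first subtag is not a language code.
    """
    # Normalize case and tolerate underscore separators (an assumption).
    normalized = tag.strip().replace("_", "-").lower()
    primary = normalized.split("-", 1)[0]
    if primary in ("i", "x"):
        return normalized
    return primary


if __name__ == "__main__":
    for example in ("en-US", "enm", "sco", "nv", "i-navajo"):
        print(example, "->", flat_tag(example))
```

The result is less specific than the original tag, but it never claims
more than the tagger actually knew.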