Unilingua

Sat Sep 17 07:48:17 CEST 2005

John,
Sorry my message wasn't clear.

Yes, encodings are below the radar of ordinary people. They are more likely
set by people who have some expertise or at least some motivation to do the
right thing. And yet, with a very small set of choices to make, they are
still generally wrong. And yes, encodings are critical, and yet they are
still generally wrong. So I suggest this represents an upper bound of sorts
on how likely less motivated, less knowledgeable people are to use proper
language tags for less critical needs. So language tags will be more wrong
on average than encoding tags.

Now with respect to people using choices like English, French, etc.: yes, of
course. But knowing when it is important to use en-sg vs. en, or even when
en-sg or en-za is different from en-gb... I don't think so.
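
To make the en vs. en-sg question concrete, here is a minimal Python sketch
of prefix matching in the spirit of the language-range rule in RFC 3066: a
range matches a tag if it equals the tag or is a prefix of it followed by
"-". The names matches and fallback_chain are hypothetical, chosen only for
illustration, and the truncation is deliberately naive.

```python
def matches(language_range: str, tag: str) -> bool:
    """True if tag equals the range, or starts with the range plus '-'."""
    r = language_range.lower()
    t = tag.lower()
    # "*" is treated as a wildcard that matches any tag.
    return r == "*" or t == r or t.startswith(r + "-")


def fallback_chain(tag: str) -> list[str]:
    """Naive truncation: drop one trailing subtag at a time (assumption)."""
    parts = tag.lower().split("-")
    return ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]


if __name__ == "__main__":
    for candidate in ("en", "en-SG", "en-ZA", "en-GB", "eng"):
        print(candidate, matches("en", candidate))
    print(fallback_chain("en-SG"))
```

Under this rule a request for en is satisfied by en-sg, en-za, and en-gb
alike, so software using it cannot tell when the finer distinction mattered.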

The difficulties in getting proper labels are social, agreed. So why offer
more choices and make it more difficult?
There are no practical or meaningful guides that we can give, so what we
will get is bupkis. (Yes, we set direction in 3066, but last I looked it is
hardly prescriptive.)

How do you know there are no other varieties of Japanese, so that ja is the
right answer?
I am under the impression there is a variant in Hokkaido and maybe others.
In any event, it is difficult to prove that none exists, and it is a claim
that can only be made by experts with knowledge of where Japanese speakers
are and how well their language conforms, not lay people tagging content.

Maybe it is a version of Japanese I intend to use worldwide so I should use
ja-001?

All I really know is that I went to the trouble of getting an in-country
translator, everyone said it was important, so it seems it should be ja-JP,
so I can distinguish my quality translations from those of the slackers who
don't use in-country translators.  ;-)

So yes there are guides in 3066bis, but I need an expert to interpret them,
especially for languages beyond the few most popular ones.

I don't want to spend too much time on boont, but if boont represents the
vocabulary, I consider it unlikely that x-boont might somehow be used in
other languages and somehow reflect ja-boont rather than en-boont. But fine,
then x-en-boont. The point was we need not have bothered the whole internet
community with the request nor burdened it with the ongoing weight of
remembering that once somebody needed a boont tag. It looks to me like 99%
of the boont discussion is not about boont text but what is the boont tag
for? (I call those midnight phone calls asking me what it is a "boonty
call".)
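
A rough illustration of the difference, as a Python sketch: a tag that
begins with the private-use prefix x- tells software nothing about the
underlying language, while en-boont exposes en as its primary subtag. The
helper primary_language is hypothetical, and treating everything after x-
as opaque is an assumption of this sketch.

```python
from typing import Optional


def primary_language(tag: str) -> Optional[str]:
    """Return the primary language subtag, or None for private-use tags."""
    subtags = tag.lower().split("-")
    if subtags[0] == "x":
        # Private use: the remaining subtags are opaque to general software,
        # even if they happen to look like a language code.
        return None
    return subtags[0]


if __name__ == "__main__":
    for tag in ("en-boont", "x-boont", "x-en-boont"):
        print(tag, primary_language(tag))
```

Reading the en out of x-en-boont would work only by private agreement
between the parties exchanging the tag.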

To your comment about "getting tags and not bothering us"-- way back when I
thought the point of this exercise was not to minimize requests to the
registrar or discussion of tags on this list, but to somehow ensure that
internet practices were well-defined and well-behaved. I don't see how this
can be used by non-linguists to tag content properly. And if most people are
just going to use English, Japanese, and other high-level, less-specific
tags, then why would industry implement the full proposal?

tex

"John.Cowan" wrote:
> 
> Tex Texin scripsit:
> 
> > In another sphere we have a small number of character encodings, and we
> > can't get software to properly identify the encoding in play. Why should we
> > believe that with thousands of language codes available they will be used
> > properly?
> 
> Encodings and languages just aren't comparable, and that for two reasons.  First,
> encodings are below the radar of ordinary people, languages are not.  If you
> ask an author "What encoding is your letter/paper/memo/report in?", they
> will probably answer "Say what?"  But if you ask, "What language is your
> document written in?" you get a sensible answer, like "English", "French",
> "Japanese", or "Navajo".
> 
> Second, encodings are critical: if the encoding is wrong, the document
> contains mojibake of one sort or another.  If your Norwegian document
> is mistaken for French, the worst that happens is that it appears to be
> full of spelling errors or is pronounced wrongly by a text-to-speech
> program -- which aren't very reliable anyhow.
> 
> > Even with the small number of codes we have today, I have difficulty
> > determining which code properly describes a document.
> 
> There are two separate problems here:  (a) What language is the document in?
> (b) What is the proper code for that language?
> 
> For the first one, there is nothing you can do except get the answer from
> the author, per above.  There are no reliable third-party tests for distinguishing
> one language from another.
> 
> To determine which language tag to apply, one needs to match the author's idea
> of the language with the canonical name of the language.  This works well for
> widely known languages like English, but not so well for various Caribbean
> creoles.  Nevertheless, one can with the application of good will work out
> the answer.
> 
> Note that these difficulties are entirely social, not technical.
> 
> > There are no
> > guidelines or rules or ways to determine whether a document is one branch of
> > a language versus another, except with the crudest of guesses. Various
> > experts make pronouncements about Japanese being ja and not ja-jp, or latn
> > not being required for en, since en is not generally represented in another
> > script, but only an expert knows all of the possibilities and which
> > circumstances never (or nearly never) occur, and which ones require
> > additional descriptors or not. Given that is the case, I really don't need a
> > more refined set of language choices.
> 
> The rules are laid down in RFC 3066bis in more detail, but what it amounts to
> is, tag sensibly.  Since there are no other national varieties of Japanese,
> the tag ja-jp makes little sense (it's probably a garbled version of the
> *locale* tag ja_JP, which is quite a different matter).  The RFC 3066bis
> registry will contain explicit guidance on which languages should not
> have script tags when written in their usual scripts: some languages, however,
> don't have a single standard script.
> 
> > If and when someone gives me a way to review a document and determine the
> > proper language tag,
> 
> If you don't know what language it's in and what the context is, you can't
> tag the document at all.
> 
> > and we all agree on the right tag, and it doesn't
> > require three linguists to do the determination, I'll believe we have a
> > system worth all these refinements. Oh, and I also need to believe the
> > distinctions are something that my application may utilize.
> 
> Documents should be tagged accurately even if a particular application can't
> make use of the information.  Accuracy is not always precision, however.
> It may indeed require a linguist to tag a document written in some obscure
> language that has not been written before, but what's the alternative?
> 
> > I understand that for some very few purposes the ability to distinguish
> > between thousands of languages is useful.
> 
> > I just don't see that most users,
> > or most applications need it, and most content providers are incapable of
> > correctly tagging their content. So I don't see why we should burden general
> > applications with it.
> >
> > So what good has it done that we have registered Boontling? For all the web
> > pages and applications that do something with boontling, was the world
> > really much better than if we had left them on their own with x-boontling?
> 
> Because en-boont is basically en with strange vocabulary, and that fact
> matters.  x-boont could be *anything*.
> 
> > Is the world so much better that we registered boontling and denied or
> > delayed es-americas?
> 
> Fixed.
> 
> > The ISO 639 standards serve their purposes for linguists. The majority of
> > software on the internet does not require this level of distinction and does
> > not need to be burdened with it and I don't see that 3066bis will be
> > deployed the way it has been envisioned.
> 
> Every tag we've approved for some time now has been 3066bis compliant.
> This will only go on, and the demand will accelerate.  The point of
> 3066bis is to provide ways and means for people to have the tags they
> need without bothering us.
> 
> --
> John Cowan  jcowan at reutershealth.com  www.reutershealth.com  www.ccil.org/~cowan
> If a traveler were informed that such a man [as Lord John Russell] was
> leader of the House of Commons, he may well begin to comprehend how the
> Egyptians worshiped an insect.  --Benjamin Disraeli

-- 
-------------------------------------------------------------
Tex Texin   cell: +1 781 789 1898   mailto:Tex at XenCraft.com
Xen Master                          http://www.i18nGuy.com

XenCraft		            http://www.XenCraft.com
Making e-Business Work Around the World
-------------------------------------------------------------