Peter_Constable@sil.org Peter_Constable@sil.org
Wed, 4 Sep 2002 13:08:44 -0500

On 09/04/2002 12:20:18 AM Tex Texin wrote:

>I think there is value in such a tag, not as a language tag, but as an
>indication of the intended market(s) or class of users.
>These are two very different things....

I certainly agree that es-americas has a limitation in that there are no
guidelines for spell or grammar checking (though, interestingly, I bet the
translators in John's example do check spelling against some convention). I
see, too, your point that the "language" of the kind of content in question
is in some sense artificial, and that it may be wise not to confuse such an
artificial, market-driven construct with natural languages.

There is a problem, though, and we should be looking for some way to solve
it. Let's consider the general problem to be solved using John's situation
as an example: Reuters has subscribers throughout the world, including many
parts of Latin America. They have content that has been translated into
Spanish in a way that is assumed to be acceptable to Spanish-speaking
subscribers in Latin America, but not necessarily elsewhere. When
subscribers ask for content, they somehow give some indication of their
language preferences. In current implementations, this can be done in Web
browsers in a way that results in the use of the HTTP accept-language
header. (That's not to say that another mechanism couldn't be devised.) So,
when Spanish-speaking subscribers in Colombia or Mexico or Argentina
request a given article, John wants their system to return the same
resource in each case. The current problem is that those users may be
specifying their language preferences in terms of es-MX or es-CO or es-AR.
On the other hand, the single resource cannot be catalogued with
metadata tagging equivalent to lang="es-MX, es-CO, es-AR, ...". (Note that
doing that is equally problematic in terms of implications for spell or
grammar checking.)

One possible solution could be to create duplicate copies of the single
resource and tag each for a different country-variant, es-MX, es-CO, etc.
Another option is the one that involves es-americas, and it would involve a
revision to the server software so that it would know that requests for
(e.g.) es-MX can be satisfactorily filled by returning content tagged as
es-americas. Now, it sounds to me that you're thinking of perhaps another
approach. I'm curious to know what you have in mind.
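
To make the second option concrete, here is a minimal sketch in Python of
what the server-side revision might look like. It is only an illustration
under stated assumptions: the fallback table, the function names and the
catalogue layout are all hypothetical, and "es-americas" is the tag proposed
in this thread, not a registered one.

```python
# Hypothetical table: country-variant requests that may be filled by
# content carrying a market-level tag. "es-americas" is the tag proposed
# in this thread, not a registered one.
MARKET_FALLBACKS = {
    "es-mx": "es-americas",
    "es-co": "es-americas",
    "es-ar": "es-americas",
}


def parse_accept_language(header):
    """Return the language ranges of an Accept-Language header, best first."""
    ranges = []
    for position, item in enumerate(header.split(",")):
        parts = item.strip().split(";")
        tag = parts[0].strip().lower()
        if not tag:
            continue
        quality = 1.0  # default weight when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat a malformed weight as "not acceptable"
        if quality > 0:
            # Sort by descending weight, then by position in the header.
            ranges.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(ranges)]


def select_resource(header, catalogue):
    """Return the catalogue entry for the first range that can be served."""
    available = {tag.lower(): resource for tag, resource in catalogue.items()}
    for tag in parse_accept_language(header):
        # Try the exact tag first, then its market-level fallback (if any).
        for candidate in (tag, MARKET_FALLBACKS.get(tag)):
            if candidate in available:
                return available[candidate]
    return None
```

With a catalogue holding a single copy tagged es-americas, requests for
es-MX, es-CO or es-AR would all be served that one copy. The sketch does not
attempt prefix matching, so a request for plain "es" would need a separate
rule.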

Assuming another solution that involves something other than a "language"
tag, we are still left with the question as to what an appropriate language
tag for John's content would be. It appears that "es" is less than ideal,
since it is not acceptable to speakers from Spain. On the other hand, as
John pointed out, labelling it as "es-MX" (say) is simply an arbitrary
choice, and it also puts an inappropriate bias on the data that may limit
its use unnecessarily. Whatever solution we adopt for how to deliver the
given resource to the market for which it was intended, we still have to
consider how we will deal with this problem.

As for the question of spell and grammar checking, I wonder if this is
actually a non-issue. Perhaps someone more familiar with the Spanish case
can comment in regard to that case. In particular, I'm thinking that, if
the content is truly neutral with regard to the sub-language variants, then
for cataloguing / retrieval purposes one should be able to make multiple
copies with each tagged for a single country variant -- es-MX, es-CO, etc.
But then, tagging a copy in this way suggests that the given resource can
be spell or grammar checked using the conventions appropriate to the given
country, es-MX or whatever, and that such checking should not result in
errors. In other words, you should be able to take the original resource
and run any or all of the various spell and grammar checkers without
encountering problems. I'm wondering whether, in a case such as Spanish
(e.g. Reuters' Latin American Spanish content), that is in fact possible
in practice.
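
The test described above could be sketched roughly as follows. This is an
assumption-laden illustration, not a real spell checker: each country
variant is represented by a plain set of accepted lowercase words, and both
function names are hypothetical. Grammar checking is not covered.

```python
import re


def check_against_variants(text, dictionaries):
    """Map each variant tag to the words in text that its word list rejects.

    dictionaries is assumed to map a tag such as "es-MX" to a set of
    accepted lowercase words; real checkers are far more elaborate.
    """
    # Runs of letters only (Unicode-aware), compared case-insensitively.
    words = set(re.findall(r"[^\W\d_]+", text.lower()))
    return {tag: sorted(words - accepted) for tag, accepted in dictionaries.items()}


def is_variant_neutral(text, dictionaries):
    """True if every variant's word list accepts every word in text."""
    report = check_against_variants(text, dictionaries)
    return all(not unknown for unknown in report.values())
```

If is_variant_neutral returned True for every article in the collection,
tagging copies as es-MX, es-CO and so on would at least not mislead a
spell checker.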

- Peter

Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>