Spanglish

Wed Dec 21 19:49:26 CET 2016

Wow, such a long thread!

There's a not-unreasonable tendency to focus on how to describe what something _is_. It is a text that uses a code-switching mixture of English and Spanish. But another important consideration is to ask what problem we are trying to solve.

One approach is to get granular, as Martin suggested: tag each phrase or word that's English or that's Spanish. That would be a correct thing to do, and it would be the best thing to do if the problem is to get accurate spell-checking. 

But fine-granularity tagging wouldn't provide a general characterization of the document as a whole, which is what seems to be desired here. But why? What problem is in mind? 

One possible problem is to characterize the document to match user relevance. In the examples I've seen provided in this thread, a reader would need to be proficient (at some level) in both English and Spanish in order for the document to be relevant. So, for solving this problem, what is needed is not simply a conjunction of two languages (this document contains X and Y), but also an indication that both are essential.

So, if we assume _that_ problem (we could repeat this assuming other problems to be solved), how should we characterize the document as a whole?

One suggestion in the thread is to use "mul". That would be all but useless, as it only says that the document is not uniformly one language.

Another suggestion in the thread is to use the "-t" extension. It's not clear to me how that would be helpful. If I see "en-t-es", that suggests the text was translated from Spanish, and hence (e.g.) that there may be some things less clear than in the original. But the tag clearly suggests that a reader need be proficient only in English in order to read the document. Now, the "-t" extension could be extended to handle a new semantic to indicate two required languages. But remember that many processes, including most matching processes, will ignore the extension.
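
To illustrate the point about extensions being ignored, here is a rough sketch in Python of what a process that discards extensions might do before matching. The function name strip_extensions is my own invention, not part of any library, and the logic is deliberately simplified: it cuts the tag at the first single-character subtag after the primary one, which also drops any private-use ("x") sequence.

```python
def strip_extensions(tag):
    """Drop everything from the first singleton subtag onward.

    Simplified assumption: any one-character subtag after the first
    position starts an extension (or private-use) sequence.
    """
    kept = []
    for index, subtag in enumerate(tag.split("-")):
        if index > 0 and len(subtag) == 1:
            break  # singleton found: the rest is extension data
        kept.append(subtag)
    return "-".join(kept)


print(strip_extensions("en-t-es"))      # en
print(strip_extensions("es-spanglis"))  # es-spanglis
```

Under that assumption, "en-t-es" is treated as plain "en", so whatever the extension was meant to say about Spanish never reaches the matcher, while a variant subtag survives.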

Michael's suggestion of a variant subtag does provide a solution to this problem to some degree. If I see "es-spanglis", it's clear from the primary subtag that the reader must be proficient in Spanish, but then the variant subtag is adding a qualification — a certain kind of Spanish. The limitation is that, without specific understanding of the variant subtag, it's not clear how the qualification aligns to the user. Matching algorithms will match the document if "es-spanglis" is explicitly requested, but only as a fallback if either or both of "en" and "es" are requested.

I think the best way to handle this particular problem is not within a language tag at all, but rather at a higher level. If a user's language-preference list includes "en" and includes "es", and if the document is cataloged as having two essential-language values, then a matching algorithm that knows to compare all essential-language values against the user's list will work. But, of course, that's a lot of higher-level infrastructure that would need to go into lots of different databases and content-delivery mechanisms. So, while this may be conceptually best, it's certainly not the most practical or feasible.
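
As a rough sketch of what such a higher-level match could look like, assuming a catalog that stores a list of essential languages per document (the names primary_subtag and matches_all_essential, and the idea of comparing on the primary subtag only, are my own assumptions for illustration, not an existing API):

```python
def primary_subtag(tag):
    """Return the lowercased primary language subtag, e.g. 'es' for 'es-MX'."""
    return tag.lower().split("-")[0]


def matches_all_essential(user_prefs, essential_languages):
    """True only if every essential language appears in the user's list.

    Comparison is on the primary subtag only, which is a simplification.
    """
    known = {primary_subtag(tag) for tag in user_prefs}
    return all(primary_subtag(lang) in known for lang in essential_languages)


# Hypothetical catalog entry for a Spanglish document.
spanglish_doc = ["en", "es"]

print(matches_all_essential(["en-US", "es-MX", "fr"], spanglish_doc))  # True
print(matches_all_essential(["es"], spanglish_doc))                    # False
```

The key difference from ordinary matching is the "all" rather than "any": a reader who lists only one of the two languages is not matched.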

So, evaluating these options, I think Michael's suggestion, while not ideal, may be the most useful, practical and feasible, with the best ROI. 

This is assuming one particular problem. It may not suit all scenarios. In particular, linguists studying language contact may need more. And I wouldn't expect any "es-spanglis" spelling checkers any time soon.

Peter