Spanglish

Thu Jan 5 20:35:47 CET 2017

I would really like to understand how this is supposed to be used for locale selection.

AFAICT there are many dissimilar Spanglishes that may be mutually unintelligible.  For other mixtures it seems like it may be even worse.

It doesn’t help to give the user menu items in Spanglish if it’s not in a form of Spanglish they can read.

Worst case, presenting a different “wrong” mix could appear like I’m teasing or belittling you for mixing languages.

Is there some context I’m missing where sufficient consistency of grammar and vocabulary is present to make it useful to a user community?  If there are multiple such communities, how would they be differentiated?

-Shawn

From: Ietf-languages [mailto:ietf-languages-bounces at alvestrand.no] On Behalf Of Mark Davis ☕️
Sent: Thursday, January 5, 2017 1:57 AM
To: Peter Constable <petercon at microsoft.com>
Cc: ietflang IETF Languages Discussion <ietf-languages at iana.org>; John Cowan <cowan at ccil.org>
Subject: Re: Spanglish

I think that the original proposal didn't use terminology clearly enough, so I rewrote the ticket to take some of these comments into account and add further explanation. Here it is for further comment.

=====

We have gotten requests for locale identifiers to represent "hybrid" locales, such as Hinglish, which can be used to select content that is a "deep" mixture of two (or more) languages. See the Background below.

Proposal

This ticket proposes adding new T extension keys 'h0' and 'h1' for identifying hybrid locales.

Examples:
es-t-h0-en

Spanglish

Spanish with an admixture of English

en-t-h0-es

Spanglish

English with an admixture of Spanish

Note: the boundary between these two will be rather fuzzy, like most cases in identifying languages. We'd recommend that es-t-h0-en be used unless English clearly predominates.

One could then also have
es-t-hi-h0-en

Spanglish translated from Hindi

A second key 'h1' is defined, indicating that the source language for the transform is a hybrid, much as we have done with the transliteration s0 and d0 keys. The value of h1 is a language tag indicating that the source language for -t- is a hybrid with that language, allowing formulations like
es-t-hi-h1-en

Spanish translated from Hinglish

es-t-hi-h0-en-h1-en

Spanglish translated from Hinglish

If needed, one could even indicate what the script of the "mixed-in" language is:
ru-t-h0-en-latn

Runglish

Russian with an admixture of English in Latin script

ru-t-h0-en-cyrl

Runglish

Russian with an admixture of English in Cyrillic script

Should we ever have need for hybrids of more than two languages, corresponding pairs of keywords such as h2 and h3 can be defined.
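
To make the structure of these tags concrete, here is a minimal Python sketch of how a tag using the proposed keys might be split into its parts. It assumes that any subtag after the -t- singleton consisting of one letter followed by one digit is a key, that the subtags before the first key form the source language tag, and that no other extension follows -t-. The function name parse_hybrid_tag and the layout of the returned dictionary are illustrative only; they are not part of the proposal or of RFC 6497, and the sketch does no validation.

```python
import re

# Assumed key shape: one letter followed by one digit, e.g. h0, h1, s0, d0.
KEY = re.compile(r"^[a-z][0-9]$")


def parse_hybrid_tag(tag):
    """Split a tag such as 'es-t-hi-h0-en-h1-en' into its parts (sketch only)."""
    subtags = tag.lower().split("-")
    if "t" not in subtags:
        # No T extension: the whole tag is the main language tag.
        return {"language": "-".join(subtags), "source": None, "fields": {}}

    t_index = subtags.index("t")
    language = "-".join(subtags[:t_index])

    source = []   # subtags of the source language (before the first key)
    fields = {}   # key -> list of value subtags
    key = None
    for subtag in subtags[t_index + 1:]:
        if KEY.match(subtag):
            key = subtag
            fields[key] = []
        elif key is None:
            source.append(subtag)
        else:
            fields[key].append(subtag)

    return {
        "language": language,
        "source": "-".join(source) or None,
        "fields": {k: "-".join(v) for k, v in fields.items()},
    }
```

Under these assumptions, parse_hybrid_tag("es-t-hi-h0-en-h1-en") gives the main language es, the source hi, and the fields h0=en and h1=en, which corresponds to "Spanglish translated from Hinglish" in the examples above.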

Background

Hybrid locales have intermixed content from 2 (or more) languages, often with one language's grammatical structure applied to words in another. See also https://en.oxforddictionaries.com/definition/spanglish for the use of the term “hybrid”. This is not simply content that has two languages in it, such as a book of parallel text containing English and Spanish:
On the 24th of May, 1863, my uncle, Professor Liedenbrock, rushed into his little house, No. 19 Königstrasse, one of the oldest streets in the oldest portion of the city of Hamburg…

El domingo 24 de mayo de 1863, mi tío, el profesor Lidenbrock, regresó precipitadamente a su casa, situada en el número 19 de la König-strasse, una de las calles más antiguas del barrio viejo de Hamburgo…

While text in a document can be tagged as partly in one language and partly in another, that is not the same as having a hybrid locale. There is a difference between having a Spanish document that has some passages quoted in English and a Spanglish document. And fine-grained tagging doesn't handle combinations like Denglisch "gedownloadet" or the Franglais "downloadé" (cf. http://www.duden.de/rechtschreibung/downloaden), which are in neither language.

More importantly, it doesn't work for a very common use case: locale selection. To communicate requests for localized content and internationalization services, locales are used, which are an extension of language tags. When people pick a language from a menu, internally they are picking a locale (en-GB, es-419, etc). If you want an application to support Spanglish or Hinglish, then you have to have a locale to represent that.
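
One way to picture how such a locale would behave during selection is the truncation fallback described for "Lookup" in RFC 4647 (section 3.4). The following Python sketch is only an illustration under stated assumptions: the function name lookup, the default of "en", and the lists of available locales are hypothetical, and a real application may use a different matching scheme entirely.

```python
def lookup(requested, available, default="en"):
    """RFC 4647-style lookup: truncate the requested tag until something matches."""
    # Compare case-insensitively, but return the tag as the application spelled it.
    supported = {tag.lower(): tag for tag in available}
    subtags = requested.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in supported:
            return supported[candidate]
        # Remove the rightmost subtag.
        subtags.pop()
        # A trailing single-character subtag (such as the 't' singleton)
        # is removed together with the subtag that followed it.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return default
```

With this sketch, a request for es-t-h0-en is served by a Spanglish localization if the application lists one, and otherwise falls back to plain es, e.g. lookup("es-t-h0-en", ["en", "es"]) returns "es".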

Luckily, this falls within the scope of the T extension. While the title of the RFC (https://tools.ietf.org/html/rfc6497) is “Transformed Content”, the abstract makes it clear that the scope is broader than the term "transformed" might indicate to a casual reader:

This document specifies an Extension to BCP 47 that provides subtags
for specifying the source language or script of transformed content,
including content that has been transliterated, transcribed, or
translated, or in some other way influenced by the source. It also
provides for additional information used for identification.

BTW, the U extension was never in question. Syntactically it does not allow for values that have two letters, like language subtags, because they collide with valid key values in the U extension. As a matter of fact, that was the primary reason for the T extension. Had we been prescient when we devised U, we would have only used keys that could never collide with language subtags, and then never would have needed the T extension.
Mark

On Wed, Jan 4, 2017 at 3:07 AM, Peter Constable <petercon at microsoft.com> wrote:
There’s no such thing as “code-switch languages” unless you mean the individual languages that a speaker switches between when they are code switching. Definition:

“Code switching is the practice of moving back and forth between two languages, or between two dialects or registers of the same language.”

As I mentioned in a post last week, the issue at hand in Michael’s scenario is that a reader of content needs to have some level of competency in both English and Spanish (or pick whatever combination of multiple languages) in order for the content to be understandable and relevant.

Neither the t nor the u extension would be appropriate for this. A new “s” extension could be devised that has explicitly additive semantics: “en-s-es” means that both English _and_ Spanish are required to understand the content. Potentially, these could be chained, e.g. “en-s-s0-es-s1-fi” could mean that you need to speak English _and_ Spanish _and_ Finnish to understand the content, with the ordering providing a prioritization (e.g., if your proficiency in Finnish is more limited than in Spanish, you may get by). But a key question is what is required of matching, since general-purpose matching algorithms likely won’t pay attention to extensions.

Peter

From: Ietf-languages [mailto:ietf-languages-bounces at alvestrand.no] On Behalf Of Mark Davis ☕️
Sent: Monday, January 2, 2017 11:53 PM
To: Phillips, Addison <addison at lab126.com>
Cc: ietflang IETF Languages Discussion <ietf-languages at iana.org>; John Cowan <cowan at ccil.org>
Subject: Re: Spanglish

Also, John raises a concern about being able to express transformations into code-switch languages. I added a comment with a reformulation to address that:

As to John's concern in comment:1 (http://unicode.org/cldr/trac/ticket/9956#comment:1) about being able to have a transformation of a code-switch language: I think that is a far less important requirement than to have a general mechanism for code-switch languages.

However, I think we can accommodate that — and at the same time alleviate some of people's concerns about the terms 'source' and 'target' — by changing the syntax so that the value of the c0 key is the language that is mixed into the main language tag. We then get tags structured as follows:
es-t-c0-en

Spanglish

Spanish with an admixture of English

en-t-c0-es

Spanglish

English with an admixture of Spanish

Note: the boundary between these two will be rather fuzzy, like most cases with languages. It is probably best to recommend that es-t-c0-en be used unless English clearly predominates.

One could then have
es-t-hi-c0-en

Spanglish translated from Hindi

Although it would again be quite infrequently used, we can easily allow for the case of a code-switch language being the source, and even have the translation of one code-switch language into another. We do this by using another keyword, much as we have done with the transliteration s0 and d0 keys. So we define c1 as a language that is mixed into the source language for -t-, allowing formulations like
es-t-hi-c0-en-c1-en

Spanglish translated from Hinglish

The more I think about it, the more I like this formulation.

Mark

On Tue, Jan 3, 2017 at 8:15 AM, Mark Davis ☕️ <mark at macchiato.com> wrote:
-u- is syntactically unsuitable, as well as being a worse fit semantically. You can use es-t-en-c0 or es-t-en-gb-c0. You can't use es-u-en-c0 or es-u-en-gb-c0, because any two-letter subtag is a reserved keyword.
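
The collision can be illustrated with a small Python sketch. It assumes the U extension syntax of RFC 6067, in which two-character subtags are keys and subtags of three to eight characters are attributes (before the first key) or type values (after a key). The function name classify_u_subtags is hypothetical, and the sketch does no validation and expects a tag that contains -u-.

```python
def classify_u_subtags(tag):
    """Label each subtag after '-u-' purely by length, as the U syntax does."""
    subtags = tag.lower().split("-")
    after_u = subtags[subtags.index("u") + 1:]
    labels = []
    seen_key = False
    for subtag in after_u:
        if len(subtag) == 2:
            # Any two-character subtag is read as a key, even 'en' or 'gb'.
            seen_key = True
            labels.append((subtag, "key"))
        elif seen_key:
            labels.append((subtag, "type"))
        else:
            labels.append((subtag, "attribute"))
    return labels
```

Under that reading, the "en" in es-u-en-c0 is classified as a key rather than as a language subtag, so the intended meaning cannot be expressed there.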

I was not arguing in favor of using -u- extension for code-switch languages, just saying that it /is/ a broad mechanism.

Mark

On Mon, Jan 2, 2017 at 7:10 PM, Phillips, Addison <addison at lab126.com> wrote:
>
> > The much
> > more general mechanism is the U one, which by now has a variety of
> > different settings.
>
> Ah, yes, forgot about that. I think it would be much better then to use the U
> extension.
>

The U extension is for Locale information. I don't think that fits any better. If anything, it's a worse fit.

Addison
