Spanglish

Thu Jan 5 10:56:43 CET 2017

I think that the original proposal didn't use terminology clearly enough,
so I rewrote the ticket to take some of these comments into account and add
further explanation. Here it is for further comment.

=====

We have gotten requests for locale identifiers to represent "hybrid"
locales, such as Hinglish, which can be used to select content that is a
"deep" mixture of two (or more) languages. See the Background below.

Proposal <http://unicode.org/cldr/trac/ticket/9956#Proposal>

This ticket proposes adding new T extension keys 'h0' and 'h1' for
identifying hybrid locales.

Examples:
es-t-*h0-en* Spanglish Spanish with an admixture of English
en-t-*h0-es* Spanglish English with an admixture of Spanish

*Note: the boundary between these two will be rather fuzzy, like most cases
in identifying languages. We'd recommend that es-t-h0-en be used unless
English clearly predominates.*

One could then also have
es-t-hi-*h0-en* Spanglish translated from Hindi

A second key 'h1' is defined, indicating that the *source* language for the
transform is a hybrid, much as we have done with the transliteration s0
and d0 keys. The value of h1 is a language tag indicating that the
source language for -t- is a hybrid with that language, allowing
formulations like
es-t-hi-*h1-en* Spanish translated from Hinglish
es-t-hi-*h0-en*-*h1-en* Spanglish translated from Hinglish

If needed, one could even indicate what the script of the "mixed-in"
language is:
ru-t-*h0-en-latn* Runglish Russian with an admixture of English in Latin
script
ru-t-*h0-en-cyrl* Runglish Russian with an admixture of English in Cyrillic
script
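
To make the structure of these tags concrete, here is a rough Python sketch
of how one might pull them apart. It is only an illustration of the examples
above, not part of the ticket: the function name parse_hybrid_tag is made up,
and it assumes that a field key is a letter followed by a digit and that
everything up to the next key is that key's value. It does not validate
against the RFC 6497 grammar.

```python
import re

# Assumption: a T-extension field key is one letter followed by one digit.
KEY = re.compile(r"^[a-z][0-9]$")


def parse_hybrid_tag(tag):
    """Split a tag such as 'es-t-hi-h0-en-h1-en' into its parts.

    Returns a dict with the main language tag, the -t- source tag (or None),
    and the field values (for example h0 and h1) keyed by field key.
    """
    subtags = tag.lower().split("-")
    if "t" not in subtags[1:]:
        return {"main": "-".join(subtags), "source": None, "fields": {}}

    t = subtags.index("t", 1)
    main, rest = subtags[:t], subtags[t + 1:]

    source, fields, key = [], {}, None
    for sub in rest:
        if KEY.match(sub):
            key = sub            # start of a new field such as h0 or h1
            fields[key] = []
        elif key is None:
            source.append(sub)   # still inside the source language tag
        else:
            fields[key].append(sub)

    return {
        "main": "-".join(main),
        "source": "-".join(source) or None,
        "fields": {k: "-".join(v) for k, v in fields.items()},
    }
```

With this, parse_hybrid_tag("es-t-hi-h0-en-h1-en") gives main "es", source
"hi", and the fields h0 = "en" and h1 = "en", which reads as "Spanglish
translated from Hinglish".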

Should we ever have need for hybrids of more than two languages,
corresponding pairs of keywords such as h2 and h3 can be defined.

Background <http://unicode.org/cldr/trac/ticket/9956#Background>

Hybrid locales have intermixed content from 2 (or more) languages, often
with one language's grammatical structure applied to words in another. See
also https://en.oxforddictionaries.com/definition/spanglish for the use of
the term “hybrid”. This is *not* simply content that has two languages in
it, such as a book of parallel text containing English and Spanish:
On the 24th of May, 1863, my uncle, Professor Liedenbrock, rushed into his
little house, No. 19 Königstrasse, one of the oldest streets in the oldest
portion of the city of Hamburg…

El domingo 24 de mayo de 1863, mi tío, el profesor Lidenbrock, regresó
precipitadamente a su casa, situada en el número 19 de la König-strasse,
una de las calles más antiguas del barrio viejo de Hamburgo…

While text in a document can be tagged as partly in one language and partly
in another, that is not the same as having a hybrid locale. There is a
difference between having a Spanish document that has some passages quoted
in English and a Spanglish document. And fine-grained tagging doesn't
handle combinations like the Denglisch "gedownloadet" or the Franglais
"downloadé" (cf. http://www.duden.de/rechtschreibung/downloaden), which are
in neither language.

More importantly, it doesn't work for a very common use case: locale
selection. To communicate requests for localized content and
internationalization services, locales are used, which are an extension of
language tags. When people pick a language from a menu, internally they are
picking a locale (en-GB, es-419, etc). If you want an application to
support Spanglish or Hinglish, then you have to have a locale to represent
that.
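
As an illustration of what selection with such a locale could look like,
here is a small Python sketch that uses the progressive truncation of the
RFC 4647 "lookup" scheme. This is an assumption on my part about how an
application might fall back, not something the ticket specifies; the names
lookup_candidates and select_locale and the default value are hypothetical.

```python
def lookup_candidates(tag):
    """Yield progressively shorter tags, in the manner of RFC 4647 lookup."""
    subtags = tag.split("-")
    while subtags:
        yield "-".join(subtags)
        subtags.pop()
        # A trailing single-character subtag (such as the 't' singleton)
        # is removed together with the subtag that followed it.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()


def select_locale(requested, available, default="en"):
    """Pick the first available locale matching a truncation of the request."""
    by_lower = {a.lower(): a for a in available}
    for candidate in lookup_candidates(requested.lower()):
        if candidate in by_lower:
            return by_lower[candidate]
    return default  # hypothetical application default
```

Under these assumptions, a request for es-t-h0-en is served Spanglish
content where the application has it, and falls back to plain es where it
does not.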

Luckily, this falls within the scope of the T extension. While the title of
the RFC (https://tools.ietf.org/html/rfc6497) is “Transformed Content”,
the abstract makes it clear that the scope is broader than the term
"transformed" might indicate to a casual reader:

This document specifies an Extension to BCP 47 that provides subtags
for specifying the source language or script of transformed content,
including content that has been transliterated, transcribed, or
translated, or *in some other way influenced by the source. It also
provides for additional information used for identification.*

BTW, the U extension was never in question. Syntactically it does not allow
for values that have two letters, like language subtags, because they
collide with valid key values in the U extension. As a matter of fact, that
was the primary reason for the T extension. Had we been prescient when we
devised U, we would have only used keys that could never collide with
language subtags, and then never would have needed the T extension.
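
To make the collision concrete, here is a minimal Python sketch. It relies
on the key shapes as I read them in the two RFCs (a -u- key is any two
alphanumerics per RFC 6067, a -t- field key is a letter followed by a digit
per RFC 6497); the function name looks_like_key is made up for this example.

```python
import re

U_KEY = re.compile(r"^[a-z0-9]{2}$")  # -u- key: any two alphanumerics
T_KEY = re.compile(r"^[a-z][0-9]$")   # -t- field key: a letter, then a digit


def looks_like_key(subtag, extension):
    """Return True if the subtag would be read as a key in that extension."""
    pattern = U_KEY if extension == "u" else T_KEY
    return bool(pattern.match(subtag.lower()))
```

Here looks_like_key("en", "u") is true, so a language subtag after -u-
would be taken for a key, while looks_like_key("en", "t") is false and
looks_like_key("h0", "t") is true, so -t- can carry both.
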
Mark

On Wed, Jan 4, 2017 at 3:07 AM, Peter Constable <petercon at microsoft.com>
wrote:

> There’s no such thing as “code-switch languages” unless you mean the
> individual languages that a speaker switches between when they are code
> switching. Definition:
>
>
>
> “Code switching is the practice of moving back and forth between two
> languages, or between two dialects or registers of the same language.”
>
>
>
> As I mentioned in a post last week, the issue at hand in Michael’s
> scenario is that a reader of content needs to have some level of competency
> in both English and Spanish (or pick whatever combination of multiple
> languages) in order for the content to be understandable and relevant.
>
>
>
> Neither the t or u extensions would be appropriate for this. A new “s”
> extension could be devised that has explicitly additive semantics:
> “en-s-es” means that both English _*and*_ Spanish are required to
> understand the content. Potentially, these could be chained, e.g.
> “en-s-s0-es-s1-fi” could mean that you need to speak English _*and*_
> Spanish _*and*_ Finnish to understand the content, with the ordering
> providing a prioritization (e.g., if your proficiency in Finnish is more
> limited than Spanish, you may get by). But a key question is what is
> required of matching, since general-purpose matching algorithms likely
> won’t pay attention to extensions.
>
>
>
>
>
> Peter
>
>
>
> *From:* Ietf-languages [mailto:ietf-languages-bounces at alvestrand.no] *On
> Behalf Of *Mark Davis ☕️
> *Sent:* Monday, January 2, 2017 11:53 PM
> *To:* Phillips, Addison <addison at lab126.com>
> *Cc:* ietflang IETF Languages Discussion <ietf-languages at iana.org>; John
> Cowan <cowan at ccil.org>
> *Subject:* Re: Spanglish
>
>
>
> Also, John raises a concern about being able to express transformations
> into code-switch languages. I added a comment with a reformulation to
> address that:
>
>
>
> As to John's concern in comment:1
> <http://unicode.org/cldr/trac/ticket/9956#comment:1> about being able to
> have a transformation of a code-switch language: I think that is a far less
> important requirement than to have a general mechanism for code-switch
> languages.
>
> However, I think we can accommodate that — and at the same time alleviate
> some of people's concerns about the terms 'source' and 'target' — by
> changing the syntax so that the *value* of the c0 key is the language
> that is mixed into the main language tag. We then get tags structured as
> follows:
>
> es-t-*c0-en*
>
> Spanglish
>
> Spanish with an admixture of English
>
> en-t-*c0-es*
>
> Spanglish
>
> English with an admixture of Spanish
>
> *Note: the boundary between these two will be rather fuzzy, like most
> cases with languages. It is probably best to recommend that es-t-c0-en
> be used unless English clearly predominates.*
>
> One could then have
>
> es-t-hi-*c0-en*
>
> Spanglish translated from Hindi
>
> Although it would be again quite infrequently used, we can easily allow
> for the case of a code-switch language being the source, and even have the
> translation of one code-switch language into another. We do this by using
> another keyword, much as we have done with the transliteration s0 and d0
> keys. So we define c1 as a language that is mixed into the source language
> for -t-, allowing formulations like
>
> es-t-hi-*c0-en*-*c1-en*
>
> Spanglish translated from Hinglish
>
>
>
> The more I think about it, the more I like this formulation.
>
>
> Mark
>
>
>
> On Tue, Jan 3, 2017 at 8:15 AM, Mark Davis ☕️ <mark at macchiato.com> wrote:
>
> -u- is syntactically unsuitable, as well as being a worse fit semantically.
> You can use es-t-en-c0 or es-t-en-gb-c0. You can't use es-u-en-c0, or
> es-u-en-gb-c0 because any two letter subtag is a reserved keyword.
>
>
>
> I was not arguing in favor of using -u- extension for code-switch
> languages, just saying that it /is/ a broad mechanism.
>
>
> Mark
>
>
>
> On Mon, Jan 2, 2017 at 7:10 PM, Phillips, Addison <addison at lab126.com>
> wrote:
>
> >
> > > The much
> > > more general mechanism is the U one, which by now has a variety of
> > > different settings.
> >
> > Ah, yes, forgot about that. I think it would be much better then to use
> the U
> > extension.
> >
>
> The U extension is for Locale information. I don't think that fits any
> better. If anything, it's a worse fit.
>
> Addison
>
>
>
>
>