Spanglish

Thu Jan 5 20:58:51 CET 2017

On 05-01-17 10:56, Mark Davis ☕️ wrote:
> Here it is for further comment.
> 

  Comments in between.

  My comments are mostly made with one particular but very important use
case in mind, and that is: text-to-speech engines, such as screen
readers for the blind.

  Other use cases may have different considerations.

> Hybrid locales have intermixed content from 2 (or more) languages, often
> with one language's grammatical structure applied to words in another.
> See also https://en.oxforddictionaries.com/definition/spanglish for the
> use of the term “hybrid”. This is /not/ simply content that has two
> languages in it, such as a book of parallel text containing English and
> Spanish:

  1. Both can - and must - be handled by fine-grained tagging, down to
the individual word level if needed.

  The "must" is because in both cases, the words tend to be pronounced
as they are in the original language. That is certainly so in books with
parallel text, and it seems to be the case with (most?) hybrids as well.
With Spanglish, the Spanish words are pronounced in the Spanish way, and
the English ones in the English way. At least that is my understanding.

  Without help from word-level tags, current text-to-speech engines will
pronounce everything as if it were English (in case the tag en-t-h0-es
is used) or Spanish (es-t-h0-en). This will make the text difficult to
understand (if not utterly incomprehensible) to a blind Spanglish speaker.

  Hence the need for fine-grained tagging.
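
  To illustrate, here is a minimal sketch in Python of how a screen
reader front end might split marked-up text into per-language runs, so
that each run can be handed to the matching voice. The sample sentence,
the class LangRuns and the function language_runs are made up for this
illustration. It assumes HTML-style lang attributes and well-formed
markup without void elements such as <br>; it does not describe how any
particular engine works.

```python
from html.parser import HTMLParser


class LangRuns(HTMLParser):
    """Collect (language, text) runs from markup with lang attributes."""

    def __init__(self, default_lang):
        super().__init__()
        self.stack = [default_lang]  # innermost language is last
        self.runs = []

    def handle_starttag(self, tag, attrs):
        # An element without its own lang attribute inherits from its parent.
        self.stack.append(dict(attrs).get("lang") or self.stack[-1])

    def handle_endtag(self, tag):
        # Never pop the document-level default language.
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.runs.append((self.stack[-1], data.strip()))


def language_runs(markup, default_lang):
    """Return the text of `markup` as a list of (language, text) runs."""
    parser = LangRuns(default_lang)
    parser.feed(markup)
    parser.close()
    return parser.runs


# Invented sample: Spanish sentence with two English words tagged as such.
sample = ('Vamos al <span lang="en">mall</span> a comprar '
          '<span lang="en">shoes</span>.')

for lang, text in language_runs(sample, "es"):
    print(lang, text)
```

  Each run carries its own language, so the engine no longer has to
guess from a single document-level tag.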

  2. Such tagging would also benefit the publisher. Without it, a spell
checker would be useless.

  Spanglish does not currently seem to have a well-defined dictionary
and therefore a specialised spell checker is not feasible. That leaves
us with standard spell checkers.

  Now, running an en-spanglis text without mark-up through an English
spell checker will result in all the Spanish words being marked as
errors, and the real spelling mistakes will drown in a sea of red.

  However, once the Spanish words are marked as such, the (standard)
Spanish spell checker will take care of (most of) them and the sea of
red will disappear, leaving only the real mistakes in plain view.
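
  The effect can be sketched in a few lines of Python. The word lists
below are tiny stand-ins for real English and Spanish dictionaries, and
the sample text and the function name misspelled are invented for this
illustration only.

```python
# Toy word lists standing in for real English and Spanish dictionaries.
WORDLISTS = {
    "en": {"the", "mall", "is", "closed", "today"},
    "es": {"pero", "vamos", "mañana"},
}


def misspelled(runs):
    """Return words not found in the word list of their tagged language."""
    errors = []
    for lang, text in runs:
        for word in text.lower().split():
            if word not in WORDLISTS[lang]:
                errors.append(word)
    return errors


# The same text, once tagged as English only and once tagged per language.
# "closd" is the one real spelling mistake.
untagged = [("en", "The mall is closd today pero vamos mañana")]
tagged = [("en", "The mall is closd today"), ("es", "pero vamos mañana")]

print(misspelled(untagged))  # ['closd', 'pero', 'vamos', 'mañana']
print(misspelled(tagged))    # ['closd']
```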

  Actually, the tagging would probably be done as a side effect of the
spell check. At least that's how I do it.

  3. Once the tagging is in place (i.e. presumably before the document
leaves its author's desk), the added value of the proposed "h" extension
to the -t- extension becomes questionable.

  The information about the language mix that the proposal wants to
encode in it can be learned (by language processing software) by
scanning through the document itself.
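
  As a rough sketch of such a scan, assuming the word-level tags have
already been turned into (language, text) runs: the function name
language_mix and the sample runs are hypothetical, and counting
whitespace-separated words is just one possible measure.

```python
from collections import Counter


def language_mix(runs):
    """Return each language's share of the words in the tagged runs."""
    counts = Counter()
    for lang, text in runs:
        counts[lang] += len(text.split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: n / total for lang, n in counts.items()}


# Invented sample: five English words followed by three Spanish ones.
runs = [("en", "The mall is closed today"), ("es", "pero vamos mañana")]

print(language_mix(runs))  # {'en': 0.625, 'es': 0.375}
```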

> And fine-grained tagging
> doesn't work handle combinations like Denglisch "gedownloadet" or the
> Franglais "downloadé" [...] which are in neither
> language.

  That is true, but if these words are in neither language, they are
equivalent to spelling mistakes in all the languages involved.

  From a purist viewpoint, mistakes need not be tagged; they should be
eliminated.

  From a more practical viewpoint, the tagger should ask herself how a
native French speaker would pronounce "downloadé". If it occurs in a
French text, and if it is pronounced in the French way, no separate tag
is needed. If it is pronounced (more or less) in the English way, it
should be tagged as "English". That way, the screen reader will produce
approximately the same sounds as a sighted French speaker reading the
text aloud.

> 
> More importantly, it doesn't work for a very common use case: locale
> selection. To communicate requests for localized content and
> internationalization services, locales are used, which are an extension
> of language tags. When people pick a language from a menu, internally
> they are picking a locale (en-GB, es-419, etc). If you want an
> application to support Spanglish or Hinglish, then you have to have a
> locale to represent that.

   1. That can be handled just fine with a regular tag/subtag combo,
such as en-spanglis or es-spanglis.

   The fact that the number of hybrid languages is potentially very
large is not a valid argument to do away with a perfectly valid
mechanism. We can handle the subtag requests on a case-by-case basis, if
and when a request comes in.
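
   To show how a regular subtag fits into locale selection, here is a
minimal sketch of the lookup scheme from RFC 4647 (section 3.4), which
falls back by truncating subtags from the end. Note that "spanglis" is
the hypothetical subtag from above, not a registered one, and that the
function name lookup and its default argument are assumptions made for
this example.

```python
def lookup(requested, available, default="en"):
    """Pick the best available tag by progressive truncation (RFC 4647)."""
    by_lower = {tag.lower(): tag for tag in available}
    parts = requested.lower().split("-")
    while parts:
        candidate = "-".join(parts)
        if candidate in by_lower:
            return by_lower[candidate]
        parts.pop()
        # A single-character subtag is removed together with the subtag
        # that followed it, so it never ends a candidate tag.
        while parts and len(parts[-1]) == 1:
            parts.pop()
    return default


print(lookup("en-spanglis", ["en", "es", "en-spanglis"]))  # en-spanglis
print(lookup("en-spanglis", ["en", "es"]))                 # en
print(lookup("es-spanglis", ["en", "es"]))                 # es
```

   An application without Spanglish resources then degrades to plain
English or Spanish, depending on the primary language subtag chosen.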

   2. Furthermore, an extension that just indicates the two languages,
such as the one proposed, would make it impossible to distinguish
between different hybrids of the same source languages.

  For example, from the talk page of the Wikipedia entry for Spanglish,
it seems that some people are advocating the idea that Tex-Mex is also
a hybrid of English and Spanish, but different from Spanglish - maybe
different enough to require different support from language processors.

   The extension mechanism, as proposed, cannot cope with that, and
therefore does not meet the stated goal of enabling "locale selection"
to support hybrids.

  The existing subtag mechanism, on the other hand, doesn't have any
problem with that. If and when a request for a "texmex" subtag comes in,
we can examine it and decide if it is indeed sufficiently different from
Spanglish to justify its own subtag. If so, we register it, and if not,
we reject the request and tell the requester to tag with -spanglis.

> 
> Luckily, this falls within the scope of the T extension. While the title
> of the RFC (https://tools.ietf.org/html/rfc6497) is “Transformed
> Content”, the abstract makes it clear that the scope is broader than the
> term "transformed" might indicate to a casual reader: 

   That is actually an argument against extending the T extension, and
in favor of introducing a separate extension for these hybrids - if an
extension is needed at all (which I don't think is the case - see above).

   We cannot expect every single tagger to become familiar with all the
subtle wording in all the RFCs.

   They are just casual users, looking for an appropriate tag to stick
onto the document at hand. We should make that task as easy as possible.

   If the title says "Transformed Content", the casual reader will skip
that section instead of carefully examining the fine print that comes
with it.

   And of course, if a separate subtag is registered for each hybrid (as
I think it should be), the casual user will not even have to dig up the
RFCs. She would simply search the registry for "Spanglish" and come up
with the correct record immediately, comments and all.

   And to repeat myself: sticking a tag - any tag (i.e. extension or
regular subtag) - into the document head does not absolve the author
from the requirement to apply fine-grained tagging inside the document.

   Luc