Language for taxonomic names, redux

Mon Feb 27 21:42:58 CET 2017

On 27-02-17 03:42, Martin J. Dürst wrote:
> On 2017/02/25 00:15, Michael Everson wrote:
>> Regarding automatic translation:
>>
>> http://www.galeriainfo.hu/index1.php?link=muveszeink&muvesz_id=174
>> when autotranslated by Google translates “Hosszú György” as “George
>> Long”. Clearly that’s not desirable (even if from one point of view it
>> might be “accurate”) — but there’s no way to use language tagging to
>> achieve “protection” of the personal name, is there?

   No, language tagging won't help. This is a job for semantic tagging,
i.e. wrapping the name in something like <person></person>. See below.

> 
> Well, one way might be to just declare the content to not be in any
> language. That should avoid translation, but in many cases won't feel
> right.

   Right.

   For all (or at least many) other practical purposes, it _is_
Hungarian, e.g. it is pronounced as Hungarian in the original text, and
after translation, it may still have to be "spoken" in the Hungarian way
by a TTS engine, depending on many things. If you declare it "not in any
language", the TTS engine will be utterly clueless.

> 
> But when it comes to indicating whether something should be translated
> or not, the best piece of work is ITS (Internationalization Tag Set).
> Now at version 2.0 (see https://www.w3.org/TR/its20/), it covers a lot
> of ground, but the initial motivation was to indicate
> (non)translatability in markup (HTML or XML).
> 
   I was not aware of that ITS thing, and after skimming through it, I
find it profoundly disturbing.

   The decision to translate or not to translate depends very much on
the target language, so this kind of markup would only be useful if
applied by a human translator, before machine translation takes over.

   This is particularly clear - I hope - with geographical names.

   "New Orleans", for example, is not to be translated into Dutch, but
it has to become "Nueva Orleans" in Spanish, "Nov-Orleano" in Esperanto
and "Nouvelle Orléans" in French.

   On the other hand, "New York" will become "Nueva York" in Spanish and
"Novjorko" in Esperanto, but it will stay "New York" in French and Dutch.

   At least the above is true if "New Orleans" is referring to the city.
It is not to be translated at all, however, if it is referring to the
name of a ship, or the name of a musical style, or the title of a movie
(unless the movie itself is translated or subtitled).
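
   To make the point concrete, here is a minimal sketch in Python. The
table and the function name are made up for illustration, and the
entries are only the examples given in this message. It shows that the
decision needs both the semantic category and the target language:

```python
# Hypothetical exonym table, built only from the examples in this message.
# Key: (source name, target language code). A missing key means
# "leave the name as it is".
CITY_EXONYMS = {
    ("New Orleans", "es"): "Nueva Orleans",
    ("New Orleans", "eo"): "Nov-Orleano",
    ("New Orleans", "fr"): "Nouvelle Orléans",
    ("New York", "es"): "Nueva York",
    ("New York", "eo"): "Novjorko",
}


def render_name(name: str, kind: str, target_lang: str) -> str:
    """Return the form of `name` to use in `target_lang`.

    `kind` is the semantic category ("city", "ship", "movie", ...).
    Only cities are looked up; every other kind is passed through.
    """
    if kind != "city":
        return name
    return CITY_EXONYMS.get((name, target_lang), name)


if __name__ == "__main__":
    for lang in ("nl", "es", "eo", "fr"):
        print(lang, render_name("New Orleans", "city", lang))
    print("ship:", render_name("New Orleans", "ship", "fr"))
```

   A single language-independent "do not translate" flag cannot express
this table: the same name needs a different answer per target language
and per category.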

   And "Athens" is usually to be translated if it refers to the capital
of Greece, but is to be left alone if it refers to one of the many
namesake cities in the U.S. - but again, I don't know all the languages
of the world.

   The same is true for proper names. "George Bush" is not to be
translated into any language I am aware of, but "Mark Antony" has to
become "Marcus Antonius" in Dutch and "Markos Antonios" in Greek -
assuming they are referring to the American emperor and the Roman
president, respectively (and yes, I know).

   And these are just the obvious cases. There are also word plays,
historical references and many other things that would get lost when
translating verbatim, and that only somebody with a profound knowledge
of the target language (as well as the source language) can detect and
handle.

   Now, I read in the "overview" that ITS assumes a 3-step process, i.e.
"internationalization, translation, and localization". That's fine,
except that, from reading further, it seems obvious that these ITS tags
are supposed to be applied in the first phase, and that only "[d]uring
the translation phase, the meaning of a source language text is
analyzed, and a target language text that is equivalent in meaning is
determined".

  Clearly this simplistic approach won't wash except in the simplest
of circumstances. Otherwise, as I said above, the "no
translate" tag must be added during the translation step, thereby
defeating the very purpose of enabling automated machine translation.

  It may actually make things worse, because these geographic names (and
other things) would typically be contained in any decent (say)
English-to-French dictionary, but that won't help if the omnilingual
"internationaliser" has naïvely declared "New Orleans" as "off limits"
to any and all translation engines.

  I do not intend to participate in further discussion of ITS, as it is
off-topic here.

> This led to the addition of the 'translate' attribute to HTML5. This
> makes it possible to mark up the above as
>     <span translate=no>Hosszú György</span>
> ITS allows rules to indicate translatability. 

   It would be much more helpful if semantic tagging were to be expanded
and standardized more broadly.

   For example, the NITF (News Industry Text Format) defines things like
nitf:person, nitf:money, etc. It also has nitf:city, nitf:location and
nitf:region (for names of places).

   See https://iptc.org/standards/nitf/

   That would take care of all the issues we have been discussing here,
and a few more.

  A human translator would know these things when (s)he recognises -
from the context - some words as being (e.g.) a proper name, but a
computer needs help. That is where semantic tagging comes in.
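
   A rough sketch of what such help could look like, in Python. The
fragment is invented, and its element names only mimic NITF-style tags;
no real NITF schema or namespace is used, and `segments` is a
hypothetical helper:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment with NITF-like semantic elements (no real schema).
FRAGMENT = (
    "<p>Works by <person>Hosszú György</person> were shown in "
    "<city>New Orleans</city>.</p>"
)


def segments(xml_text):
    """Return (category, text) pairs in document order.

    Plain text gets the category "text"; text inside a direct child
    element gets that element's tag name as its category. Nested
    children are not handled in this sketch.
    """
    root = ET.fromstring(xml_text)
    out = []
    if root.text:
        out.append(("text", root.text))
    for child in root:
        out.append((child.tag, child.text or ""))
        if child.tail:
            out.append(("text", child.tail))
    return out


if __name__ == "__main__":
    for category, text in segments(FRAGMENT):
        print(f"{category:7} {text!r}")
```

   The translation step then sees "person" or "city" next to each
name, and can apply whatever rule the target language calls for.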

> For the case we are
> discussing, 

  And that - taxa - is the only thing that should concern us here and now.

> adding something like the following to the HTML5 <head>
> element will indicate that text in taxonomic Latin isn't to be translated:
> 
>     <script type=application/its+xml id=ru1>
>       <its:rules version="2.0" xmlns:its="http://www.w3.org/2005/11/its"
>            xmlns:h="http://www.w3.org/1999/xhtml">
>         <its:translateRule translate="no"
>                    selector='//*[@lang="la-taxon"]'/>
>       </its:rules>
>     </script>
> 

   That is absolutely redundant. The "lang=la-taxon" tag itself would
tell the translation engine all it needs to know.

   Yes, I know, that needs support from the translation engine. But it
already has to handle many other, much more interesting things, so
support for this case should be a piece of cake to add: "if it's
la-taxon, skip it".
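
   As a sketch of how little that rule asks of an engine, here is a
Python fragment using the standard html.parser module. The class name,
the SKIP_LANGS set and the assumed document language "en" are
illustrative only, and "la-taxon" is the tag proposed in this thread:

```python
from html.parser import HTMLParser

SKIP_LANGS = {"la-taxon"}  # language tags the engine leaves untouched


class SegmentCollector(HTMLParser):
    """Split HTML text into (translatable, text) pairs based on lang."""

    def __init__(self):
        super().__init__()
        self.lang_stack = ["en"]  # assumed document language
        self.segments = []

    def handle_starttag(self, tag, attrs):
        # Inherit the enclosing language unless lang is given explicitly.
        lang = dict(attrs).get("lang") or self.lang_stack[-1]
        self.lang_stack.append(lang)

    def handle_endtag(self, tag):
        # Sketch only: assumes every start tag has a matching end tag.
        if len(self.lang_stack) > 1:
            self.lang_stack.pop()

    def handle_data(self, data):
        translatable = self.lang_stack[-1] not in SKIP_LANGS
        self.segments.append((translatable, data))


def collect(html):
    parser = SegmentCollector()
    parser.feed(html)
    parser.close()
    return parser.segments


if __name__ == "__main__":
    sample = '<p>The wolf (<i lang="la-taxon">Canis lupus</i>) is a canid.</p>'
    for translatable, text in collect(sample):
        print(translatable, repr(text))
```

   Only the segments flagged True would be handed to the translator;
the taxon name passes through unchanged.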

   Or maybe no extra support needs to be added at all. Say the engine
is translating an English text into Hungarian, and it runs into an
embedded "lang=fr" tag: what is it supposed to do? Not throw a fit, I
hope... Either it pulls in a French-to-Hungarian dictionary, or, if it
can't find one, it will not attempt to translate. And since it won't
find a "Taxonomic Latin-to-Hungarian" dictionary, it won't translate.
Voilà.

   Luc