Language for taxonomic names, redux

Felix Sasaki fsasaki at w3.org
Mon Feb 27 23:13:12 CET 2017


> Am 27.02.2017 um 21:42 schrieb Luc Pardon <lucp at skopos.be>:
> 
> 
> 
> On 27-02-17 03:42, Martin J. Dürst wrote:
>> On 2017/02/25 00:15, Michael Everson wrote:
>>> Regarding automatic translation:
>>> 
>>> http://www.galeriainfo.hu/index1.php?link=muveszeink&muvesz_id=174
>>> when autotranslated by Google translates “Hosszú György” as “George
>>> Long”. Clearly that’s not desirable (even if from one point of view it
>>> might be “accurate") — but there’s no way to use language tagging to
>>> achieve “protection” of the personal name, is there?
> 
>   No, language tagging won't help. This is a job for semantic tagging,
> i.e. embed it in something like <person></person>. See below.
> 
>> 
>> Well, one way might be to just declare the content to not be in any
>> language. That should avoid translation, but in many cases won't feel
>> right.
> 
>   Right.
> 
>   For all (or at least many) other practical purposes, it _is_
> Hungarian, e.g. it is pronounced as Hungarian in the original text, and
> after translation, it may still have to be "spoken" in the Hungarian way
> by a TTS engine, depending on many things. If you declare it "not in any
> language", the TTS engine will be utterly clueless.
> 
> 
>> 
>> But when it comes to indicating whether something should be translated
>> or not, the best piece of work is ITS (Internationalization Tag Set).
>> Now at version 2.0 (see https://www.w3.org/TR/its20/), it covers a lot
>> of ground, but the initial motivation was to indicate
>> (non)translatability in markup (HTML or XML).
>> 
>   I was not aware of that ITS thing, and after skimming through it, I
> find it profoundly disturbing.

People in the localization industry have found ITS very helpful - so helpful that the next version of the major format for localization information interchange - XLIFF 2.1 - will have a module to natively support ITS 2.0.

I guess what is disturbing and what is useful depends on the use case. I have not followed this thread, but for the use case of marking what is translatable and what is not, the translate information cited by Martin is a good fit IMO.

> 
>   The decision to translate or not to translate depends very much on the
> target language, so this kind of markup would only be useful if applied
> by a human translator, before machine translation takes over.

In localization workflows there is often a localization engineer involved, not a translator. For a localization engineer, ITS can be very handy for providing information about translatability for a larger set of files. As a matter of fact, ITS was developed with a lot of input from localization engineers, and with such a batch processing scenario in mind.

> 
>   This is particularly clear - I hope - with geographical names.
> 
>   "New Orleans", for example, is not to be translated into Dutch, but
> it has to become "Nueva Orleans" in Spanish, "Nov-Orleano" in Esperanto
> and "Nouvelle Orléans" in French.
> 
>   On the other hand, "New York" will become "Nueva York" in Spanish and
> "Novjorko" in Esperanto, but it will stay "New York" in French and Dutch.
> 
>   At least the above is true if "New Orleans" is referring to the city.
> It is not to be translated at all, however, if it is referring to the
> name of a ship, or the name of a musical style, or the title of a movie
> (unless the movie itself is translated or subtitled).
> 
>   And "Athens" is usually to be translated if it refers to the capital
> of Greece, but is to be left alone if it refers to one of the many
> namesake cities in the U.S. - but again, I don't know all the languages
> of the world.
> 
>   The same is true for proper names. "George Bush" is not to be
> translated into any language I am aware of, but "Mark Anthony" has to
> become "Marcus Antonius" in Dutch and "Markos Antonios" in Greek -
> assuming they are referring to the American emperor and the Roman
> president, respectively (and yes, I know).
> 
>   And these are just the obvious cases. There are also word plays,
> historical references and many other things that would get lost when
> translating verbatim, and that only somebody with a profound knowledge
> of the target language (as well as the source language) can detect and
> handle.

I won’t argue that ITS is suitable to govern the process of name translation. But for the use case of specifying whether a string should be changed or not for a given target language, ITS provides an adequate means; see the XPath expression example provided by Martin.

> 
> 
>   Now, I read in the "overview" that ITS assumes a 3-step process, i.e.
> "internationalization, translation, and localization". That's fine,
> except that, from reading further, it seems obvious that these ITS tags
> are supposed to be applied in the first phase, and that only "[d]uring
> the translation phase, the meaning of a source language text is
> analyzed, and a target language text that is equivalent in meaning is
> determined".
> 
>  Clearly this simplistic approach won't wash except in the most
> simplistic of circumstances.


Those circumstances make up 90+ percent of what happens in localization workflows, and those workflows triggered the development of ITS. ITS focuses on automation, which will never be perfect.

> Otherwise, as I said above, the "no
> translate" tag must be added during the translation step, thereby
> defeating the very purpose of enabling automated machine translation.

translate=no is used in HTML5 and in many XML formats. The people defining these formats would not have made the effort to define this piece of markup without industry pressure - pressure coming from localization engineers asking to make their lives easier and to foster automation.
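
To illustrate how a tool might honor the attribute, here is a minimal Python sketch using only the standard library. It is not how any particular translation engine or localization tool works; the class name TranslatableText and the sample sentence are made up for illustration. It assumes well-nested markup and only handles the values "yes" and "no", with everything else inherited from the parent element.

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they must not touch the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TranslatableText(HTMLParser):
    """Hypothetical helper: split text into (text, translatable) segments."""

    def __init__(self):
        super().__init__()
        self.stack = [True]   # document default: content is translatable
        self.segments = []    # collected (text, translatable) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        value = dict(attrs).get("translate")
        flag = self.stack[-1]          # inherit from the parent by default
        if value is not None:
            if value.lower() == "no":
                flag = False
            elif value.lower() == "yes":
                flag = True
        self.stack.append(flag)

    def handle_endtag(self, tag):
        # Assumes well-nested markup; never pop the document default.
        if tag not in VOID and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.segments.append((text, self.stack[-1]))


parser = TranslatableText()
parser.feed("<p>Works by <span translate=no>Hosszú György</span> are shown.</p>")
parser.close()
for text, translatable in parser.segments:
    print(text, "->", "translate" if translatable else "keep as is")
```

Only the segments flagged as translatable would be handed to the translation step; the name is passed through unchanged.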

> 
>  It may actually make things worse, because these geographic names (and
> other things) would typically be contained in any decent (say)
> English-to-French dictionary, but that won't help if the omnilingual
> "internationaliser" has naïvely declared "New Orleans" as "off limits"
> to any and all translation engines.
> 
>  I do not intend to participate in further discussion of ITS, as it is
> off-topic here.
> 
>> This led to the addition of the 'translate' attribute to HTML5. This
>> makes it possible to mark up the above as
>>    <span translate=no>Hosszú György</span>
>> ITS allows rules to indicate translatability. 
> 
>   It would be much more helpful if semantic tagging were to be expanded
> and standardized more broadly.
> 
>   For example, the NITF (News Industry Text Format) defines things like
> nitf:person, nitf:money, etc. It also has nitf:city, nitf:location and
> nitf:region (for names of places).
> 
>   See https://iptc.org/standards/nitf/
> 
>   That would take care of all the issues we have been discussing here,
> and a few more.
> 
>  A human translator would know these things when (s)he recognises -
> from the context - some words as being (e.g.) a proper name, but a
> computer needs help. That is where semantic tagging comes in.

In the development of ITS, a "context" data category was discussed as well. In the end, the definition of such a data category was abandoned, because everybody would define context differently. E.g., many people have a notion of a person but are not aware of nitf:person.


> 
> 
>> For the case we are
>> discussing, 
> 
>  And that - taxons - is the only thing that should concern us here and now.
> 
>> adding something like the following to the HTML5 <head>
>> element will indicate that text in taxonomic Latin isn't to be translated:
>> 
>>    <script type=application/its+xml id=ru1>
>>      <its:rules version="2.0" xmlns:its="http://www.w3.org/2005/11/its"
>>           xmlns:h="http://www.w3.org/1999/xhtml">
>>        <its:translateRule translate="no"
>>                   selector='//*[@lang="la-taxon"]'/>
>>      </its:rules>
>>    </script>
>> 
> 
>   That is absolutely redundant. The "lang=la-taxon" tag itself would
> tell the translation engine all it needs to know.

Well, translation engines - and localization tools - understand translate=no natively. You don’t need to teach them anything. 
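
As a rough illustration of what the rule quoted above amounts to, the following Python sketch resolves the selector of a translateRule against a document and lists the elements it marks as not translatable. This is not an ITS processor: the function name untranslatable_elements and the sample document are invented for the example, only selectors starting with "//" are handled, and the standard library's ElementTree supports just a subset of XPath.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"

# The rule from the quoted message, as a standalone XML fragment.
RULES = """<its:rules version="2.0" xmlns:its="http://www.w3.org/2005/11/its">
  <its:translateRule translate="no" selector='//*[@lang="la-taxon"]'/>
</its:rules>"""

# Made-up sample content; "la-taxon" is the tag under discussion in this thread.
DOC = """<body>
  <p>The wolf (<i lang="la-taxon">Canis lupus</i>) is a large canine.</p>
  <p>Plain text.</p>
</body>"""


def untranslatable_elements(doc_root, rules_root):
    """Return the elements selected by translate="no" rules (sketch only)."""
    hits = []
    for rule in rules_root.iter("{%s}translateRule" % ITS_NS):
        if rule.get("translate") != "no":
            continue
        selector = rule.get("selector")
        # ElementTree evaluates paths relative to an element, so a
        # "//..." selector is rewritten as ".//..." (descendants of the root).
        if not selector.startswith("//"):
            continue  # other selector forms are out of scope for this sketch
        hits.extend(doc_root.findall("." + selector))
    return hits


hits = untranslatable_elements(ET.fromstring(DOC), ET.fromstring(RULES))
texts = [el.text for el in hits]
print(texts)
```

A tool working this way needs no knowledge of what "la-taxon" means; it only evaluates the selector and applies translate="no" to the matches.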

> 
>   Yes, I know, that needs support from the translation engine. But it
> has to handle many other, much more interesting things already, so this
> case should be a piece of cake to add support for: "if it's la-taxon,
> skip it".


I wouldn’t underestimate the effort. It took 4-5 years until there was agreement to add a "translate" attribute to HTML5.

Regards,

Felix

> 
>   Or maybe it does not even need extra support to be added. Say it is
> translating an English text into Hungarian, and it runs into an embedded
> "lang-fr" tag, what is it supposed to do? Not throw a fit, I hope...
> Either it pulls in a French-to-Hungarian dictionary, or, if it can't
> find one, it will not attempt to translate. And since it won't find a
> "Taxonomic Latin-to-Hungarian" dictionary, it won't translate. Voila.
> 
>   Luc
> 
> _______________________________________________
> Ietf-languages mailing list
> Ietf-languages at alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/ietf-languages


