Language for taxonomic names, redux
Felix Sasaki
fsasaki at w3.org
Mon Feb 27 23:13:12 CET 2017
> On 27.02.2017 at 21:42, Luc Pardon <lucp at skopos.be> wrote:
>
>
>
> On 27-02-17 03:42, Martin J. Dürst wrote:
>> On 2017/02/25 00:15, Michael Everson wrote:
>>> Regarding automatic translation:
>>>
>>> http://www.galeriainfo.hu/index1.php?link=muveszeink&muvesz_id=174
>>> when autotranslated by Google translates “Hosszú György” as “George
>>> Long”. Clearly that’s not desirable (even if from one point of view it
>>> might be “accurate”) - but there’s no way to use language tagging to
>>> achieve “protection” of the personal name, is there?
>
> No, language tagging won't help. This is a job for semantic tagging,
> i.e. embed it in something like <person></person>. See below.
>
>>
>> Well, one way might be to just declare the content to not be in any
>> language. That should avoid translation, but in many cases won't feel
>> right.
>
> Right.
>
> For all (or at least many) other practical purposes, it _is_
> Hungarian, e.g. it is pronounced as Hungarian in the original text, and
> after translation, it may still have to be "spoken" in the Hungarian way
> by a TTS engine, depending on many things. If you declare it "not in any
> language", the TTS engine will be utterly clueless.
>
>
>>
>> But when it comes to indicating whether something should be translated
>> or not, the best piece of work is ITS (Internationalization Tag Set).
>> Now at version 2.0 (see https://www.w3.org/TR/its20/), it covers a lot
>> of ground, but the initial motivation was to indicate
>> (non)translatability in markup (HTML or XML).
>>
> I was not aware of that ITS thing, and after skimming through it, I
> find it profoundly disturbing.
People in the localization industry have found ITS very helpful, so helpful that the next version of the major format for localization information interchange, XLIFF 2.1, will have a module to natively support ITS 2.0.
I guess what is disturbing and what is useful depends on the use case. I have not followed this thread closely, but for the use case of marking what is translatable and what is not, the translate information cited by Martin is a good fit IMO.
>
> The decision to translate or not to translate depends very much on the
> target language, so this kind of markup would only be useful if applied
> by a human translator, before machine translation takes over.
In localization workflows there is often a localization engineer involved, not a translator. For a localization engineer, ITS can be very handy for providing information about translatability for a larger set of files. As a matter of fact, ITS was developed with a lot of input from localization engineers, and with such a batch processing scenario in mind.
>
> This is particularly clear - I hope - with geographical names.
>
> "New Orleans", for example, is not to be translated into Dutch, but
> it has to become "Nueva Orleans" in Spanish, "Nov-Orleano" in Esperanto
> and "Nouvelle Orléans" in French.
>
> On the other hand, "New York" will become "Nueva York" in Spanish and
> "Novjorko" in Esperanto, but it will stay "New York" in French and Dutch.
>
> At least the above is true if "New Orleans" is referring to the city.
> It is not to be translated at all, however, if it is referring to the
> name of a ship, or the name of a musical style, or the title of a movie
> (unless the movie itself is translated or subtitled).
>
> And "Athens" is usually to be translated if it refers to the capital
> of Greece, but is to be left alone if it refers to one of the many
> namesake cities in the U.S. - but again, I don't know all the languages
> of the world.
>
> The same is true for proper names. "George Bush" is not to be
> translated into any language I am aware of, but "Mark Anthony" has to
> become "Marcus Antonius" in Dutch and "Markos Antonios" in Greek -
> assuming they are referring to the American emperor and the Roman
> president, respectively (and yes, I know).
>
> And these are just the obvious cases. There are also word plays,
> historical references and many other things that would get lost when
> translating verbatim, and that only somebody with a profound knowledge
> of the target language (as well as the source language) can detect and
> handle.
I won’t argue that ITS is suitable for governing the process of name translation. But for the use case of specifying whether a string should be changed or left alone for a given target language, ITS provides adequate means; see the XPath expression example provided by Martin.
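As a rough illustration of how such a selector-based rule can be applied in batch, here is a minimal Python sketch. It assumes well-formed XHTML input and uses only the standard-library xml.etree.ElementTree module, which supports a limited XPath subset (hence the relative ".//*" form of Martin's selector). The function name and the sample sentence are made up for illustration; this is not how any particular ITS processor is implemented.

```python
import xml.etree.ElementTree as ET

# Relative form of the selector from the its:translateRule example.
# ElementTree only supports a subset of XPath, so "//*" becomes ".//*".
SELECTOR = ".//*[@lang='la-taxon']"


def untranslatable_elements(root):
    """Return the descendant elements matched by the selector.

    Hypothetical helper: a real ITS processor would read the selector
    from the rules file instead of hard-coding it.
    """
    return root.findall(SELECTOR)


# Made-up sample document with one span of taxonomic Latin.
page = ET.fromstring(
    "<body><p>The <i lang='la-taxon'>Quercus robur</i> "
    "is a common oak.</p></body>"
)

# Text content of every element that the rule marks as translate="no".
protected = ["".join(e.itertext()) for e in untranslatable_elements(page)]
print(protected)
```

The point is only that one rule covers every matching element in every file it is applied to, which is what makes the approach attractive for larger sets of files.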
>
>
> Now, I read in the "overview" that ITS assumes a 3-step process, i.e.
> "internationalization, translation, and localization". That's fine,
> except that, from reading further, it seems obvious that these ITS tags
> are supposed to be applied in the first phase, and that only "[d]uring
> the translation phase, the meaning of a source language text is
> analyzed, and a target language text that is equivalent in meaning is
> determined".
>
> Clearly this simplistic approach won't wash except in the most
> simplistic of circumstances.
Which covers 90+ percent of what happens in localization workflows, and those workflows are what triggered the development of ITS. ITS focuses on automation, which will never be perfect.
> Otherwise, as I said above, the "no
> translate" tag must be added during the translation step, thereby
> defeating the very purpose of enabling automated machine translation.
translate=no is used in HTML5 and in many XML formats. The people defining these formats would not have made the effort to define this piece of markup without industry pressure: pressure coming from localization engineers, asking to make their lives easier and to foster automation.
>
> It may actually make things worse, because these geographic names (and
> other things) would typically be contained in any decent (say)
> English-to-French dictionary, but that won't help if the omnilingual
> "internationaliser" has naïvely declared "New Orleans" as "off limits"
> to any and all translation engines.
>
> I do not intend to participate in further discussion of ITS, as it is
> off-topic here.
>
>> This led to the addition of the 'translate' attribute to HTML5. This
>> makes it possible to mark up the above as
>> <span translate=no>Hosszú György</span>
>> ITS allows rules to indicate translatability.
>
> It would be much more helpful if semantic tagging were to be expanded
> and standardized more broadly.
>
> For example, the NITF (News Industry Text Format) defines things like
> nitf:person, nitf:money, etc. It also has nitf:city, nitf:location and
> nitf:region (for names of places).
>
> See https://iptc.org/standards/nitf/
>
> That would take care of all the issues we have been discussing here,
> and a few more.
>
> A human translator would know these things when (s)he recognises -
> from the context - some words as being (e.g.) a proper name, but a
> computer needs help. That is where semantic tagging comes in.
In the development of ITS, a "context" data category was discussed as well. In the end, the definition of such a data category was abandoned, because everybody would define context differently. E.g., many people have a notion of a person but are not aware of nitf:person.
>
>
>> For the case we are
>> discussing,
>
> And that - taxons - is the only thing that should concern us here and now.
>
>> adding something like the following to the HTML5 <head>
>> element will indicate that text in taxonomic Latin isn't to be translated:
>>
>> <script type=application/its+xml id=ru1>
>> <its:rules version="2.0" xmlns:its="http://www.w3.org/2005/11/its"
>> xmlns:h="http://www.w3.org/1999/xhtml">
>> <its:translateRule translate="no"
>> selector='//*[@lang="la-taxon"]'/>
>> </its:rules>
>> </script>
>>
>
> That is absolutely redundant. The "lang=la-taxon" tag itself would
> tell the translation engine all it needs to know.
Well, translation engines - and localization tools - understand translate=no natively. You don’t need to teach them anything.
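To make concrete what honouring the attribute involves, here is a minimal Python sketch of a tool-side check. It assumes well-formed XHTML input and the standard-library xml.etree.ElementTree parser; the function name translatable_text and the sample sentence are made up for illustration, and real tools are of course more elaborate.

```python
import xml.etree.ElementTree as ET


def translatable_text(element, inherited=True):
    """Yield the text runs a translation tool may touch.

    Hypothetical helper. The 'translate' attribute is inherited:
    "no" switches translation off for the subtree, "yes" (or the
    empty string) switches it back on, anything else inherits.
    """
    flag = element.get("translate")
    if flag == "no":
        translate = False
    elif flag in ("yes", ""):
        translate = True
    else:
        translate = inherited

    if translate and element.text and element.text.strip():
        yield element.text.strip()

    for child in element:
        yield from translatable_text(child, translate)
        # Tail text follows the child but belongs to this element,
        # so it is governed by this element's flag.
        if translate and child.tail and child.tail.strip():
            yield child.tail.strip()


# Made-up sample sentence containing a protected personal name.
doc = ET.fromstring(
    '<p>Works by <span translate="no">Hosszú György</span> '
    "are on display.</p>"
)
runs = list(translatable_text(doc))
print(runs)
```

The name never reaches the translation step; only the surrounding text does.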
>
> Yes, I know, that needs support from the translation engine. But it
> has to handle many other, much more interesting things already, so this
> case should be a piece of cake to add support for: "if it's la-taxon,
> skip it".
I wouldn’t underestimate the effort. It took 4-5 years until there was agreement to add a "translate" attribute to HTML5.
Regards,
Felix
>
> Or maybe it does not even need extra support to be added. Say it is
> translating an English text into Hungarian, and it runs into an embedded
> "lang-fr" tag, what is it supposed to do? Not throw a fit, I hope...
> Either it pulls in a French-to-Hungarian dictionary, or, if it can't
> find one, it will not attempt to translate. And since it won't find a
> "Taxonomic Latin-to-Hungarian" dictionary, it won't translate. Voila.
>
> Luc
>