Machine Translation

Sat Sep 12 18:09:07 CEST 2009

I agree.

Peter

From: felix.sasaki at googlemail.com [mailto:felix.sasaki at googlemail.com] On Behalf Of Felix Sasaki
Sent: Saturday, September 12, 2009 12:06 AM
To: Kent Karlsson
Cc: Peter Constable; ietf-languages at iana.org
Subject: Re: Machine Translation

Of course one could define several machine-translation-related extensions, but at some point the suitability of language tags is really in question. E.g. how would you represent a BLEU score as an extension? The point is that proper metadata for evaluation of machine translation soon goes beyond a simple "can be represented as a closed set of strings" pattern.

Felix
2009/9/11 Kent Karlsson <kent.karlsson14 at comhem.se>
What you say is correct for a (single) variant subtag, as initially suggested, but extension subtags
work differently. See http://tools.ietf.org/search/rfc5646#section-2.2.6. Data such as you refer
to can be put in the part that follows the extension "singleton".

Note also that section 2.2.6 starts:

"Extensions provide a mechanism for extending language tags for use in
   various applications.  They are intended to identify information that
   is commonly used in association with languages or language tags but
   that is not part of language identification.
"
        /kent k
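
To make the extension mechanism discussed above concrete, here is a minimal sketch of how the part following a singleton could be pulled out of a tag. It assumes the RFC 5646 syntax (a one-character singleton other than "x", followed by subtags of 2 to 8 alphanumeric characters). The singleton "m" and the subtags "google" and "bleu42" are purely hypothetical; no such extension is registered. Note that a score such as 0.42 cannot appear literally, since "." is not allowed in a subtag, so any numeric value would need an ad hoc encoding.

```python
def extensions(tag):
    """Return {singleton: [subtags]} for the extensions in a language tag.

    Sketch only: assumes a well-formed RFC 5646 tag and does no validation.
    """
    result = {}
    current = None
    # The first subtag is always the primary language, so skip it.
    for sub in tag.lower().split("-")[1:]:
        if sub == "x":
            # "x" starts the private-use sequence; nothing after it
            # is an extension.
            break
        if len(sub) == 1:
            # A one-character subtag is a singleton opening an extension.
            current = result.setdefault(sub, [])
        elif current is not None:
            current.append(sub)
    return result


# Hypothetical tag: German, "m" extension carrying system name and score.
print(extensions("de-m-google-bleu42"))
```

A tag without any singleton, such as "de-CH", yields an empty dictionary.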

Den 2009-09-11 18.35, skrev "Felix Sasaki" <felix.sasaki at fh-potsdam.de>:
I would agree with Yves Savourel that for translation tool developers, this kind of information is better provided via a different field. Other practical information which one could easily pack into a broad data category "machine translation" (to use Peter's terminology), but not easily into the "language tag" field, would be: name of the system that generated the translation (maybe several were used ...), quality of the input, quality rating of the system (e.g. BLEU score). IMO these fine-grained differences are necessary for making use of this kind of metadata, and I don't see a clear use case for a broad "machine translated" subtag.

Felix

2009/9/11 Kent Karlsson <kent.karlsson14 at comhem.se>

Den 2009-09-11 17.32, skrev "Peter Constable" <petercon at microsoft.com>:

> From: ietf-languages-bounces at alvestrand.no
> [mailto:ietf-languages-bounces at alvestrand.no] On Behalf Of Felix Sasaki
>
>> There is a difference in the case of XLIFF. If the extension subtag is just
>> similar, but not identical, to MT-related information in other technologies
>> like XLIFF, you will end up with a mess of *values*. This is IMO different
>> from the script subtag case: here you have the same values, but different
>> *occurrences*
>
> Expressed with different terminology: you end up with a mess of data
> categories; in the script subtag case, you have a single data category with
> many values.

I don't think that should be a major issue. XLIFF, and other formats having
separate attributes for this, could simply have that attribute take
priority, even to the extent that "language extensions", in particular one
that overlaps with an attribute, can be completely ignored in those formats.

        /kent k
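
The precedence rule suggested above could be sketched roughly as follows, assuming a format that carries its own MT attribute. The parameter `format_attribute` and the singleton "m" are hypothetical placeholders, not real XLIFF or registered names; the language-tag extension is consulted only when the format's own attribute is absent.

```python
def mt_origin(format_attribute, language_tag, singleton="m"):
    """Resolve MT metadata, letting the format's own attribute win.

    Sketch only: "m" is a hypothetical, unregistered singleton.
    """
    if format_attribute is not None:
        # The format's dedicated attribute takes priority; any
        # overlapping language extension is ignored entirely.
        return format_attribute
    values, inside = [], False
    for sub in language_tag.lower().split("-")[1:]:
        if sub == "x":
            break                      # private use: stop looking
        if len(sub) == 1:
            inside = (sub == singleton)
        elif inside:
            values.append(sub)
    return "-".join(values) or None
```

For example, `mt_origin("systemA", "de-m-systemb")` returns the attribute value and never looks at the tag, while `mt_origin(None, "de-m-systemb")` falls back to the extension.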
