Machine Translation

Fri Sep 11 04:23:28 CEST 2009

Michael Everson <everson at evertype dot com> wrote:

> There is no place for a "machine"; this is an authorship tag, not 
> relevant to language identification.

That's exactly why John's proposal to make it an extension makes sense. 
It conveys information that may be useful in a language tag, may be 
relevant to the tagged content, but has nothing to do with language per 
se.

Mark Davis <mark at macchiato dot com> wrote:

> I think it is cleaner, simpler, and much more likely to be used if we 
> just have an additional variant tag, like "mactrans".

and Peter Constable <petercon at microsoft dot com> wrote:

> But what are the use scenarios? If the key scenarios are simply 
> providing an indicator to processes that might want to filter out MT 
> content, then an extension with all its additional machinery is 
> overkill; a single subtag "machxlat" is certainly sufficient.

I think our goal in deciding on a particular tagging mechanism should be 
to choose the mechanism that fits best, not the one that is easiest for 
us to implement, or the one we guess end users are more likely to adopt.  
For example, we added script subtags because we thought they 
would be more appropriate for representing scripts than region subtags 
(cf. "zh-TW" vs. "zh-CN").  We didn't scuttle the idea because users 
might have a hard time using script subtags, or because they would be 
overkill since most languages are not written in multiple scripts.

I thought Debbie's colleague(s) expressed the use case rather clearly. 
They want to be able to distinguish between Welsh written by a human and 
Welsh generated by a machine, and filter out the latter for one reason 
or another.  "Written by a human" and "generated by a machine" are not 
attributes of Welsh per se, and not even really attributes of the text, 
which is what I would expect a variant to represent.  They are 
meta-information.

Section 2.2.6, "Extension Subtags," says:

"Extensions provide a mechanism for extending language tags for use in 
various applications.  They are intended to identify information that is 
commonly used in association with languages or language tags but that is 
not part of language identification."

If this use case doesn't fit that description, then the goal is either 
to not encode "machine-translated" at all, or to make sure the extension 
mechanism is not used for anything, ever.
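
To make the difference between the two mechanisms concrete, here is a 
rough Python sketch of the filtering use case under each proposal.  
Everything specific in it is an assumption made for illustration: the 
singleton "m" is hypothetical (no such extension is registered), 
"mactrans" is only the variant name floated in this thread, and the 
parser is a simplification that ignores grandfathered tags and does no 
validation.

```python
# Hypothetical markers, for illustration only; none of these are registered.
MT_VARIANT = "mactrans"      # variant-style marker proposed in this thread
MT_SINGLETON = "m"           # assumed extension singleton
MT_EXT_SUBTAG = "mactrans"   # assumed subtag carried by that extension


def split_tag(tag):
    """Split a tag into (language-identifying subtags, extensions).

    Simplified: a one-character subtag after the first position starts an
    extension, and "x" starts private use, which runs to the end of the tag.
    """
    subtags = tag.lower().split("-")
    base, extensions = [], {}
    current = None
    for i, subtag in enumerate(subtags):
        if len(subtag) == 1 and i > 0:
            if subtag == "x":
                extensions["x"] = subtags[i + 1:]
                break
            current = extensions.setdefault(subtag, [])
        elif current is not None:
            current.append(subtag)
        else:
            base.append(subtag)
    return base, extensions


def is_machine_translated(tag):
    """True if the tag carries the MT marker under either proposal."""
    base, extensions = split_tag(tag)
    as_variant = MT_VARIANT in base[1:]
    as_extension = MT_EXT_SUBTAG in extensions.get(MT_SINGLETON, [])
    return as_variant or as_extension


def human_only(tags):
    """Keep only the tags that are not marked as machine-translated."""
    return [t for t in tags if not is_machine_translated(t)]
```

With the variant approach the marker sits among the subtags that 
identify the language ("cy-mactrans"); with the extension approach it 
sits after the singleton ("cy-GB-m-mactrans"), so a process that only 
cares about language identification can stop reading at the singleton.  
Calling human_only(["cy", "cy-mactrans", "cy-GB-m-mactrans", "cy-GB"]) 
keeps "cy" and "cy-GB".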

--
Doug Ewell  |  Thornton, Colorado, USA  |  http://www.ewellic.org
RFC 5645, 4645, UTN #14  |  ietf-languages @ http://is.gd/2kf0s