Hi -

> From: "Mark Davis" <mark.davis at icu-project.org>
> To: "Randy Presuhn" <randy_presuhn at mindspring.com>
> Cc: <ietf-languages at alvestrand.no>; <ltru at lists.ietf.org>
> Sent: Friday, April 13, 2007 1:04 PM
> Subject: Re: [Ltru] Re: "mis" update review request
> Another scenario is where you have incoming content, and you need to tag it
> for use by other components. This might be done, for example, in a search
> engine, where you fetch and process a page, and use that information later
> in doing searches. The tag serves to communicate language between the
> different components.
> In that case, you have far from perfect information about the content: what
> you have being typically the result of some level of statistical analysis, plus
> other factors about the document. You need to tag with as much information
> as you have, *but no more*. It is in that case where you need to have the
> tags that indicate some level of imperfect knowledge about the source, such
> as "I have no idea what this is", or "It looks like linguistic content, but
> I don't know which language", or "it doesn't look like linguistic content".
> (You may also have more detailed knowledge, like that some document appears
> to have 70% English content (probability 95%) and 20% French content
> (probability 65%)).

I agree that groupings are useful, and have argued for them in the past.
I'm not going to argue that we should exclude "mis", but I doubt that
it would be terribly useful.

I find it very hard to believe that a reasonable analysis (whether done
by human or machine) would classify a text as being "mis" without being
able to recognize which of the languages in that grouping the text belonged
to.  I can believe someone could look at text and say "it's a Slavic language,
but I'm not sure which one."  Do we really think someone or something would
look at some text and say "it's Ainu, Andamanese, or Etruscan, but I can't
tell which, so I'll tag it 'mis'"?


