John D. Burger
john at mitre.org
Wed Sep 21 15:02:28 CEST 2005
Caoimhin O Donnaile wrote:
> Presumably as more people attach language tags to their web
> pages, as I have just done, Google will have more seed texts to
> get started with.
Unfortunately, many pages are incorrectly tagged. But a sufficiently
sophisticated statistical model can estimate the probability that a
tag is wrong, perhaps even for a particular web site, and reject
such documents. That is, any explicit language tags associated with a
page (both internally and via HTTP headers) should be modeled as
possibly noisy indicators of the language, just as the content of the
page is itself a noisy indicator.
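
As a rough illustration of the idea, here is a minimal Python sketch that
treats a declared tag as one more noisy observation alongside the content.
Everything in it is an assumption made for illustration: the function name
language_posterior, the tag_accuracy parameter (the chance that a site's
tags are right, which could in principle be estimated per site), the
uniform spread of wrong tags over the other languages, and the made-up
log-likelihood numbers. It is not a description of what Google or anyone
else actually does.

```python
import math


def language_posterior(content_loglik, tag, tag_accuracy, prior=None):
    """Combine content evidence with a possibly wrong language tag.

    content_loglik: dict mapping language -> log P(text | language),
                    e.g. from a character n-gram model (assumed given).
    tag:            the declared language, or None if the page has no tag.
    tag_accuracy:   assumed P(declared tag is correct), strictly in (0, 1).
    prior:          optional dict language -> P(language); uniform if omitted.
    """
    langs = list(content_loglik)
    if len(langs) < 2:
        raise ValueError("need at least two candidate languages")
    if not 0.0 < tag_accuracy < 1.0:
        raise ValueError("tag_accuracy must be strictly between 0 and 1")
    if prior is None:
        prior = {lang: 1.0 / len(langs) for lang in langs}

    scores = {}
    for lang in langs:
        score = math.log(prior[lang]) + content_loglik[lang]
        if tag is not None:
            # Noisy-channel assumption: the tag matches the true language
            # with probability tag_accuracy; otherwise it is spread
            # uniformly over the remaining candidate languages.
            if tag == lang:
                p_tag = tag_accuracy
            else:
                p_tag = (1.0 - tag_accuracy) / (len(langs) - 1)
            score += math.log(p_tag)
        scores[lang] = score

    # Normalize in log space (log-sum-exp) to avoid underflow.
    top = max(scores.values())
    log_z = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return {lang: math.exp(s - log_z) for lang, s in scores.items()}


# Made-up numbers: the content model prefers Irish ("ga"),
# but the page declares English ("en").
loglik = {"ga": -10.0, "en": -14.0}

# On a site whose tags are often wrong, the content outweighs the tag.
unreliable_site = language_posterior(loglik, tag="en", tag_accuracy=0.6)

# On a site whose tags are almost always right, the tag outweighs the content.
reliable_site = language_posterior(loglik, tag="en", tag_accuracy=0.999)
```

A page whose declared tag ends up with a low posterior could then be
rejected as seed text, which is the kind of filtering described above.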
> However, it seems that with a good program very little seed
> text is actually required. Kevin Scannell with his program
> "An Crúbadán" which trawls the Internet using the Google API:
> has been very successful in corpus building for minority languages.
This is very interesting work - thanks for the pointer.
- John Burger
More information about the Ietf-languages mailing list