Unilingua

Wed Sep 21 15:02:28 CEST 2005

Caoimhin O Donnaile wrote:

> Presumably as more people attach language tags to their web
> pages, as I have just done, Google will have more seed texts to
> get started with.

Unfortunately, many pages are incorrectly tagged.  But a sufficiently
sophisticated statistical model can estimate the probability of
mis-tagging, perhaps even per web site, and reject documents whose
tags are likely wrong.  That is, any explicit language tags associated
with a page (both internally and via HTTP headers) should be modeled as
possibly noisy indicators of the language, just as the content of the
page is itself a noisy indicator.
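
To make the "noisy indicator" idea concrete, here is a minimal sketch of
one way to combine the two sources of evidence with Bayes' rule.  It is
only an illustration, not a description of what Google or any other
system actually does.  The function name language_posterior, the
tag_accuracy parameter, and the example log-likelihoods are all
hypothetical; the content scores are assumed to come from some existing
per-language model (character n-grams, say), and tag errors are assumed
to be spread uniformly over the other candidate languages.

```python
import math


def language_posterior(content_loglik, tag=None, tag_accuracy=0.8, prior=None):
    """Combine content evidence and a declared language tag.

    content_loglik: dict mapping language -> log P(text | language),
        assumed to come from some separate content model.
    tag: the declared language (HTML lang, HTTP header), or None.
    tag_accuracy: assumed P(tag is correct); could be estimated per site.
    prior: optional dict language -> P(language); uniform if omitted.
    """
    langs = list(content_loglik)
    n = len(langs)
    if n < 2:
        raise ValueError("need at least two candidate languages")
    if not 0.0 < tag_accuracy < 1.0:
        raise ValueError("tag_accuracy must be strictly between 0 and 1")
    if prior is None:
        prior = {lang: 1.0 / n for lang in langs}

    scores = {}
    for lang in langs:
        if tag is None:
            tag_lik = 1.0  # no tag present: it contributes no evidence
        elif tag == lang:
            tag_lik = tag_accuracy
        else:
            # Wrong tags are assumed equally likely to name any other language.
            tag_lik = (1.0 - tag_accuracy) / (n - 1)
        scores[lang] = (
            math.log(prior[lang]) + content_loglik[lang] + math.log(tag_lik)
        )

    # Normalize in log space to avoid underflow on long documents.
    top = max(scores.values())
    total = sum(math.exp(s - top) for s in scores.values())
    return {lang: math.exp(s - top) / total for lang, s in scores.items()}


# Content strongly favors Irish ("ga") although the page is tagged "en".
mismatch = language_posterior({"ga": -100.0, "en": -120.0}, tag="en")

# Content is uninformative, so the tag decides at its assumed accuracy.
tie = language_posterior({"ga": -100.0, "en": -100.0}, tag="en")
```

With these made-up numbers the first page comes out as almost certainly
Irish despite its "en" tag, so it could be set aside as mis-tagged.  A
per-site tag_accuracy would let the model trust tags from sites that
have proved reliable and discount them elsewhere.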

> However, it seems that with a good program very little seed
> text is actually required.  Kevin Scannell with his program
> "An Crúbadán" which trawls the Internet using the Google API:
>
>    http://borel.slu.edu/crubadan/
>
> has been very successful in corpus building for minority languages.

This is very interesting work - thanks for the pointer.

- John Burger
   MITRE