Caoimhin O Donnaile caoimhin at smo.uhi.ac.uk
Tue Sep 20 23:16:12 CEST 2005

> >> Google's been using such a tool for years; it does not expose its 
> >> tags, but allows you to search for them (for a limited list of 
> >> languages).
> Currently about 35 - more than I expected.  More would be better, of 
> course, but all of the successful approaches to automatic language ID 
> of which I am aware make use of statistical models trained from example 
> texts - often substantial amounts are necessary.
> So I suspect that the set of languages that these tools will 
> successfully ID will always be a (small?) fraction of what's actually 
> out there.  And Google will never tag Unilingua documents 
> automatically. :)

Presumably, as more people attach language tags to their web
pages, as I have just done, Google will have more seed texts to
get started with.

However, it seems that with a good program very little seed
text is actually required.  Kevin Scannell, with his program
"An Crúbadán", which trawls the Internet using the Google API,
has been very successful in corpus building for minority languages.
He has so far built corpuses for about 160 languages - in most
cases the best corpus in existence for the language.  He writes:
"Initially a small collection of seed texts are fed to the
crawler (a few hundred words of running text have been sufficient
in practice)."
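
To illustrate why a few hundred words might be enough, here is a
minimal sketch of one common statistical approach: comparing
character-trigram frequency profiles. This is not An Crúbadán's
actual method, which I have not seen described in detail. The
function names, the two toy seed sentences and the cosine scoring
are all my own assumptions, and real seed texts would be much longer.

```python
from collections import Counter
import math


def trigram_profile(text):
    """Return relative frequencies of character trigrams in text."""
    # Lowercase, collapse whitespace, and pad so that word boundaries count.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(w * q.get(g, 0.0) for g, w in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def identify(text, profiles):
    """Return the language code whose seed profile is most similar to text."""
    sample = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine(sample, profiles[lang]))


# Hypothetical toy seed texts, far shorter than a real seed collection.
SEEDS = {
    "en": ("the quick brown fox jumps over the lazy dog and then the dog "
           "sleeps in the sun while the children are playing with their friends"),
    "ga": ("tá an aimsir go maith inniu agus beidh mé ag dul go dtí an siopa "
           "leis an mbean atá ina cónaí in aice liom"),
}
PROFILES = {lang: trigram_profile(text) for lang, text in SEEDS.items()}

if __name__ == "__main__":
    print(identify("the children are in the shop with the dog", PROFILES))
    print(identify("beidh mé ag dul go dtí an siopa", PROFILES))
```

A crawler could use a check like identify() to decide which
fetched pages to keep and add to the corpus, so that the profile
improves as the corpus grows.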


More information about the Ietf-languages mailing list