Caoimhin O Donnaile
caoimhin at smo.uhi.ac.uk
Tue Sep 20 23:16:12 CEST 2005
> >> Google's been using such a tool for years; it does not expose its
> >> tags, but allows you to search for them (for a limited list of
> >> languages).
> Currently about 35 - more than I expected. More would be better, of
> course, but all of the successful approaches to automatic language ID
> of which I am aware make use of statistical models trained from example
> texts - often substantial amounts are necessary.
> So I suspect that the set of languages that these tools will
> successfully ID will always be a (small?) fraction of what's actually
> out there. And Google will never tag Unilingua documents
> automatically. :)
Presumably as more people attach language tags to their web
pages, as I have just done, Google will have more seed texts to
get started with.
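
To make "statistical models trained from example texts" concrete, here
is a rough sketch of one common approach: character-trigram profiles
compared by cosine similarity. This is an illustration only, not a
description of how Google or any other tool actually does it. The
function names and the two tiny seed texts are made up for the example,
and real seed texts would be far longer.

```python
from collections import Counter
import math


def trigram_profile(text):
    """Count overlapping character trigrams in lowercased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(seed_texts):
    """Build one trigram profile per language from its seed text."""
    return {lang: trigram_profile(text) for lang, text in seed_texts.items()}


def identify(text, profiles):
    """Return the language whose profile is most similar to the text."""
    sample = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine(sample, profiles[lang]))


# Hypothetical toy seed texts, invented for this sketch.
SEED_TEXTS = {
    "en": "the quick brown fox jumps over the lazy dog and then "
          "the dog runs away from the house",
    "ga": "tá an madra ag rith sa pháirc agus tá an cat ina chodladh "
          "ar an mballa in aice leis an teach",
}

profiles = train(SEED_TEXTS)
```

With these profiles, identify("the dog is in the house", profiles)
picks "en" and identify("tá an cat sa teach", profiles) picks "ga".
With seed texts this small the result is fragile, which is the point
at issue: how much example text a tool needs before it is reliable.
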
However, it seems that with a good program very little seed
text is actually required. Kevin Scannell, with his program
"An Crúbadán", which trawls the Internet using the Google API,
has been very successful in corpus building for minority languages.
He has so far built corpora for about 160 languages - in most
cases the best corpus in existence for the language. He writes:
"Initially a small collection of seed texts are fed to the
crawler (a few hundred words of running text have been sufficient