Principles of Operation (was LANGUAGE SUBTAG REQUEST FORM Erzgebirgisch)

Addison Phillips addison at yahoo-inc.com
Fri Jan 25 17:30:14 CET 2008


Hm... well, while we're making sweeping generalizations... *English* 
typically isn't tagged, or isn't tagged correctly, today either, although 
there is evidence of improvement in this area. Hopefully developers will 
pay attention to BCP 47 and incorporate the ability to tag arbitrary 
languages into applications (I'm glad that MediaWiki will).
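
To illustrate what "the ability to tag arbitrary languages" might look 
like in an application, here is a minimal Python sketch. It accepts any 
well-formed tag instead of picking from a fixed list of localized 
languages. It assumes a simplified subset of the BCP 47 syntax 
(language, optional script, optional region, variants) and ignores 
extended language subtags, extensions, private use, and grandfathered 
tags. The names SIMPLE_TAG and parse_tag are made up for this example, 
and the check covers well-formedness only, not whether a subtag is 
actually registered.

```python
import re

# Simplified BCP 47 shape: language[-script][-region][-variant...]
# (assumption: extlang, extensions, private use and grandfathered tags are omitted)
SIMPLE_TAG = re.compile(
    r"^(?P<language>[a-z]{2,3})"
    r"(?:-(?P<script>[a-z]{4}))?"
    r"(?:-(?P<region>[a-z]{2}|[0-9]{3}))?"
    r"(?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)$",
    re.IGNORECASE,
)

def parse_tag(tag):
    """Return the subtags of a well-formed simple tag, or None if malformed."""
    match = SIMPLE_TAG.match(tag)
    if match is None:
        return None
    parts = match.groupdict()
    # The variants group looks like "-abcde-1994"; turn it into a list.
    parts["variants"] = [v for v in parts["variants"].split("-") if v]
    return parts

for candidate in ["en-GB", "frr", "mt", "en-caesarea", "en_US"]:
    print(candidate, parse_tag(candidate))
```

The point of the sketch is the design choice: the application validates 
the shape of the tag and stores it, rather than refusing languages it 
has not been localized for.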

When you say "[search engines] do not support all languages", that's 
only partially true. It is true that the major search engines do not 
have specific lexical analyzers and indexes for a vast number of 
languages. However, there is pretty good support for doing proper token 
extraction in most *scripts*. Thus your search for either Maltese or 
Macedonian text is likely to work.

What you can't do today is say "search for text only in Maltese".
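
A rough Python sketch of that distinction follows. It relies on 
Python's Unicode-aware regular expressions for token extraction and on 
a heuristic (the first word of a character's Unicode name) to guess the 
script. The helper names and the two sample strings are illustrative 
assumptions, not anything a search engine is known to use.

```python
import re
import unicodedata

def tokens(text):
    """Split text into word tokens using Unicode-aware word-character matching."""
    return re.findall(r"\w+", text)

def script_of(token):
    """Guess a token's script from the Unicode name of its first character."""
    # Heuristic: names look like "LATIN SMALL LETTER G" or "CYRILLIC SMALL LETTER DE".
    return unicodedata.name(token[0]).split()[0]

# Illustrative sample strings (assumed for this example).
samples = {
    "Maltese": "Malta hija gżira",
    "Macedonian": "Македонија е држава",
}

for label, text in samples.items():
    words = tokens(text)
    scripts = {script_of(word) for word in words}
    print(label, words, scripts)
```

Tokenizing works for both samples without any language-specific code, 
but the only thing recovered is the script (Latin or Cyrillic). Nothing 
here says that the first string is Maltese rather than any other 
Latin-script language.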

Note that adding a language isn't trivial either. For one thing, you 
have to have a large body of well-identified, highly-representative text 
in that language (to build statistics) in order to detect the language 
in the first place.
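
To make the "build statistics" point concrete, here is a toy Python 
sketch of one common family of approaches: comparing character trigram 
counts against per-language profiles. Everything in it (the function 
names, the cosine similarity measure, and the one-sentence training 
texts) is an assumption for illustration. A real detector would need 
the large, well-identified corpus described above.

```python
from collections import Counter

def trigram_profile(text):
    """Count overlapping character trigrams in lowercased, space-normalized text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def similarity(profile_a, profile_b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * profile_b[gram] for gram, count in profile_a.items())
    norm_a = sum(c * c for c in profile_a.values()) ** 0.5
    norm_b = sum(c * c for c in profile_b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def detect(text, profiles):
    """Return the tag whose profile is most similar to the text's trigrams."""
    sample = trigram_profile(text)
    return max(profiles, key=lambda tag: similarity(sample, profiles[tag]))

# Toy training data (assumption): one sentence per language is far too little
# for real use, which is exactly why a large body of text is needed.
training = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze sitzt auf der matte",
}
profiles = {tag: trigram_profile(text) for tag, text in training.items()}

print(detect("the dog and the cat", profiles))
print(detect("der hund und die katze", profiles))
```

A language with no profile can never be returned by detect, which is 
the practical obstacle to supporting a new language.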

Addison

-- 
Addison Phillips
Globalization Architect -- Yahoo! Inc.
Chair -- W3C Internationalization Core WG

Internationalization is an architecture.
It is not a feature.


Gerard Meijssen wrote:
> Hoi,
> Google does not support all languages. It does not support Maltese or 
> Macedonian, both official languages of European countries. As 
> applications typically only support the languages that they have been 
> localised for, they do not allow you to indicate that your language is, 
> for instance, Maltese, Lower Saxon, or Piedmontese. Consequently, texts 
> and materials in those languages will not be tagged by these applications, 
> and it will be hard to find content in those languages.
> 
> It is for this reason that I find it vital for the languages and 
> orthographies supported by MediaWiki to have a proper code. By making 
> the code explicitly part of the package, we can get content out on the 
> Internet that is properly coded. Without it, content will be genuinely 
> hard to find.
> Thanks,
>      Gerard
> 
> On Jan 25, 2008 12:02 PM, Frank Ellermann <nobody at xyzzy.claranet.de> wrote:
> 
>     David Starner wrote:
> 
>      > Why does it matter whether it's en-caesarea or ang-caesarea
>      > except to linguists? Those details should be hidden from
>      > end users and are in most cases.
> 
>     I have "en-GB", "en", and "en-us" in the language preferences
>     of my browser, and with a "quick locale switcher" tool I can
>     pick what I want for tests.  So far my browser, this tool, and
>     I have never considered adding "ang" or "sxu".  I had to configure
>     "frr" manually for tests (no effect so far, e.g. "frr" isn't
>     in the various lists of languages supported by Google).
> 
>     It depends on what the requester wants.  If it's for research
>     or other "any unique tag will do" purposes a linguistically
>     correct but otherwise obscure prefix is fine.
> 
>     Clearly using "ang" (or "sxu") for a research project showing
>     that this is actually wrong would be odd.  OTOH if it is meant
>     to help all speakers of the dialect get their Web content
>     "supported" in various ways, then using the less obscure prefixes
>     "en" (or "de") is better.
> 
>      Frank
> 
>     _______________________________________________
>     Ietf-languages mailing list
>     Ietf-languages at alvestrand.no
>     http://www.alvestrand.no/mailman/listinfo/ietf-languages
> 



