Principles of Operation (was LANGUAGE SUBTAG REQUEST FORM Erzgebirgisch)

Gerard Meijssen gerard.meijssen at gmail.com
Fri Jan 25 18:45:31 CET 2008


Hoi,
A huge investment is needed to identify the language that content is
written in. The best place to identify the language and attach the correct
tag is at the source. Software is currently REALLY bad at this,
particularly for the less and least resourced languages. I fear that this
will not improve unless action is taken to improve language tagging at the
source.

It is organisations like Google, Microsoft and Yahoo that can make a
difference. If they announce that their global search engines will favour
content that is properly tagged for its language, this will create a HUGE
incentive for publishers to ensure that their tags are set correctly. It
will put pressure on software developers to make sure that recognised
linguistic entities are supported. And it will put pressure on the
standards organisations to create labels for all the missing languages,
dialects and orthographies.
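
To make the incentive concrete, here is a minimal sketch of what such a
preference could look like. Everything in it is an assumption made for
illustration: the boost factor, the function names and the sample pages are
hypothetical, and the tag check is a deliberately loose shape test, not the
full BCP 47 grammar or a lookup in the subtag registry. No search engine is
claimed to rank this way.

```python
import re

# Simplified shape check: a 2-3 letter primary language subtag followed by
# optional alphanumeric subtags of 1-8 characters. This is NOT the full
# BCP 47 grammar and does not consult the IANA subtag registry.
TAG_SHAPE = re.compile(r"[A-Za-z]{2,3}(-[A-Za-z0-9]{1,8})*")

TAG_BOOST = 1.5  # hypothetical multiplier, chosen only for illustration


def primary_subtag(tag):
    """Return the lowercased primary subtag, or None if missing or malformed."""
    if not tag or not TAG_SHAPE.fullmatch(tag):
        return None
    return tag.split("-")[0].lower()


def score(base_relevance, declared_tag, wanted_language):
    """Boost a page whose declared tag matches the language asked for."""
    declared = primary_subtag(declared_tag)
    wanted = primary_subtag(wanted_language)
    if declared is not None and declared == wanted:
        return base_relevance * TAG_BOOST
    return base_relevance


# (title, base relevance, declared language tag) -- made-up sample data
pages = [
    ("untagged page", 1.0, None),
    ("page tagged mt", 0.8, "mt"),
    ("page tagged en-GB", 0.9, "en-GB"),
]

# Rank for a searcher who asked for Maltese ("mt").
ranked = sorted(pages, key=lambda p: score(p[1], p[2], "mt"), reverse=True)
for title, base, tag in ranked:
    print(title, score(base, tag, "mt"))
```

With a rule of this kind, the correctly tagged Maltese page outranks an
untagged page that would otherwise score higher, which is exactly the
pressure on publishers described above.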

If this happened in 2008, the UNESCO year of languages, it would be a
great thing.

Thanks,
    Gerard

On Jan 25, 2008 5:30 PM, Addison Phillips <addison at yahoo-inc.com> wrote:

> Hm... well, while we're making sweeping generalizations... *English*
> typically isn't tagged, or isn't tagged correctly, today either, although
> there is evidence of improvement in this area. Hopefully developers will
> pay attention to BCP 47 and incorporate the ability to tag arbitrary
> languages into applications (I'm glad that MediaWiki will).
>
> When you say "[search engines] do not support all languages", that's
> only partially true. It is true that the major search engines do not
> have specific lexical analyzers and indexes for a vast number of
> languages. However, there is pretty good support for doing proper token
> extraction in most *scripts*. Thus your search for either Maltese or
> Macedonian text is likely to work.
>
> What you can't do today is say "search for text only in Maltese".
>
> Note that adding a language isn't trivial either. For one thing, you
> have to have a large body of well-identified, highly-representative text
> in that language (to build statistics) in order to detect the language
> in the first place.
>
> Addison
>
> --
> Addison Phillips
> Globalization Architect -- Yahoo! Inc.
> Chair -- W3C Internationalization Core WG
>
> Internationalization is an architecture.
> It is not a feature.
>
>
> Gerard Meijssen wrote:
> > Hoi,
> > Google does not support all languages. It does not support Maltese or
> > Macedonian, both official languages of European countries. As
> > applications typically only support the languages that they have been
> > localised for, they do not allow you to indicate that your language is,
> > for instance, Maltese, Lower Saxon, Piedmontese ... Consequently, texts
> > and other materials in those languages will not be tagged by their
> > applications and it will be hard to find content in those languages.
> >
> > It is for this reason that I find it vital for the languages and
> > orthographies supported by MediaWiki to have a proper code. By making
> > the code explicitly part of the package, we can get content out on the
> > Internet that is properly coded. Without it, content will be genuinely
> > hard to find.
> > Thanks,
> >      Gerard
> >
> > On Jan 25, 2008 12:02 PM, Frank Ellermann <nobody at xyzzy.claranet.de>
> > wrote:
> >
> >     David Starner wrote:
> >
> >      > Why does it matter whether it's en-caesarea or ang-caesarea
> >      > except to linguists? Those details should be hidden from
> >      > end users and are in most cases.
> >
> >     I have "en-GB", "en", and "en-us" in the language preferences
> >     of my browser, and with a "quick locale switcher" tool I can
> >     pick what I want for tests.  So far my browser, this tool, and
> >     I have never considered adding "ang" or "sxu".  I had to configure
> >     "frr" manually for tests (no effect so far, e.g. "frr" isn't
> >     in the various lists of languages supported by Google).
> >
> >     It depends on what the requester wants.  If it's for research
> >     or other "any unique tag will do" purposes a linguistically
> >     correct but otherwise obscure prefix is fine.
> >
> >     Clearly using "ang" (or "sxu") for a research project showing
> >     that this is actually wrong would be odd.  OTOH if it is meant
> >     to help all speakers of the dialect get their Web content
> >     "supported" in various ways, then using the less obscure
> >     prefixes "en" (or "de") is better.
> >
> >      Frank
> >
> >     _______________________________________________
> >     Ietf-languages mailing list
> >     Ietf-languages at alvestrand.no
> >     http://www.alvestrand.no/mailman/listinfo/ietf-languages