WiktionaryZ language codes
Gerard Meijssen
gerardm at wiktionaryz.org
Mon Nov 13 13:55:37 CET 2006
Hoi,
Yesterday I posted, very much with the Wikimedia Foundation in mind,
about issues to do with the use of language codes at the Wikimedia
Foundation. I am still digesting the results so far, and I am very
grateful for the many reactions. There are other issues, but these will
be for another e-mail.
This e-mail is only about the needs of the WiktionaryZ project.
FYI: the WiktionaryZ project is currently not a Wikimedia Foundation project.
==WiktionaryZ project==
Introduction
The WiktionaryZ project is a project that came into being out of
frustration with the Wiktionary projects. Currently there are 170
Wiktionary projects and all aim to include all words of all languages.
WiktionaryZ aims to be able to include all information of all
Wiktionaries and make this information available to people of all
languages. Consequently, it can include lexicological, terminological
and ontological information. The first parts of the user interface are
now shown in the language chosen by the user, or in English if we do not
have the content in their language; otherwise it is pot luck what is
provided. The language names will shortly be localised in the UI
language; the code has been written, it just needs to be committed to
the live version. I had the privilege of showing the functional data
design to professors Alan K. Melby and Sue Ellen Wright in December
2005; they questioned me during two evenings and then told me they liked
what they had seen.
Our content
Our first content was done by importing the EIONET GEMET thesaurus. A
new and growing community has been adding content and working on the
existing content. We are now on the verge of being able to
import other thesauri. It has taken us some nine months to grow from one
initial thesaurus to the point where we can have multiple thesauri or
collections in WiktionaryZ and not create utter chaos. The next thesauri
will be truly big. NB I write this to help you appreciate that this is a
serious project.
Standards
WiktionaryZ uses the ISO-639-3 list to indicate languages. For each
language we have, or will be building, a portal. We start adding
content for a language when it is requested. A record indicating a
language, dialect or orthography is added to a table and it includes the
Wikimedia Foundation code and the ISO-639-3 code. We do distinguish
American and British English from English and use codes like eng-US. It
is obvious that we can add a column in this language table and have your
code in there as well.
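
To make the table idea concrete, here is a minimal sketch in Python. It is not the actual WiktionaryZ schema: the class name, the field names and the sample rows are assumptions for illustration only. It shows one record per language, dialect or orthography, with the Wikimedia Foundation code, the ISO-639-3 code and an extra column for an IANA subtag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LanguageRecord:
    """One row of a hypothetical language table (all names are assumptions)."""
    name: str
    iso639_3: str                      # ISO-639-3 code, e.g. "eng"
    wmf_code: Optional[str] = None     # Wikimedia Foundation code, if there is one
    variant: Optional[str] = None      # suffix for a dialect or orthography, e.g. "US"
    iana_subtag: Optional[str] = None  # the additional column suggested above

    @property
    def code(self) -> str:
        """Code used for tagging: the ISO-639-3 code plus an optional variant."""
        if self.variant:
            return f"{self.iso639_3}-{self.variant}"
        return self.iso639_3


# Illustrative rows only; the real table content is not shown in this mail.
records = [
    LanguageRecord("English", "eng", wmf_code="en", iana_subtag="en"),
    LanguageRecord("English (United States)", "eng", wmf_code="en", variant="US"),
    LanguageRecord("Navajo", "nav", wmf_code="nv", iana_subtag="nv"),
]

# Index by tagging code so that a lookup for an unlisted code simply fails.
by_code = {record.code: record for record in records}
```

A lookup such as `by_code["eng-US"]` returns the American English record, while a code that has no row, such as `zho`, is simply absent from the index.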
The ISO-639-3 code list is the biggest list of languages that have some
form of recognition. It is therefore our best option to adopt this for
our initial list of languages. There are issues with several of these
codes, but it allows for an operational start. It also has in Ethnologue
a maintainer that has earned its reputation as an organisation with a
strong interest in linguistics. A "simple" single code allows people to find
their language. It also helps us to force people to understand that
WiktionaryZ does not have zh or zho and that we consequently do not
accept words to be registered in that way.
The IANA language subtag registry is not a list that would work for us
as is, because it is based solely on ISO-639-1 and ISO-639-2 and from
our perspective these just do not cut it. The registry is very much
incomplete and consequently not as usable as ISO-639-3. The format the
registry is in does not help either: it is not ready for use in
databases. This is another reason that prevents the adoption of this
list as a resource.
Right, having "insulted" you all, let me bring the good news. As we are
creating the Expressions for the ISO-639-3 languages, we tag them with
their code as part of the ISO-639-3 collection. We are quite happy to
have an "IANA language subtag" collection whereby we advertise these
codes as well. Ideally every ISO-639-3 tagged language will have an IANA
language subtag. We do support ISO-15924. As WiktionaryZ extends
MediaWiki, we do support Unicode and at some stage we hope to support
the CLDR in our sorting.
In many of the answers to the Wikimedia Foundation mail, people referred
to things like ISO-639-5 and ISO-639-6. WiktionaryZ is able to have
relations between its "DefinedMeanings". I understand that ISO-639-6 has
relations; we can include these relations in WiktionaryZ. It can be an
unofficial playground for the content of this standard. It will,
however, be put to practical use in WiktionaryZ. Remember, WiktionaryZ
aims to have all words of all languages. Labelling words correctly is
essential to make WiktionaryZ useful. There are two aspects to this:
recognition of a specific orthography or dialect, and having a good name
for it. Another aspect of orthography is that there are different types
of orthography: official, constructed and used orthographies.
Identifying a text correctly helps us in making it part of a corpus for
such a resource.
What motivates us to go this route of connecting to the standards may be
found in the definition we have of success for our project: "Success is
when people find a use for our data that we did not think of ourselves".
By ensuring that we link to what is considered to be the/a standard we
make it easier for WiktionaryZ to be successful.
There are a few practical questions; let me quote from a request for the
addition of languages that can be edited in:
* mrc - Maricopa language. It has two main dialects, one in Salt
River and one in Gila River. Both use the same alphabet.
* ood - O'Odham language. It has three main dialects, Tohono
(Desert), Akimel (River), and Hia C-ed (Sand). Language has two
different alphabets in use -- one orthography, Saxton, is used
nowadays just for Akimel in Gila River (and in older literature
from other dialects). Alvarez-Hale orthography is used everywhere
else for every dialect (including Akimel in Salt River), also by
the most people.
* apw - Western Apache language. It has 3 dialects, White Mountain,
San Carlos, and Tonto. All use the same alphabet.
* nav - Navajo language. Very little dialect variation. Just one
alphabet.
At this moment we have only accepted the Navajo language as a language
to edit in. For the others (including dialects) we would like to know if
they have been recognised and, if not, what it takes to get what you
would consider appropriate codes. I am not sure which ISO standard
would deal with dialects.
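
To show what a label would have to carry for the request quoted above, here is a small Python sketch. It is only an illustration: the class, its fields and the label format are hypothetical, and none of it is a proposal for actual codes. The dialect and orthography names are taken from the request itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VarietyTag:
    """Hypothetical label for a text or word: language, dialect, orthography."""
    language: str                      # ISO-639-3 code, e.g. "ood"
    dialect: Optional[str] = None      # e.g. "Akimel"; None if not distinguished
    orthography: Optional[str] = None  # e.g. "Saxton"; None if not distinguished

    def label(self) -> str:
        """Readable label; parts that are not set are left out."""
        parts = [self.language]
        if self.dialect:
            parts.append(f"dialect={self.dialect}")
        if self.orthography:
            parts.append(f"orthography={self.orthography}")
        return "; ".join(parts)


# Examples based on the request quoted above.
tags = [
    VarietyTag("ood", dialect="Akimel", orthography="Saxton"),
    VarietyTag("ood", dialect="Tohono", orthography="Alvarez-Hale"),
    VarietyTag("nav"),  # very little dialect variation, one alphabet
]
```

The point of the sketch is that a bare language code such as `ood` cannot tell the first two records apart; the dialect and the orthography each need a recognised identifier of their own.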
Many languages, like Dutch, have regular changes to the official
orthography. For the Dutch language this happens every ten years. It is
imho important to be able to tag text and words correctly with the
orthography used. Only indicating something as Dutch is minimalistic and
prevents the use of content for other purposes. It is not clear to me
how you deal with this. I would not be surprised if this needs to be
part of a standard too.
Thanks,
GerardM
Sources:
http://wiktionaryz.org
http://wiktionary.org
http://wiktionaryz.org/Category:Language_portals
http://www.eionet.europa.eu/gemet
http://www.iana.org/assignments/language-subtag-registry
http://wiktionaryz.org/DefinedMeaning
More information about the Ietf-languages
mailing list