WiktionaryZ language codes

Mon Nov 13 13:55:37 CET 2006

Hoi,
Yesterday I posted, very much with the Wikimedia Foundation in mind, 
about issues to do with the use of language codes there. I am still 
digesting the results and I am very grateful for the many reactions. 
There are other issues, but these will be for another e-mail. 
This e-mail is only about the needs of the WiktionaryZ project.

FYI The WiktionaryZ project is currently not a Wikimedia Foundation project.

==WiktionaryZ project==

Introduction
The WiktionaryZ project is a project that came into being out of 
frustration with the Wiktionary projects. Currently there are 170 
Wiktionary projects and all aim to include all words of all languages. 
WiktionaryZ aims to be able to include all information of all 
Wiktionaries and make this information available to people of all 
languages. Consequently, it can include lexicological, terminological 
and ontological information. The first parts of the user interface are 
now shown in the language chosen by the user, or in English if we do 
not have the content in their language; otherwise it is pot luck what 
is provided. The language names will be localised in the UI language 
shortly; the code has been written, it just needs to be committed to 
the live version. In December 2005 I had the privilege of showing the 
functional data design to professors Alan K. Melby and Sue Ellen 
Wright; they questioned me during two evenings and then told me they 
liked what they had seen.

Our content
Our first content was done by importing the EIONET GEMET thesaurus. A 
new and growing community has been adding content and working on the 
existing content. We are now on the verge of being able to 
import other thesauri. It has taken us some nine months to grow from one 
initial thesaurus to the point where we can have multiple thesauri or 
collections in WiktionaryZ and not create utter chaos. The next thesauri 
will be truly big. NB I write this to help you appreciate that this is a 
serious project.

Standards
WiktionaryZ uses the ISO-639-3 list to indicate languages. For each 
language we have, or will be building, a portal. We start adding 
content for a language when it is requested. A record indicating a 
language, dialect or orthography is added to a table and it includes the 
Wikimedia Foundation code and the ISO-639-3 code. We do distinguish 
American and British English from English and use codes like eng-US. 
We can obviously add a column to this language table to hold your 
code as well.
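
To make the table idea concrete, here is a minimal sketch in Python with 
sqlite3. It is not the actual WiktionaryZ schema: the table name and the 
column names (wz_code, wmf_code, iso639_3, iana_tag) are assumptions made 
up for this example. It only shows that a record can carry several codes 
side by side and that one more column can be added later.

```python
import sqlite3


def build_language_table():
    """Create a hypothetical language table in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE language (
            wz_code  TEXT PRIMARY KEY,  -- code used inside the project, e.g. eng-US
            name     TEXT NOT NULL,
            wmf_code TEXT,              -- Wikimedia Foundation code, may be missing
            iso639_3 TEXT NOT NULL
        )
        """
    )
    conn.executemany(
        "INSERT INTO language VALUES (?, ?, ?, ?)",
        [
            ("eng", "English", "en", "eng"),
            ("eng-US", "English (United States)", "en", "eng"),
            ("nav", "Navajo", "nv", "nav"),
        ],
    )
    # Adding one more code later is a single extra column.
    conn.execute("ALTER TABLE language ADD COLUMN iana_tag TEXT")
    conn.executemany(
        "UPDATE language SET iana_tag = ? WHERE wz_code = ?",
        [("en", "eng"), ("en-US", "eng-US"), ("nv", "nav")],
    )
    conn.commit()
    return conn


def lookup(conn, wz_code):
    """Return (name, wmf_code, iso639_3, iana_tag), or None if unknown."""
    return conn.execute(
        "SELECT name, wmf_code, iso639_3, iana_tag "
        "FROM language WHERE wz_code = ?",
        (wz_code,),
    ).fetchone()
```

A lookup for a code that has no record simply returns None, so a word 
cannot be registered under a code that is not in the table.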

The ISO-639-3 code set is the biggest list of languages that have some form 
of recognition. It is therefore our best option to adopt this for our 
initial list of languages. There are issues with several of these codes 
but it allows for an operational start. It also has in Ethnologue a 
maintainer that earned its reputation as an organisation with a strong 
interest in linguistics. A "simple" single code allows people to find 
their language. It also helps us to force people to understand that 
WiktionaryZ does not have zh or zho and that we consequently do not 
accept words to be registered in that way.

The IANA language subtag registry is not a list that would work for us 
as is, because it is based solely on ISO-639-1 and ISO-639-2 and from 
our perspective these just do not cut it. The registry is very much 
incomplete and consequently it is not as usable as ISO-639-3. The format 
the registry is in does not help either: it is not ready for use in 
databases. This is another reason not to adopt this list as a 
resource.
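
To illustrate the format point: the registry is a text file of records 
separated by "%%" lines, each record made of "Field: value" lines, with 
long values folded onto indented continuation lines. The sketch below, in 
Python, turns such text into flat rows that could be loaded into a table. 
The function names are my own, and the sample is hand-written in the 
general shape of the registry, not copied from it.

```python
SAMPLE = """\
File-Date: 2006-01-01
%%
Type: language
Subtag: nv
Description: Navajo
Description: Navaho
%%
Type: language
Subtag: nl
Description: Dutch
Description: Flemish
"""


def parse_registry(text):
    """Split registry text into records: dicts of field name -> list of values."""
    records, fields, last_key = [], {}, None
    for line in text.splitlines():
        if line.strip() == "%%":
            # Record separator: store what we have and start a new record.
            if fields:
                records.append(fields)
            fields, last_key = {}, None
        elif line[:1] in (" ", "\t") and last_key is not None:
            # Folded line: continues the previous field's last value.
            fields[last_key][-1] += " " + line.strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            last_key = key.strip()
            # A field such as Description may repeat, so keep a list.
            fields.setdefault(last_key, []).append(value.strip())
    if fields:
        records.append(fields)
    return records


def to_rows(records):
    """Flatten records into (type, subtag, descriptions) tuples."""
    rows = []
    for rec in records:
        if "Subtag" not in rec:
            continue  # skips the file header and records without a Subtag
        descriptions = "; ".join(rec.get("Description", []))
        rows.append((rec["Type"][0], rec["Subtag"][0], descriptions))
    return rows
```

Calling to_rows(parse_registry(SAMPLE)) gives one tuple per subtag. The 
repeated and folded fields are what make the file awkward to load 
directly.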

Right, having "insulted" you all, let me bring the good news. As we are 
creating the Expressions for the ISO-639-3 languages, we tag them with 
their code as part of the ISO-639-3 collection. We are quite happy to 
have an "IANA language subtag" collection whereby we advertise these 
codes as well. Ideally every ISO-639-3 tagged language will have an IANA 
language subtag. We do support ISO-15924. As WiktionaryZ extends 
MediaWiki, we do support Unicode and at some stage we hope to support 
the CLDR in our sorting.

In many of the answers to the Wikimedia Foundation mail, people referred 
to things like ISO-639-5 and 6. WiktionaryZ is able to have relations 
between its "DefinedMeanings". I understand that ISO-639-6 has 
relations; we can include these relations in WiktionaryZ. It can be an 
unofficial playground for the content of this standard. In WiktionaryZ, 
however, it will be put to practical use. Remember, WiktionaryZ aims to 
have all words of all languages. Labelling words correctly is essential 
to make WiktionaryZ useful. There are two aspects to this: recognition 
of a specific orthography or dialect, and having a good name for it. 
Another aspect of orthography is that there are different types: 
official, constructed and used orthographies. Identifying texts 
correctly helps us make them part of a corpus for such a 
resource.

What motivates us to go this route of connecting to the standards may be 
found in the definition we have of success for our project: "Success is 
when people find a use for our data that we did not think of ourselves". 
By ensuring that we link to what is considered to be the/a standard we 
make it easier for WiktionaryZ to be successful.

There are a few practical questions. Let me quote from a request for the 
addition of languages that can be edited in:

    * mrc - Maricopa language. It has two main dialects, one in Salt
      River and one in Gila River. Both use the same alphabet.
    * ood - O'Odham language. It has three main dialects, Tohono
      (Desert), Akimel (River), and Hia C-ed (Sand). Language has two
      different alphabets in use -- one orthography, Saxton, is used
      nowadays just for Akimel in Gila River (and in older literature
      from other dialects). Alvarez-Hale orthography is used everywhere
      else for every dialect (including Akimel in Salt River), also by
      the most people.
    * apw - Western Apache language. It has 3 dialects, White Mountain,
      San Carlos, and Tonto. All use the same alphabet.
    * nav - Navajo language. Very little dialect variation. Just one
      alphabet.

At this moment we have only accepted the Navajo language as a language 
to edit in. For the others (including dialects) we would like to know if 
they have been recognised and, if not, what it takes to get what you 
would consider appropriate codes. I am not sure which ISO standard, if 
any, would deal with dialects.
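
The O'Odham entry in the request above shows why a bare language code is 
not enough for us: the same language needs a dialect and an orthography 
next to it. As a rough sketch in Python of what such a label could look 
like (the class and field names are assumptions for illustration, not 
our data design, and the spellings are placeholders, not real words):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Label:
    language: str                      # ISO-639-3 code, e.g. "ood"
    dialect: Optional[str] = None      # e.g. "Akimel"
    orthography: Optional[str] = None  # e.g. "Saxton"

    def matches(self, other):
        """True if every field set on self has the same value on other."""
        return (
            self.language == other.language
            and self.dialect in (None, other.dialect)
            and self.orthography in (None, other.orthography)
        )


@dataclass(frozen=True)
class Expression:
    spelling: str
    label: Label


def select(expressions, wanted):
    """Return the expressions whose label satisfies the wanted label."""
    return [e for e in expressions if wanted.matches(e.label)]


# Placeholder spellings only; the labels come from the request quoted above.
EXAMPLES = [
    Expression("placeholder-1", Label("ood", "Akimel", "Saxton")),
    Expression("placeholder-2", Label("ood", "Tohono", "Alvarez-Hale")),
    Expression("placeholder-3", Label("nav")),
]
```

With this, select(EXAMPLES, Label("ood")) returns both O'Odham entries, 
while select(EXAMPLES, Label("ood", orthography="Saxton")) returns only 
the first. What is missing is an agreed code to put in each of those 
fields.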

Many languages, like Dutch, have regular changes to the 
official orthography. For the Dutch language this happens every ten 
years. It is imho important to be able to tag text and words correctly 
with the orthography used. Only indicating something as Dutch is 
minimalistic and prevents the use of content for other purposes. It is 
not clear to me how you deal with this. I would not be surprised if 
this needs to be part of a standard too.

Thanks,
     GerardM

Sources:
http://wiktionaryz.org
http://wiktionary.org
http://wiktionaryz.org/Category:Language_portals
http://www.eionet.europa.eu/gemet
http://www.iana.org/assignments/language-subtag-registry
http://wiktionaryz.org/DefinedMeaning