Reactions on the WiktionaryZ answers

Gerard Meijssen gerardm at wiktionaryz.org
Tue Nov 14 00:52:42 CET 2006


Hoi,
WiktionaryZ is based on relational database technology. For some 
Expressions we have more data than for others. "Water" has many 
translations, synonyms, relations and it is part of at least two 
collections. A language can be indicated with its ISO-639-3 but also 
with its IANA language subtag. Both are correct, both are informative 
and both serve their purpose. I would like to add the IANA language 
subtags certainly for the languages we add content to as this can be 
found on the Internet. :)
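
A minimal sketch of how one language record could carry both codes in a 
relational table, so that either code finds the same language. The table 
layout, the column names and the find_language helper are assumptions 
made for illustration; this is not the actual WiktionaryZ schema.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE language (
        iso639_3    TEXT PRIMARY KEY,
        iana_subtag TEXT UNIQUE,
        name        TEXT NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO language VALUES (?, ?, ?)",
    [
        ("nld", "nl", "Dutch"),
        ("nap", "nap", "Neapolitan"),
        ("zho", "zh", "Chinese"),
    ],
)


def find_language(code):
    """Look a language up by either its ISO-639-3 code or its IANA subtag."""
    return conn.execute(
        "SELECT iso639_3, iana_subtag, name FROM language "
        "WHERE iso639_3 = ? OR iana_subtag = ?",
        (code, code),
    ).fetchone()


print(find_language("nl"))
print(find_language("nld"))
```

Both lookups return the same row, which is the point: the two codes are 
alternative keys to one record rather than competing identifiers.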

I have learned a little about ISO-639-6. I had the privilege to see some 
of its data. Some of the ISO-639-6 languages/dialects coincide with 
requests for Wikimedia Foundation project codes. I was really happy with 
what it said about the nap language; it indicated several dialects. What 
impressed me most was that they were indicated as spoken Neapolitan. I 
discussed this later with Sabine Cretella, who is a bureaucrat and the 
driving force of the nap.wikipedia; it confirmed what she knows about 
her language. It was also interesting for me that this standard will use 
four-character codes. This is what I suggested some time ago for the 
codes made up within the Wikimedia Foundation, instead of making up 
codes that can be mistaken for ISO-639 codes. Now I have learned that 
we can ask for an IANA language subtag and expect a response within a 
limited time frame, which is something that I will propose for the WMF.

WiktionaryZ will include words that are from orthographies like the 1996 
or the 2006 Dutch "Groene boekje" but also the competing "Witte 
spelling". This level of granularity is as far as I can see currently 
lacking from both IANA and ISO-639-6. One application that we envision 
is to create spell checkers to be used when doing an OCR of old texts.

Does someone have the data defined under RFC 4646 available in something 
like a spreadsheet? It would be much easier for me to process. I hope 
Frank's program to export to something like Access will materialise 
publicly and freely soon. :)
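
As a rough sketch of what such a conversion could look like, assuming 
the registry is in the record-jar text format that RFC 4646 describes 
(records separated by "%%" lines, "Field: value" pairs, folded lines 
continued with leading whitespace). The function names parse_registry 
and to_csv, the choice of columns and the small sample (including its 
File-Date) are illustrative assumptions; this is not Frank's program.

```python
import csv
import io

# Illustrative sample in record-jar layout; not a verbatim registry copy.
SAMPLE = """\
File-Date: 2006-11-01
%%
Type: language
Subtag: nap
Description: Neapolitan
Added: 2005-10-16
%%
Type: language
Subtag: nl
Description: Dutch
Description: Flemish
Added: 2005-10-16
Suppress-Script: Latn
%%
Type: variant
Subtag: 1996
Description: German orthography
  of 1996
Added: 2005-10-16
Prefix: de
"""


def parse_registry(text):
    """Split record-jar text into a list of {field: [values]} records."""
    records, current, last_key = [], {}, None
    for line in text.splitlines():
        if line.strip() == "%%":
            # Record separator: store what we have and start a new record.
            if current:
                records.append(current)
            current, last_key = {}, None
        elif line[:1] in (" ", "\t") and last_key is not None:
            # Folded line: append to the most recent value of the last field.
            current[last_key][-1] += " " + line.strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            last_key = key.strip()
            # Fields such as Description may repeat, so keep a list.
            current.setdefault(last_key, []).append(value.strip())
    if current:
        records.append(current)
    return records


def to_csv(records, columns):
    """Write records as CSV; repeated fields are joined with '; '."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    for rec in records:
        if "Type" not in rec:
            continue  # skip the File-Date header record
        writer.writerow(["; ".join(rec.get(col, [])) for col in columns])
    return out.getvalue()


records = parse_registry(SAMPLE)
csv_text = to_csv(records, ["Type", "Subtag", "Description", "Prefix"])
print(csv_text)
```

The resulting CSV text can be saved to a file and opened in any 
spreadsheet program. Joining repeated fields with "; " keeps one row per 
subtag, at the cost of having to split those cells again if needed.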

I had a look at the Open Language Archive Community. One problem with 
this resource is that it may be Open but it is not Free. Our data will 
be available under a GFDL or a CC-by license. When the restrictions of 
this community are such that it makes our data less Free, it sadly 
cannot be used by us. I have not spent enough time yet to know where we 
stand on this.

SIL maintains ISO-639-3, which includes many dead languages. I am sure 
that there is not much experience yet with requesting new languages from 
SIL. As the sole maintainer of this standard they will, I imagine, be 
responsible in performing their task. Given the low number of languages 
in ISO-639-2, I have little evidence that expanding it by requesting new 
languages worked well.

WiktionaryZ is a lexicological resource. By saying that a word is zh, I 
specify precious little about this word. It can be anything. There is 
not much I can do with it. This is also implicit in it being recognised 
as a macrolanguage. "You have to give users what they want" sounds very 
much like giving in to politics. Calling this imperialistic? Hell no, it 
is more like empirical.

I do not fear that either the ISO-639-3 or the ISO-639-6 codes will 
change. This is about as unlikely as it is for IANA subtags. The ISO-639 
documentation is quite clear about this. The worst that can happen is 
that certain codes are deprecated, and this happens to IANA codes too. 
As I explained, including the IANA codes in WiktionaryZ as well is 
feasible and worthwhile.

Including ISO-639-5 and ISO-639-6 type information is something that we 
can do. This includes relating languages to their "language 
collections". In WiktionaryZ this information will be human-readable 
as well. The information in relations will display differently depending 
on the language selected for the User Interface. I understand that the 
information for ISO-639-6 is not yet finished enough to be accepted as a 
standard. However, from what I have been privileged to see, a lot of 
great information is there. Information that is hard to find elsewhere. 
It is not the same as registering for a subtag; in order to convince 
people that a request is reasonable, you have to provide the arguments 
that have already been made. From this perspective it is sad that the 
work on so many standards is done in secrecy/seclusion/isolation.

I thank the many people who read and specifically those who responded to 
my WiktionaryZ post.
Thanks,
      Gerard

Sources:
http://wiktionaryz.org/Expression:water
http://nap.wikipedia.org/wiki/Utente:SabineCretella
http://www.hetgroeneboekje.nl/
http://www.wittespelling.nl/
http://www.gnu.org/copyleft/fdl.html
http://creativecommons.org/licenses/by/2.0/
http://www.google.com/search?sourceid=navclient-ff&ie=UTF-8&rls=GGGL,GGGL:2006-18,GGGL:en&q=define%3Aempirical

