Reactions on the WiktionaryZ answers
gerardm at wiktionaryz.org
Tue Nov 14 00:52:42 CET 2006
WiktionaryZ is based on relational database technology. For some
Expressions we have more data than for others. "Water" has many
translations, synonyms and relations, and it is part of at least two
collections. A language can be indicated with its ISO-639-3 code but
also with its IANA language subtag. Both are correct, both are
informative and both serve their purpose. I would like to add the IANA
language subtags, certainly for the languages we add content to, as
these can be found on the Internet. :)
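As a minimal sketch of how both codes could sit side by side in a
relational table: the table name `language`, its columns and the
helper functions below are hypothetical and are not the actual
WiktionaryZ schema. The point is only that either code can lead to the
same language record.

```python
import sqlite3


def build_language_table():
    """Create an in-memory table holding both codes for each language.

    The schema is an illustrative assumption, not the WiktionaryZ one.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE language ("
        " name TEXT PRIMARY KEY,"
        " iso639_3 TEXT NOT NULL UNIQUE,"
        " iana_subtag TEXT NOT NULL UNIQUE)"
    )
    conn.executemany(
        "INSERT INTO language VALUES (?, ?, ?)",
        [
            ("English", "eng", "en"),
            ("Dutch", "nld", "nl"),
            ("Neapolitan", "nap", "nap"),
        ],
    )
    return conn


def lookup(conn, code):
    """Return (name, iso639_3, iana_subtag) for either kind of code."""
    return conn.execute(
        "SELECT name, iso639_3, iana_subtag FROM language"
        " WHERE iso639_3 = ? OR iana_subtag = ?",
        (code, code),
    ).fetchone()


conn = build_language_table()
print(lookup(conn, "nl"))
print(lookup(conn, "nld"))
```

Both lookups return the same Dutch row, so a user can enter whichever
code he or she happens to know.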
I have learned a little about ISO-639-6. I had the privilege to see some
of its data. Some of the ISO-639-6 languages/dialects coincide with
requests for Wikimedia Foundation project codes. I was really happy with
what it said about the nap language; it indicated several dialects. What
impressed me most was that they were indicated as spoken Neapolitan. I
discussed this later with Sabine Cretella, who is a bureaucrat and the
driving force of the nap.wikipedia; it confirmed what she knows about
her language. It was also interesting for me that this standard will
use four-character codes. This is what I suggested some time ago for
the codes made up within the Wikimedia Foundation, instead of making up
codes that can be mistaken for ISO-639 codes. Now I have learned that
we can ask for an IANA language subtag and expect a response within a
limited time frame, which is something that I will propose for the WMF.
WiktionaryZ will include words that are from orthographies like the 1996
or the 2006 Dutch "Groene boekje" but also the competing "Witte
spelling". This level of granularity is, as far as I can see, currently
lacking from both IANA and ISO-639-6. One application that we envision
is to create spell checkers to be used when doing an OCR of old texts.
Does anyone have the data available under RFC 4646 in something like a
spreadsheet? It would be much easier for me to process. I hope Frank's
program to export to something like Access will materialise publicly
and freely soon. :)
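In the meantime, a rough sketch of how the registry text could be
turned into spreadsheet-friendly CSV. It assumes the record-jar layout
the registry uses: records separated by a line containing "%%",
"Field: value" lines, and continuation lines that start with
whitespace. The `SAMPLE` text, the function names and the choice of
columns are illustrative assumptions; the sample is not a copy of the
real registry file.

```python
import csv
import io

# Illustrative sample in the registry's record-jar layout (not real data).
SAMPLE = """\
File-Date: 2006-11-01
%%
Type: language
Subtag: nap
Description: Neapolitan
Added: 2005-10-16
%%
Type: language
Subtag: nl
Description: Dutch
Description: Flemish
Added: 2005-10-16
%%
Type: language
Subtag: iw
Description: Hebrew
Added: 2005-10-16
Deprecated: 1989-01-01
Preferred-Value: he
"""


def parse_registry(text):
    """Split record-jar text into a list of {field: [values]} dicts."""
    records, current, last_key = [], {}, None
    for line in text.splitlines():
        if line.strip() == "%%":
            # Record separator: store what we have and start afresh.
            if current:
                records.append(current)
            current, last_key = {}, None
        elif line[:1] in (" ", "\t") and last_key:
            # Continuation line: append to the previous field's value.
            current[last_key][-1] += " " + line.strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            last_key = key.strip()
            current.setdefault(last_key, []).append(value.strip())
    if current:
        records.append(current)
    return records


def to_csv(records, columns=("Type", "Subtag", "Description", "Added",
                             "Deprecated", "Preferred-Value")):
    """Write the subtag records as CSV; repeated fields are joined by '; '."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    for rec in records:
        if "Type" not in rec:
            continue  # skip the File-Date header record
        writer.writerow(["; ".join(rec.get(c, [])) for c in columns])
    return out.getvalue()


print(to_csv(parse_registry(SAMPLE)))
```

The resulting text can be saved with a .csv extension and opened in a
spreadsheet. The Deprecated and Preferred-Value columns are kept so
that deprecated subtags remain visible next to their replacements.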
I had a look at the Open Language Archives Community. One problem with
this resource is that it may be Open but it is not Free. Our data will
be available under a GFDL or a CC-by license. If the restrictions of
this community are such that they make our data less Free, it sadly
cannot be used by us. I have not spent enough time yet to know where we
stand on this.
SIL maintains the ISO-639-3. ISO-639-3 includes many dead languages. I
am sure that there is not much experience with requesting new languages
from SIL. As they are the sole maintainer of this standard, I imagine
they will be responsible in performing their task. Given the low number
of languages in ISO-639-2, I have little evidence that expanding it by
requesting new languages worked well.
WiktionaryZ is a lexicological resource. By saying that a word is zh, I
specify precious little about this word. It can be anything. There is
not much I can do with it. This is also implicit in it being recognised
as a macrolanguage. "You have to give users what they want" sounds very
much like giving in to politics. Calling this imperialistic? Hell no, it
is more like empirical.
I do not fear that either the ISO-639-3 or the ISO-639-6 codes will
change. I expect that to be as unlikely as it is for IANA subtags. The
ISO-639 documentation is quite clear about this. The worst that can
happen is that certain codes are deprecated, and this happens to IANA
codes too. As I explained, including the IANA codes in WiktionaryZ as
well is feasible and worthwhile.
Including ISO-639-5 and ISO-639-6 type information is something that we
can do. This includes relating languages to their "language
collections". In WiktionaryZ this information will be human-readable
as well. The information in relations will display differently depending
on the language selected for the User Interface. I understand that the
information for ISO-639-6 is not yet finished enough to be accepted as a
standard. However, from what I have been privileged to see, a lot of
great information is there, information that is hard to find elsewhere.
It is not the same as registering for a subtag; in order to convince
people that a request is reasonable you have to provide the arguments
that have already been made. From this perspective it is sad that the
work on so many standards is done in secrecy/seclusion/isolation.
I thank the many people who read my WiktionaryZ post, and specifically
those who responded to it.
More information about the Ietf-languages mailing list