WiktionaryZ language codes

Mon Nov 13 18:01:27 CET 2006

Gerard Meijssen scripsit:

> The ISO-639-3 code list is the biggest list of languages that have some form 
> of recognition. It is therefore our best option to adopt this for our 
> initial list of languages. There are issues with several of these codes 
> but it allows for an operational start. It also has in Ethnologue a 
> maintainer that has earned its reputation as an organisation with a strong 
> interest in linguistics. A "simple" single code allows people to find 
> their language. It also helps us to force people to understand that 
> WiktionaryZ does not have zh or zho and that we consequently do not 
> accept words to be registered in that way.

Feel free to use *draft* ISO 639-3 codes if (a) you are willing to accept
the risk that they might change and (b) you don't care about backward
compatibility with existing uses of language tags.

The cost of mitigating (b) is fairly small:  (1) use ISO 639-1 codes
when available, and (2) prefix each code for a language that is encompassed
by a macrolanguage with the 639-1 or 639-3 code for that macrolanguage.
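
As a rough sketch of rules (1) and (2), assuming small hand-written
excerpts of the 639-1 and macrolanguage mappings (the table names and
the function `language_tag` are made up for illustration; real tables
would be loaded from the published ISO 639-3 data):

```python
# Illustrative excerpts only, not complete tables.
# 639-3 code -> 639-1 code, where a 2-letter code exists.
ISO_639_1 = {"nld": "nl", "eng": "en", "zho": "zh", "ara": "ar"}

# 639-3 code of an encompassed language -> 639-3 code of its macrolanguage.
MACROLANGUAGE = {"cmn": "zho", "yue": "zho", "arz": "ara"}


def shortest(code):
    """Rule (1): prefer the ISO 639-1 code when one is available."""
    return ISO_639_1.get(code, code)


def language_tag(code3):
    """Build a backward-compatible tag from a 639-3 code.

    Rule (2): a language encompassed by a macrolanguage is prefixed
    with the (shortest) code for that macrolanguage.
    """
    macro = MACROLANGUAGE.get(code3)
    if macro is not None:
        return f"{shortest(macro)}-{shortest(code3)}"
    return shortest(code3)


print(language_tag("cmn"))  # Mandarin, encompassed by Chinese (zho/zh)
print(language_tag("nld"))  # Dutch, has a 639-1 code
print(language_tag("haw"))  # Hawaiian, no 639-1 code, no macrolanguage
```

Under these assumptions Mandarin comes out as zh-cmn rather than bare
cmn, so software that only knows "zh" still matches it by prefix.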

> The IANA language subtag registry is not a list that would work for us 
> as is, because it is based solely on ISO-639-1 and ISO-639-2 and from 
> our perspective these just do not cut it. The registry is very much 
> incomplete and consequently it is not as usable as ISO-639-3. 

We know.  We can't accept either (a) or (b) above, given our highly
conservative position on never changing a code.  Our only recourse
is patience.

> We are quite happy to 
> have an "IANA language subtag" collection whereby we advertise these 
> codes as well. Ideally every ISO-639-3 tagged language will have an IANA 
> language subtag. 

It will, and the differences will be trivial: see above.

> In many of the answers to the Wikimedia Foundation mail, people referred 
> to things like ISO-639-5 and 6. 

639-5 is about collections of languages: it will be a superset of the
language collection codes in 639-2.  As such, it probably isn't relevant
to your work.  639-5 codes will be drawn from the same 3-alpha code space
as 639-2 and 639-3.

ISO 639-6 will provide codes for language collections, languages, and
language varieties, drawn from a 4-alpha code space.

> I understand that ISO-639-6 has 
> relations, we can include these relations in WiktionaryZ. 

639-6 will provide a single hierarchy organizing its various collections,
languages, and variants.  This hierarchy is explicitly not stabilized
and is subject to change.

> For the others (including dialects) we would like to know if 
> they have been recognised. And if not, what it takes to get what you 
> would consider appropriate codes. I am not sure what ISO ? standard 
> would deal with dialects.

ISO 639-6 will, when it is published, some time in 2007 or 2008.
You can register variant subtags for dialects with IETF right now.

> Many languages like the Dutch language, have regular changes to the 
> official orthography. For the Dutch language this happens every ten 
> years. It is imho important to be able to tag text and words correctly 
> to the orthography used. Only indicating something as Dutch is 
> minimalistic and prevents the use of content for other purposes. It is 
> not clear to me how you deal with this. I would not be surprised if 
> this needs to be part of a standard too.

Same answer.
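
For a concrete picture of what a variant subtag looks like, here is a
minimal sketch that checks the variant syntax given in RFC 4646 (5 to 8
alphanumerics, or a digit followed by 3 alphanumerics) and appends the
variant to a language subtag.  The function names are made up for
illustration, and being well-formed is not the same as being
registered: "de-1996" uses a variant that is in the registry, whereas
"nl-2005" is a purely hypothetical tag for a Dutch orthography.

```python
import re

# Variant subtag syntax per RFC 4646:
#   5 to 8 alphanumerics, or a digit followed by 3 alphanumerics.
VARIANT = re.compile(r"(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3})")


def is_wellformed_variant(subtag):
    """Check syntax only; this does not consult the IANA registry."""
    return VARIANT.fullmatch(subtag) is not None


def tag_with_variant(language, variant):
    """Append a syntactically valid variant subtag to a language subtag."""
    if not is_wellformed_variant(variant):
        raise ValueError(f"not a well-formed variant subtag: {variant!r}")
    return f"{language}-{variant.lower()}"


print(tag_with_variant("de", "1996"))  # registered German orthography variant
print(tag_with_variant("nl", "2005"))  # hypothetical, not a registered variant
```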

-- 
No,  John.  I want formats that are actually       John Cowan
useful, rather than over-featured megaliths that   http://www.ccil.org/~cowan
address all questions by piling on ridiculous      cowan at ccil.org
internal links in forms which are hideously
over-complex. --Simon St. Laurent on xml-dev