Macrolanguages, countries & orthographies

CE Whitehead cewcathar at
Wed Feb 14 20:59:01 CET 2007

>Mark Davis wrote:
> > Assume that old Czech is as different from modern as fro is from fr.
>But is this a real problem?  How much total literature is written
>and available in different variations of Czech?  My prejudice says
>that as a nation with a language and literature of its own, Czech
>is about as young as Finnish, Norwegian or Serbian, i.e. 19th
>century.  Can you give any concrete examples when not having a
>separate *code* for pre-renaissance Czech is a practical problem?
>Linguists of course have *names* for Swedish of all ages, but I
>see no real use for having ISO or the IETF specify language
>*codes*.  I could be wrong, but if so please enlighten and correct
>me.  Nobody is going to translate OpenOffice or Mozilla to the
>language spoken by vikings (Old Norse) or the Swedish used during
>the Lutheran reformation (called New Swedish, ironically).
>Yes, there is now a branch of Wikipedia in Old English
>(, but that is a rare exception.  I don't expect
>this to happen in other languages.  Ang has now 744 articles,
>compared to the 11,000 articles of the Latin Wikipedia.
More Old English:

There may also be some Czech texts to scan in; its written form dates to the 
14th century:

I think that a major modern language deserves language tags for its 
historical variants, and that a language with an ancient literature likewise 
deserves a tag for one or more historical forms (within reason: enough to tag 
what is out there).
>I'm scanning old books, and I'm starting to see a practical
>problem with different orthographies and spelling reforms, similar
>to those addressed with the IETF defined codes for German de-1901
>and de-1996.  Analogous to these codes, we could perhaps find use
>for sv-1801, sv-1889, sv-1906, da-1775, da-1892 and da-1948,
>because we now have *significant amounts* of text online in each
>of these language versions. But before 1775/1801 the orthography
>of Swedish and Danish varies so heavily with each work, that it
>becomes pretty much useless to try to identify more versions.
>And before that time, there is also so small amounts of literature
>available, that any automatic processing becomes insignificant.

You are right: spelling was not strict, and many varieties of language are 
in use in older texts.
What is worth identifying, I think, is how far the texts in these languages 
are from the modern form.  (You can see how different the text of Beowulf is 
from modern English; it is probably hardly comprehensible to most speakers, 
and in fact one might argue that it is about as close to modern Danish as it 
is to modern English.)

Also, in some cases, such as Old French, there is enough similarity from 
text to text that there is a learnable language out 

(Incidentally, the written Arabic inscriptions of 0 A.D. vary little from 
Modern Standard Arabic, but how modern spoken Arabic varies!  For me that 
was overwhelming.)

--C. E. Whitehead
cewcathar at


More information about the Ietf-languages mailing list