[Ltru] Re: Ietf-languages Digest, Vol 50, Issue 15

Peter Constable petercon at microsoft.com
Thu Feb 15 21:47:53 CET 2007


Adopting 1 would mean adopting it generally across all of ISO 639-3: all entries of individual-language scope would encompass the corresponding historic varieties. But then, note that historic varieties are relevant only in cases of languages with a long literary tradition that is preserved. For instance, there may have been an old Naskapi that is distinct from the modern descendant, but there never have been and never will be any records in this putative language, so there is zero need for an identifier that encompasses it.

(Btw, please note: we do *not* code reconstructed protolanguages. ISO 639-3 is explicit about that. So please don’t anybody suggest we’d be coding proto-Naskapi.)

So really we’re just talking about some limited set of cases with a literary tradition.

Note that we’re also only talking about cases in which languages were well-enough developed to maintain a single identity over several hundred years. That’s what distinguishes a “historic” language from an “extinct” language. For instance, there are historic documents in a pre-Columbian Mixtec variety, but that language identity is not preserved by one specific modern Mixtec variety. And I reject the notion that pre-Columbian Mixtec together with all the modern Mixtec varieties is a macrolanguage unless someone makes the case that there’s a user scenario in which it is appropriate to treat all those varieties as one language.

So the number of relevant cases is fairly constrained. I don’t know just how many there would be, but it’s going to be a small fraction of all modern languages for which this is relevant.

I have a concern with 1: it would detract from interoperability, for the kinds of reasons Anthony mentions. There is a very large amount of usage in which “eng” is intended to mean specifically modern English, and a very large amount of usage in which “ces” is intended to mean specifically modern Czech. I don’t see who it would help to decide that these IDs encompass Old English and Old Czech respectively: the average modern-language user isn’t likely to be cataloguing content in Old English or Old Czech, and certainly won’t be helped by having queries return records in the historic varieties. As for the specialists, they certainly don’t want to catalog all content as “eng” and “ces”, as Anthony has made clear. The only scenario in which maybe someone is helped is when the specialist wants a query to return records for all historic varieties. I don’t see why they can’t use a Boolean operator for that, but even if there were enough need for a single ID, I wouldn’t be inclined to use “eng” and “ces” for that purpose: that would be helping the 0.01% scenario to the detriment of 99.99% of users.
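To make the Boolean-operator point concrete, here is a minimal sketch, not a description of any real catalogue system. The record list and the search() helper are hypothetical; "enm" (Middle English) is an existing ISO 639 code, but the thread itself does not mention it, and The Canterbury Tales is only an illustrative title. Each code is assumed to name a single period, as in the disjoint reading.

```python
# Hypothetical catalogue: (title, language code). Assumes the disjoint
# reading, where each code names exactly one period of English.
records = [
    ("Beowulf", "ang"),               # Old English
    ("The Canterbury Tales", "enm"),  # Middle English (illustrative entry)
    ("Pride and Prejudice", "eng"),   # modern English
]

def search(records, *codes):
    """Return titles whose code matches any of the given codes (Boolean OR)."""
    wanted = set(codes)
    return [title for title, code in records if code in wanted]

# The ordinary user asks for "eng" and gets modern material only.
modern_only = search(records, "eng")

# The specialist who wants every period ORs the codes together.
all_periods = search(records, "ang", "enm", "eng")

print(modern_only)
print(all_periods)
```

With disjoint codes the common query stays clean, and the rare "all periods" query costs the specialist only a longer list of codes.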

Thus, I’m inclined towards 2. There is certainly willingness in general on the part of the ISO 639 JAC to code historic languages, so I have no doubt that IDs for things like Old Czech etc. would be provided so long as the need is clear and there’s a sense that the historic boundaries deemed appropriate by philologists, research librarians, etc. are appropriate.



Peter


________________________________
From: Mark Davis [mailto:mark.davis at icu-project.org]
Sent: Thursday, February 15, 2007 8:55 AM
To: Anthony Aristar
Cc: LTRU Working Group; ietf-languages at alvestrand.no
Subject: [Ltru] Re: Ietf-languages Digest, Vol 50, Issue 15

Your quotation below omits the true author, and may leave the impression that I wrote a number of paragraphs that I do not agree with and did not write. I only wrote "Assume that old Czech ..." -- someone else wrote the "But is this a real problem...."

> Mark Davis wrote:
>
> > Assume that old Czech is as different from modern as fro is from fr.
>
> But is this a real problem?  How much total literature is written
...

That being said, there are two models that ISO could be using.

 1.  Overlapping. 'eng' means any English, modern or historic. 'ang' means specifically Old English, a subset of 'eng'. 'ces' means any Czech. There is no tag specifically for Old Czech.

    *   so I could tag Beowulf with 'ang' or 'eng', but Shakespeare, Austen, and Robin Williams only with 'eng'.
    *   Smil Flaška z Pardubic and Václav Havel are both tagged with 'ces'.
    *   Requests for BCP 47 variant tags for Shakespearean English (en-SHAKESPR) or old Czech (cs-OLDCZECH) would be legitimate.
    *   A request for a variant tag for only modern English (en-MODENGL), thus excluding Old English, would be legitimate.

 2.  Disjoint. 'eng' means only modern English, 'ang' means Old English, 'ces' means only modern Czech. There is no tag at all (currently) for Old Czech.

    *   so I could tag Beowulf with 'ang' only.
    *   and there is currently no valid code for tagging Smil Flaška z Pardubic.
    *   A request for BCP 47 variant tags for Shakespearean English (en-SHAKESPR) would be legitimate
    *   A request for a registered old Czech language tag (oldczech) would be legitimate. (However "primary languages are strongly RECOMMENDED for registration with ISO 639, and proposals rejected by ISO 639/RA will be closely scrutinized before they are registered with IANA." )
I don't think they are using model number one, but we need to find out.
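
As a rough sketch of the difference between the two models (an illustration only; the coverage sets below are my assumptions for the sake of the example and are not taken from ISO 639-3), each code can be treated as a set of varieties, and valid_codes() is a hypothetical helper that lists the codes covering a given variety:

```python
# Hypothetical coverage tables; the variety names are illustrative labels.
OVERLAPPING = {
    "eng": {"Old English", "Modern English"},  # any English
    "ang": {"Old English"},                    # a subset of 'eng'
    "ces": {"Old Czech", "Modern Czech"},      # any Czech
}
DISJOINT = {
    "eng": {"Modern English"},
    "ang": {"Old English"},
    "ces": {"Modern Czech"},                   # nothing covers Old Czech
}

def valid_codes(model, variety):
    """Return the sorted codes whose coverage includes the given variety."""
    return sorted(code for code, covers in model.items() if variety in covers)

# Beowulf (Old English)
print(valid_codes(OVERLAPPING, "Old English"))  # ['ang', 'eng']
print(valid_codes(DISJOINT, "Old English"))     # ['ang']

# Smil Flaška z Pardubic (Old Czech)
print(valid_codes(OVERLAPPING, "Old Czech"))    # ['ces']
print(valid_codes(DISJOINT, "Old Czech"))       # []
```

The empty result in the last line is the gap that a new code or registered tag for Old Czech would have to fill under the disjoint model.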

Mark
On 2/15/07, Anthony Aristar <aristar at linguistlist.org> wrote:
With all due respect, this seems like a very odd discussion from my
perspective  as a linguistics professor.  The discussion seems to
presuppose that all that matters is whether Microsoft is going to one
day produce a version of Word in Middle High German or Old English, or
how many texts exist in a language.

But the ISO 639 codes are used for much more than this.  In particular,
they are used to ensure interoperability, allowing material of the same
linguistic nature to be found in searches, and to be compared using the
linguistic ontologies that are now being developed.  If I am a scholar
searching for texts in Old English (or Old High German, for that
matter) and everyone has been cavalier enough to code such material
with eng and deu, what the search engines return will be utterly
useless to me.  I am going to be flooded with such a quantity of
material in Modern English and Modern German that searching through it
will be essentially impossible.

So if you really believe that it doesn't matter if you code English
material as eng, whatever its period, what you're really saying is that
you don't really care about interoperability, and that you don't really
care about scholarship.

                **************************************
Anthony Aristar, Director, Institute for Language Information & Technology
                   Professor of Linguistics
Moderator, LINGUIST               Principal Investigator, EMELD Project
Linguistics Program
Dept. of English                  aristar at linguistlist.org
Eastern Michigan University            2000 Huron River Dr, Suite 104
Ypsilanti, MI 48197
U.S.A.

URL: http://linguistlist.org/aristar/
                **************************************

> Mark Davis wrote:
>
> > Assume that old Czech is as different from modern as fro is from fr.
>
> But is this a real problem?  How much total literature is written
> and available in different variations of Czech?  My prejudice says
> that as a nation with a language and literature of its own, Czech
> is about as young as Finnish, Norwegian or Serbian, i.e. 19th
> century.  Can you give any concrete examples when not having a
> separate *code* for pre-renaissance Czech is a practical problem?
>
> Linguists of course have *names* for Swedish of all ages, but I
> see no real use for having ISO or the IETF specify language
> *codes*.  I could be wrong, but if so please enlighten and correct
> me.  Nobody is going to translate OpenOffice or Mozilla to the
> language spoken by vikings (Old Norse) or the Swedish used during
> the Lutheran reformation (called New Swedish, ironically).
>
> Yes, there is now a branch of Wikipedia in Old English
> (ang.wikipedia.org), but that is a rare exception.  I don't expect
> this to happen in other languages.  Ang has now 744 articles,
> compared to the 11,000 articles of the Latin Wikipedia.








--
Mark