[Ltru] Re: Ietf-languages Digest, Vol 50, Issue 15

Peter Constable petercon at microsoft.com
Fri Feb 16 16:20:41 CET 2007


The JAC has various issues I need to get them to review. I will include this among them.

There's one scenario I think is of interest: Suppose there are currently only a handful of documents in non-modern Czech (I don't know what the actual situation is) - not enough that anyone feels a particular need to request separate IDs for historical forms. What then? Well, I guess they would use "ces"; that's a little like someone wondering if a letterform is a distinct character and deciding to treat it like a font variant and not requesting a new character in Unicode/ISO 10646. But then suppose later someone discovers some lost library of Old and Middle Czech documents. What then? At that point, they'd probably request separate IDs for the historical forms, and they might argue that this is splitting the historical variants from the single category - i.e. that would mean a new ID for the modern language as well as the historical languages, and "ces" would automatically become a macrolanguage.

Of course, the scenario playing out that way depends on users initially deciding to use "ces" for the historical records and not requesting separate IDs, and on the JAC deciding that there was established usage of "ces" for multiple variants. (By analogy, UTC wouldn't necessarily decide that the existing character was ambiguous and that *two* new, unambiguous characters are needed.)


Peter

________________________________
From: mark.edward.davis at gmail.com [mailto:mark.edward.davis at gmail.com] On Behalf Of Mark Davis
Sent: Thursday, February 15, 2007 2:34 PM
To: Peter Constable
Cc: LTRU Working Group; ietf-languages at alvestrand.no
Subject: Re: [Ltru] Re: Ietf-languages Digest, Vol 50, Issue 15

I'm inclined towards #2 also, for the reasons you cite. My primary concern, however, is to get a definitive statement from the RA as to which of the policies is true. That, for me, is far more important than which policy is actually used.

Mark
On 2/15/07, Peter Constable <petercon at microsoft.com> wrote:

Adopting 1 would mean adopting it generally across all of ISO 639-3: all entries of individual-language scope encompass corresponding historic varieties. But then, note that historic varieties are relevant only in cases of languages with a long literary tradition that is preserved. For instance, there may have been an old Naskapi that is distinct from the modern descendant, but there never have been and never will be any records in this putative language, so there is zero need for an identifier that encompasses it.



(Btw, please note: we do *not* code reconstructed protolanguages. ISO 639-3 is explicit about that. So please don't anybody suggest we'd be coding proto-Naskapi.)



So really we're just talking about some limited set of cases with a literary tradition.



Note that we're also only talking about cases in which languages were well-enough developed to maintain a single identity over several hundred years. That's what distinguishes a "historic" language from an "extinct" language. For instance, there are historic documents in a pre-Columbian Mixtec variety, but that language identity is not preserved by one specific modern Mixtec variety. And I reject the notion that pre-Columbian Mixtec together with all the modern Mixtec varieties is a macrolanguage unless someone makes the case that there's a user scenario in which it is appropriate to treat all those varieties as one language.



So the number of relevant cases is fairly constrained. I don't know just how many there would be, but it's going to be a small fraction of all modern languages for which this is relevant.



I have a concern with 1 that it would detract from interoperability, for the kinds of reasons Anthony mentions. There is a very large amount of usage in which "eng" is intended to mean specifically modern English, and a very large amount of usage in which "ces" is intended to mean specifically modern Czech. I don't see who it would help to decide that these IDs encompass Old English and Old Czech respectively: the average modern-language user isn't likely to be cataloguing content in Old English and Old Czech, and they certainly aren't going to be helped by having queries return records in the historic varieties. As for the specialist, they certainly don't want to catalog content as all "eng" and "ces", as Anthony has made clear. The only scenario in which maybe someone is helped is when the specialist wants a query to return records for all historic varieties. I don't see why they can't use a Boolean operator for that, but even if there was enough need for a single ID, I wouldn't be inclined to use "eng" and "ces" for that purpose: that would be helping the 0.01% scenario to the detriment of 99.99% of users.
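The Boolean-operator point can be sketched roughly as follows. This is only an illustration under stated assumptions: the catalog records and the search helper are hypothetical, and "enm" (the ISO 639 code for Middle English) is not mentioned in this thread but is included to show a query that spans several historic varieties.

```python
# Hypothetical catalog of (title, language code) records.
# The records are illustrative assumptions, not real catalog data.
CATALOG = [
    ("Beowulf", "ang"),               # Old English
    ("The Canterbury Tales", "enm"),  # Middle English
    ("Pride and Prejudice", "eng"),   # modern English
    ("Modern Czech essay", "ces"),    # modern Czech
]


def search(catalog, codes):
    """Return titles whose code is any of `codes` (a Boolean OR query)."""
    wanted = set(codes)
    return [title for title, code in catalog if code in wanted]


# Under option 2, "eng" alone returns only modern English records.
modern_only = search(CATALOG, ["eng"])

# A specialist who wants every period ORs the codes together.
all_english = search(CATALOG, ["ang", "enm", "eng"])
```

With disjoint codes the common query stays precise, and the rarer "all periods" query costs only a longer list of codes.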



Thus, I'm inclined towards 2. There is certainly willingness in general on the part of the ISO 639 JAC to code historic languages, so I have no doubt that IDs for things like Old Czech etc. would be provided so long as the need is clear and there's a sense that the historic boundaries deemed appropriate by philologists, research librarians, etc. are appropriate.







Peter





________________________________

From: Mark Davis [mailto:mark.davis at icu-project.org]
Sent: Thursday, February 15, 2007 8:55 AM
To: Anthony Aristar
Cc: LTRU Working Group; ietf-languages at alvestrand.no
Subject: [Ltru] Re: Ietf-languages Digest, Vol 50, Issue 15



Your quotation below omits the true author, and may leave the impression that I wrote a number of paragraphs that I do not agree with and did not write. I only wrote "Assume that old Czech ..." -- someone else wrote the "But is this a real problem...."

> Mark Davis wrote:
>
> > Assume that old Czech is as different from modern as fro is from fr.
>
> But is this a real problem?  How much total literature is written
...

That being said, there are two models that ISO could be using.

 1.  Overlapping. 'eng' means any English, modern or historic. 'ang' means specifically Old English, a subset of 'eng'. 'ces' means any Czech. There is no tag specifically for Old Czech.

    *   so I could tag Beowulf with 'ang' or 'eng', but Shakespeare, Austen, and Robin Williams only with 'eng'.
    *   Smil Flaška z Pardubic and Václav Havel are both tagged with 'ces'.
    *   Requests for BCP 47 variant tags for Shakespearean English (en-SHAKESPR) or old Czech (cs-OLDCZECH) would be legitimate.
    *   A request for a variant tag for only modern English (en-MODENGL), thus excluding Old English, would be legitimate.

 2.  Disjoint. 'eng' means only modern English, 'ang' means Old English, 'ces' means only modern Czech. There is no tag at all (currently) for Old Czech.

    *   so I could tag Beowulf with 'ang' only.
    *   and there is no valid current code for tagging Smil Flaška z Pardubic.
    *   A request for BCP 47 variant tags for Shakespearean English (en-SHAKESPR) would be legitimate
    *   A request for a registered old Czech language tag (oldczech) would be legitimate. (However "primary languages are strongly RECOMMENDED for registration with ISO 639, and proposals rejected by ISO 639/RA will be closely scrutinized before they are registered with IANA." )
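As a rough sketch, the difference between the two models can be written as two lookup tables. Everything here is an assumption made for illustration: the tables are built only from the examples in this message, the (language, period) labels and the valid_codes helper are hypothetical, and none of it is data taken from ISO 639-3.

```python
# Each code maps to the set of (language, period) varieties it is taken
# to cover. Both tables are assumptions based on the examples above.
OVERLAPPING = {
    "eng": {("English", "old"), ("English", "modern")},  # any English
    "ang": {("English", "old")},                         # subset of 'eng'
    "ces": {("Czech", "old"), ("Czech", "modern")},      # any Czech
}

DISJOINT = {
    "eng": {("English", "modern")},  # modern English only
    "ang": {("English", "old")},     # Old English only
    "ces": {("Czech", "modern")},    # modern Czech only
}


def valid_codes(model, variety):
    """Return the sorted codes that may tag `variety` under `model`."""
    return sorted(code for code, covered in model.items() if variety in covered)


BEOWULF = ("English", "old")
FLASKA = ("Czech", "old")    # Smil Flaška z Pardubic
HAVEL = ("Czech", "modern")  # Václav Havel

print(valid_codes(OVERLAPPING, BEOWULF))  # ['ang', 'eng']
print(valid_codes(DISJOINT, BEOWULF))     # ['ang']
print(valid_codes(OVERLAPPING, FLASKA))   # ['ces']
print(valid_codes(DISJOINT, FLASKA))      # []
```

The empty result for Old Czech under the disjoint table corresponds to the case above where no valid code currently exists.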

I don't think they are using model number one, but we need to find out.

Mark

On 2/15/07, Anthony Aristar <aristar at linguistlist.org> wrote:

With all due respect, this seems like a very odd discussion from my
perspective  as a linguistics professor.  The discussion seems to
presuppose that all that matters is whether Microsoft is going to one
day produce a version of Word in Middle High German or Old English, or
how many texts exist in a language.

But the ISO 639 codes are used for much more than this.  In particular,
they are used to ensure interoperability, allowing material of the same
linguistic nature to be found in searches, and to be compared using the
linguistic ontologies that are now being developed.  If I am a scholar
searching for texts in Old English (or Old High German, for that
matter) and everyone has been cavalier enough to code such material
with eng and deu, what the search engines return will be utterly
useless to me.  I am going to be flooded with such a quantity of
material in Modern English and Modern German that searching through it
will be essentially impossible.

So if you really believe that it doesn't matter if you code English
material as eng, whatever its period, what you're really saying is that
you don't really care about interoperability, and that you don't really
care about scholarship.

                **************************************
Anthony Aristar, Director, Institute for Language Information & Technology
                   Professor of Linguistics
Moderator, LINGUIST               Principal Investigator, EMELD Project
Linguistics Program
Dept. of English                  aristar at linguistlist.org
Eastern Michigan University            2000 Huron River Dr, Suite 104
Ypsilanti, MI 48197
U.S.A.

URL: http://linguistlist.org/aristar/
                **************************************

> Mark Davis wrote:
>
> > Assume that old Czech is as different from modern as fro is from fr.
>
> But is this a real problem?  How much total literature is written
> and available in different variations of Czech?  My prejudice says
> that as a nation with a language and literature of its own, Czech
> is about as young as Finnish, Norwegian or Serbian, i.e. 19th
> century.  Can you give any concrete examples when not having a
> separate *code* for pre-renaissance Czech is a practical problem?
>
> Linguists of course have *names* for Swedish of all ages, but I
> see no real use for having ISO or the IETF specify language
> *codes*.  I could be wrong, but if so please enlighten and correct
> me.  Nobody is going to translate OpenOffice or Mozilla to the
> language spoken by vikings (Old Norse) or the Swedish used during
> the Lutheran reformation (called New Swedish, ironically).
>
> Yes, there is now a branch of Wikipedia in Old English
> (ang.wikipedia.org), but that is a rare exception.  I don't expect
> this to happen in other languages.  Ang has now 744 articles,
> compared to the 11,000 articles of the Latin Wikipedia.





_______________________________________________
Ietf-languages mailing list
Ietf-languages at alvestrand.no
http://www.alvestrand.no/mailman/listinfo/ietf-languages



--
Mark

_______________________________________________
Ltru mailing list
Ltru at ietf.org
https://www1.ietf.org/mailman/listinfo/ltru


