639 coding wrt historic varieties (was RE: Request for variant subtag fr 16th-c 17th-c Resubmitted!)

Thu Jan 11 13:55:51 CET 2007

From: mark.edward.davis at gmail.com [mailto:mark.edward.davis at gmail.com] On Behalf Of Mark Davis
Sent: Monday, December 18, 2006 6:35 PM

Mark: Sorry for the delay in responding to this.

> Peter, I'm a little bit fuzzy on where the lines were 
> drawn with historic versions of the same language in ISO. 
> This is relevant to the LTRU group for ISO 639-3, so am 
> cc'ing that group.

Entities coded in 639-1 (at least, any coded in 639-1 prior to 639-2) were probably done for terminology work, which would almost certainly mean they were done for modern forms of languages.

The historic varieties coded in 639-2 probably originated in MARC and hence were done on the basis of the practice of librarians (and those primarily in Western nations). I don't know on what basis they chose the boundaries that they did. 

> I take it from your discussion that "fr" means *only* 
> modern French, and that if I want to have a tag for any 
> French, modern or not, I would have to use (fr OR frm 
> OR fro). Similarly, if I wanted any English, I would 
> have to use (en OR enm OR ang). 

That's my understanding. 
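
As a rough illustration of the "(fr OR frm OR fro)" reading, here is a minimal sketch of matching a tag against a list of language ranges, assuming the basic filtering rule of RFC 4647 (a range matches a tag if it equals the tag or is a prefix of it followed by "-"). The function names and the ANY_FRENCH list are made up for this example and are not part of any registry or library.

```python
def basic_filter_match(language_range: str, tag: str) -> bool:
    """Basic filtering in the style of RFC 4647: case-insensitive,
    exact match or prefix match ending at a "-" boundary."""
    r = language_range.lower()
    t = tag.lower()
    if r == "*":  # wildcard range matches every tag
        return True
    return t == r or t.startswith(r + "-")


def matches_any(ranges, tag: str) -> bool:
    """A list of ranges acts as an OR: one matching range is enough."""
    return any(basic_filter_match(r, tag) for r in ranges)


# Hypothetical "any French" request, spelled out as three ranges because
# "fr" alone is understood here to cover only the modern variety.
ANY_FRENCH = ["fr", "frm", "fro"]

for tag in ["fr", "fr-CA", "frm", "fro", "en"]:
    print(tag, matches_any(ANY_FRENCH, tag))
```

Note that the range "fr" does not match the tag "frm": the prefix rule only applies at a subtag boundary, so each historic variety has to be listed separately.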

Since 639-1/-2 only provide English and French names as a guide to the semantics of each coded entity, it isn't always clear from a given entry what it means. To gain a more complete idea of the semantics of entries, there are a couple of things we can look at: what distinctions are made within the coded set, and how a given ID has been used. (For the latter, MARC is particularly relevant since it was the source from which most of the entities coded in 639-2 were drawn.)

As I mentioned above, "fr" was originally used by terminologists, in which context it very likely has meant the modern variety only. The corresponding alpha-3 ID "fre" was originally used by librarians. While in that context it's more plausible that the ID could have been used in a way that encompassed historic varieties, the contrastive coded entities "frm" and "fro" that they also used suggest that this was probably not how they used it. The MARC Code List for Languages (http://www.loc.gov/marc/languages/) is consistent with that: in describing how "fre" is used, they document it as encompassing varieties from different regions, but not historic varieties.

Now, I admit the question of whether a coded entity like "fr" could encompass historic varieties had not occurred to me until it came up recently in ietf-languages / LTRU (whichever it was). And my guess is that it has never been considered by the JAC either: the idea that an ID might be used for multiple varieties certainly existed in MARC, but it appears that that was never done in MARC with historic varieties; and apart from that, the JAC had no occasion to discuss an ID encompassing multiple varieties (except for the obvious case of collections) until ISO/CD 639-3 introduced macrolanguages. But given the distinctions made in the code set and the usage of 639-1/-2 IDs by terminologists and librarians, I would conclude that ISO 639 currently does not have identifiers that are intended to encompass varieties from multiple time depths.

> My question is how this is managed over time in ISO, 
> since there are significant implications for language tags.

It's a good question, and one I'm sure hasn't been considered by the JAC.

> Let's take Czech, for example, where we only currently have 
> 'cs'. I see the following possibilities. 
>
> 1. This means only Modern Czech.
> That implies that there is no code for Old Czech, so if I 
> want to tag something with that, I need to petition ISO 
> for an language tag for Old Czech (let's say 'ceu'). Once 
> that is added, I can refer to Old Czech. 
>
> 2. This currently means any Czech, but ISO may introduce 
> a code for Old Czech (let's say 'ceu'). 

As I mentioned above, I think for use of 639-1 in terminology probably only modern varieties would have been relevant. As for 639-2, the fact that the practice has been to differentiate between historic varieties could be seen to suggest that IDs refer specifically to modern varieties unless otherwise noted. On the other hand, it's easy to imagine that there are users out there (including librarians) that might have done otherwise.

> I see three possible approaches:
>
> 2a. The denotation of 'cs' is changed to mean only modern 
> Czech. This would be a breaking change for the language 
> subtag registry, since the meaning of a subtag would be 
> narrowed, invalidating any tags that had a broad 
> application. This would be rather disturbing, since we are 
> guaranteeing stability. 

If "cs" were deemed to encompass historic Czech varieties as well as modern, then this would be a breaking change for users in general, and so would not be a good idea.

> 2b. The denotation of 'cs' remains "any" Czech, and to get 
> only modern Czech I would need to use (cs AND NOT ceu). Note 
> while OR can be handled with a list, as per RFC 4647, AND 
> NOT cannot. This, however, would not break stability. 

Of course, if "cs" encompasses all historic varieties, then just as an ID for Old Czech can be added, an ID for specifically Modern Czech can also be added.

> 2c. The denotation of 'cs' remains "any" Czech, ceu becomes 
> an extlang. This I see as the least unpleasant outcome, but 
> I can't tell if whether this would be the ISO policy.

I think if "cs" were deemed to encompass historic varieties, then effectively that makes it like a macrolanguage entity, even if each of the individual languages (modern, middle, etc.) is not currently coded. So then, if one of the historic varieties does get coded in 639, it would make complete sense to treat that as an extlang in 4646bis.
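
To make the extlang scenario concrete, here is a small sketch, again assuming RFC 4647 basic filtering. The subtag 'ceu' is the invented example from this thread, and the tag "cs-ceu" is a hypothetical extlang form, not a registered tag. The range "cs" picks up the extlang tag by prefix, while excluding Old Czech (the "AND NOT" case from 2b) needs a separate step, since a range list can only express OR.

```python
def basic_filter_match(language_range, tag):
    """RFC 4647-style basic filtering: exact or "-"-bounded prefix match."""
    r, t = language_range.lower(), tag.lower()
    return r == "*" or t == r or t.startswith(r + "-")


def select(tags, include, exclude=()):
    """Keep tags that match any include range and no exclude range.
    The exclude step is an add-on for this sketch; a plain range list
    has no way to express it."""
    return [
        t for t in tags
        if any(basic_filter_match(r, t) for r in include)
        and not any(basic_filter_match(r, t) for r in exclude)
    ]


# "cs-ceu" is hypothetical: Old Czech coded as an extlang under "cs".
TAGS = ["cs", "cs-CZ", "cs-ceu", "sk"]

any_czech = select(TAGS, include=["cs"])
not_old_czech = select(TAGS, include=["cs"], exclude=["cs-ceu"])

print(any_czech)
print(not_old_czech)
```

Under these assumptions, content tagged before the hypothetical 'ceu' existed is still found by the range "cs", which is the stability property that makes the extlang treatment attractive.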

Of course, the crucial question is whether "cs" (or "hi" or "ka" or ...) is intended to mean only the modern language or is intended to encompass modern and historic varieties. This is an open question that could be put to the JAC.

I'm somewhat inclined to say that all the coded entities in 639 should be understood to be the modern variety unless explicitly indicated otherwise. Given the vast majority of existing usage of IDs and the existing coding practice of distinguishing historic varieties from modern ones, that would maintain the greatest level of consistency. I think it would be confusing for users if the historic varieties were treated in different ways for different cases. Of course, there will probably be situations in which some user uses an ID for something that arguably is a distinct historic variety and should have a different ID (but no separate ID exists). But I think that when there is a significant need for an ID for a historic variety, users can request one, and the JAC is likely to grant it. I'm open to other ideas on this, though.

Peter Constable