RE: draft-phillips-langtags-08, process, specifications, "stability", and extensions

Thu Dec 30 22:37:35 CET 2004

> From: JFC (Jefsey) Morfin [mailto:jefsey at jefsey.com]

> >Of course it would not be clear if you don't have a conceptual model of
> >what "language" tags are identifiers *of*. When RFC 3066 was being
> >developed, there was a suggestion that script IDs be incorporated, but
> >some were reluctant, raising the same question you have here. I was one of
> >those. But I didn't remain obstructionist over the issue; instead, I gave
> >a fair amount of thought to the ontology that underlies "language" tags,
> >and subsequently published a white paper and presented on the topic at two
> >conferences in the spring and fall of 2002. (Paper is available online at
> >http://www.sil.org/silewp/abstract.asp?ref=2002-003 -- my thinking has
> >evolved since then, but some key results remain valid, I think.)
> 
> May us know which ones?

It would be easier to identify two key points on which my thinking has changed.

IIRC, I was uncertain at the time about what to do wrt sorting. I have since concluded that sort order is a presentation issue that, while linguistically related, is out of scope for language identifiers. (Note that there is no common usage scenario in which it makes sense to declare the sorted order of content.) Sort order may certainly be in scope for a locale identifier, but not for a "language" tag.

The bigger change is that I have abandoned the fourth main category in the ontological model I proposed. At the time, I was still trying to work out where something like "Latin America Spanish" fit in. I saw the similarity to sub-language varieties / dialects, but at the time thought it needed to be a distinct category, for which reason I concocted the notion "domain-specific data set". 

I was never very satisfied with that: it wasn't a particularly consistent model (a data set is quite a different kind of thing from a language variety) and it ignored the similarity with sub-language variety. (And the name was a bit unwieldy.) 

I have since realized that I was tripping up on the very problem that was blocking the Language Tag Reviewer from accepting the requested registration for "es-americas": the assumption that a language tag necessarily refers to a conventionally-recognized linguistic identity that exists in the world. Language tags are not attributes declared on language varieties; they are attributes declared on information objects, indicating linguistic properties of those information objects. And the linguistic attributes of an information object do not necessarily coincide with conventionally-recognized linguistic identities. Of course, in the majority of useful cases they will; but it's not hard to show that this is not always the case: e.g. if I present "chat" as an expression that could be interpreted in relation to several different languages, it would be entirely appropriate for me to declare a linguistic attribute of that expression of "indeterminate" since that is precisely my intent -- but clearly "indeterminate" doesn't correspond with any particular language identity out in the world.

Thus, I came to realize that the kind of distinction intended by "es-americas" was just the same kind of distinction made for any sub-language variety: it declares that the information object is not only in some particular language, but is even more constrained in terms of the language variety in use. It is simply coincidental that the more constrained usage in this case doesn't coincide with a single dialect used by some identifiable speaker community.

Peter Constable