Language attributes- what are they?

Fri Dec 31 16:04:59 CET 2004

> From: Tex Texin [mailto:tex at xencraft.com]

> I need to consider it in more detail, but a couple of quick comments-
> 
> 1) Sorting is not always immediately evident, it depends on the data in the
> set, the amount of data, and which sort orders are being considered as likely.
> If I don't know if a list is sorted traditional vs modern spanish, or even
> French vs. German, I have to identify particular data items that are distinct
> and note where they are or aren't placed. It is usually deducible, but not what
> I would call self-evident in all cases.

Let me ask a couple of questions. Suppose I present a sorted list of hypothetical words in this email:

aab
aba
baa

There's not enough there for you to deduce whether I sorted this following English or French or German or Danish or some other order. So what? Why do you need to deduce one but not another of these? And how would you *use* that?
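
A minimal sketch of the point, in Python. The two sort keys are hypothetical simplifications written only for illustration (real collation, e.g. the Unicode Collation Algorithm, is far more involved): one is plain letter-by-letter order, the other treats "ch" and "ll" as single letters in the manner of traditional Spanish. The three-word list comes out the same under both, and only a probe list containing the distinguishing digraphs tells them apart.

```python
def modern_key(word):
    """Hypothetical key: plain letter-by-letter order."""
    return [ord(c) for c in word.lower()]


def traditional_spanish_key(word):
    """Hypothetical key: 'ch' and 'll' count as single letters
    that sort after 'c' and 'l' respectively."""
    w = word.lower()
    key = []
    i = 0
    while i < len(w):
        pair = w[i:i + 2]
        if pair == "ch":
            key.append(ord("c") + 0.5)  # between 'c' and 'd'
            i += 2
        elif pair == "ll":
            key.append(ord("l") + 0.5)  # between 'l' and 'm'
            i += 2
        else:
            key.append(ord(w[i]))
            i += 1
    return key


# The list from the message: both orders agree, so nothing can be deduced.
words = ["baa", "aab", "aba"]
same_order = (sorted(words, key=modern_key)
              == sorted(words, key=traditional_spanish_key))

# A probe list with distinguishing items: the two orders now differ.
probe = ["cosa", "chico", "luz", "llama"]
modern = sorted(probe, key=modern_key)
traditional = sorted(probe, key=traditional_spanish_key)
```

Here `same_order` is true, while `modern` and `traditional` differ in where "chico" and "llama" land. A reader who needs one particular order would simply re-sort with the key of their choice.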

My contention is that, once the document is created, it doesn't matter to you the reader whether I used this or that order since that's not going to affect what you see or how you interpret it. And I don't really expect that you're going to run some parser on my document that depends on knowing the order -- if you need the list to be in a particular order, you're going to order it yourself.

I assume here that you simply received my document but had no control over its creation. Of course, there can be a scenario in which you are requesting data from my server and want it to appear in a particular order. That's clearly an API scenario in which a locale ID would be passed -- you may also be wanting times to be formatted in a particular way, for instance.

> 2) For my limited knowledge of linguistics, "writing system" is broad and
> covers a lot of ground. I agree it belongs on the list but at least for me it
> includes a number of considerations that perhaps should be broken down further.
> I would have lumped date order for example in with writing system.

By date order, do you mean date format? I definitely would not lump those. There's a very big difference between the common English writing system using Latin script, which is used everywhere English is written, and date formats used by English speakers.

As I view "writing system", most distinctions can be made by specifying a script in addition to identifying the language. The exceptions are distinct ways of writing using the same script, such as English in phonetic transcription versus the common orthography, or the common English orthography versus a hypothetical alternative used after the revolution comes in which there are many systematic differences such that, say, "teeth" and "tithe" are instead written "tiþ" and "taið" (with changes involving different characters rather than simply different spellings).

Where the boundary on "writing system" is uncertain for me would be with things like conventions for hyphenation or use of quotation marks. But date formats are IMO well outside the boundary. I sometimes write "12/31" and sometimes "31/12", but I'd say I don't ever change my writing system. I think a "writing system" is something that is generally very stable for a given individual, and even for speaker communities (except when undergoing a transition).

> 3) Your test for locale vs language involving whether an API is needed seems
> suspect to me.
> If I am given a document with 3/4/5 and told it is English US, or even if I
> assume the origin, then it tells me (generally) how I should read (parse) it to
> understand it. I don't think it is just an api issue, anymore than knowing
> whether "chat" is talk or cat.

I think the difference between a metadata element declaring properties of content and a parameter used to configure the operational mode of a process wrt cultural conventions is quite significant, and there are lots of things that may be relevant for the latter that are not for the former. But the question of whether date format is in scope for the former is not an unreasonable question to ask.

First off, suppose the document I send you (e.g. this email) contains date strings. I think we can agree that you really don't care whether they were generated manually or by a software process; all you care about is how to interpret them.

Now, manually, I can use all kinds of formats, and either through carelessness or by design mix them in the same document. If I'm discussing someone's birthday and write "1/4/05" here but "4/1/05" there, in principle I could tag the different runs of text with some metadata indicating how the date string should be interpreted, but what author is ever going to do that? And what reader is ever going to check that metadata? I might also on another line write "Janu. 4 '05" -- are we going to create a registry for every ad hoc format that might get used?
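
To make the ambiguity concrete, here is a small Python sketch. The function name `interpretations` and the choice of just two candidate conventions are assumptions made for illustration; it uses only the standard `datetime.strptime`, which reads a two-digit "05" as 2005.

```python
from datetime import datetime


def interpretations(text):
    """Return every date `text` could denote under two common conventions.

    Hypothetical helper: only month/day/year and day/month/year are tried.
    """
    candidates = {
        "month/day/year": "%m/%d/%y",
        "day/month/year": "%d/%m/%y",
    }
    found = {}
    for name, fmt in candidates.items():
        try:
            found[name] = datetime.strptime(text, fmt).date()
        except ValueError:
            # The string is not a valid date under this convention.
            pass
    return found


ambiguous = interpretations("1/4/05")      # two different dates
unambiguous = interpretations("12/31/04")  # only one reading is valid
```

"1/4/05" yields both 4 January and 1 April, so nothing in the string itself settles the matter, while "12/31/04" has only one valid reading. An ad hoc string such as "Janu. 4 '05" matches neither pattern and comes back empty.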

And are you really going to run a parser on the stuff I enter into a document manually? I doubt that's a common scenario. On the other hand, a very common scenario would be that you're requesting data from my server that you intend to parse, and you either need date strings to be in a particular format or you want to be told what the format is. That's an API: your process interacting with my process.
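
The process-to-process case might be sketched as follows. Both functions and the shape of the response are hypothetical, not any real protocol: the caller passes the format it wants as a parameter, and the response states which format was used, so the receiving process never has to guess.

```python
from datetime import date, datetime


def serve_date(value, date_format="%Y-%m-%d"):
    """Hypothetical server side: format the date as requested
    and declare the format alongside the string."""
    return {"date": value.strftime(date_format), "format": date_format}


def read_date(response):
    """Hypothetical client side: parse using the declared format."""
    return datetime.strptime(response["date"], response["format"]).date()


original = date(2005, 1, 4)
formats = ["%m/%d/%y", "%d/%m/%y", "%Y-%m-%d"]

# Whatever format is requested, the date survives the round trip,
# because the format travels with the request and the response.
round_trips = [read_date(serve_date(original, fmt)) for fmt in formats]
```

The format here is a parameter controlling how a process behaves, which is a different thing from a tag declaring a property of static content.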

Even if there were some scenario in which it made sense to have metadata declaring date formats on static content, I would say that that is a special case for which distinct metadata elements should be used; I would not encumber general-purpose language tags with date-format distinctions.

I'm sticking to my position: on the one hand, we need metadata elements to declare linguistic attributes of static content, and on the other we need to control the operational mode of processes wrt a variety of culture-related (and possibly user-preference) variables; and the things that are relevant for the former, in common scenarios, are much narrower than those relevant for the latter, along the lines I gave in my earlier message.

Peter Constable