Language attributes- what are they?

Fri Dec 31 01:17:36 CET 2004

> From: Tex Texin [mailto:tex at xencraft.com]

> I am surprised that sort order would not be considered part of
> language...

Obviously sort order is something related to language. But I maintain
that distinguishing sort orders is out of scope for language tags. The
reason is that the purpose of a language tag is to declare linguistic
properties of information objects, and there are no common usage
scenarios in which we need to declare a sort order.

Consider this message. It (and lots more like it) doesn't have any
sorted content, so there is no need to declare a sort order. Even if I
were to insert a sorted list (e.g. able < baker < charlie < ... ), there
wouldn't be any reason to declare that "any sorted lists contained
herein are sorted in the following way" since the way they are sorted is
self-evident.

The only situations I can think of in which a declaration of sort order
would be a useful piece of metadata on an information object are:

- A static document comprising a sorted list. E.g. suppose we are
serving up a Munich telephone directory, are offering two different sort
orders to users, and maintain the directory in the two orders as two
static files rather than a database. In that case, it would be
appropriate to declare the sorted order of each file. But that's a
pretty exceptional scenario, and I see no reason for us to encumber
language tags with that.

- A resource used for collation of data.

In the latter case, the appropriate tagging is not with a "language" ID
but with a "locale" ID. Here it's important to understand the
distinction between "language" IDs and "locale" IDs. Whereas a
"language" ID is primarily a metadata element used to declare properties
of content, a "locale" ID is a parameter passed through an API to
configure the operational mode of some process for various
culture-dependent variables (some linguistic, some not). Most of the
time when we're stating a sort order, it's in an API, specifying how
strings should be compared; and just as it makes sense to label a
resource controlling various processes like how date, number and time
strings are formatted using a locale ID, so also it makes sense to label
a resource used for collation with a locale ID.

So, I have come to believe that sorting should relate to identifiers on
the basis of common usage scenarios, and that points to including
sorting distinctions in locale IDs that are passed through APIs but not
in language tags declaring properties of content.
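
To make the API side of this concrete, here is a minimal Python sketch
of a locale ID being passed as a parameter to select a collation. It is
an illustration under stated assumptions, not a real collation
implementation: the locale ID strings, the two-rule umlaut tables, and
the names sort_key and sort_names are all hypothetical, and a real
system would consult full collation data instead.

```python
# Toy collation rules keyed by locale ID. Both the IDs and the rules are
# illustrative assumptions; real collation tables are far more detailed.
DICTIONARY = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})
PHONEBOOK = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})

COLLATION_RULES = {
    "de_DE": DICTIONARY,
    "de_DE@collation=phonebook": PHONEBOOK,
}


def sort_key(text, locale_id):
    """Return a comparison key for text under the given locale ID."""
    return text.lower().translate(COLLATION_RULES[locale_id])


def sort_names(names, locale_id):
    """Sort names; the locale ID configures how strings are compared."""
    return sorted(names, key=lambda name: sort_key(name, locale_id))


names = ["Mutter", "Mufti", "Müller"]
dictionary_order = sort_names(names, "de_DE")
phonebook_order = sort_names(names, "de_DE@collation=phonebook")
```

The same three strings come back in two different orders depending only
on the parameter handed to the API. Nothing about the strings
themselves, or any language tag on them, had to change.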

> I have also been wanting to ask about date formats. I would have
> thought that date formats are more a locale issue, but some people
> have insisted that language determines date format, in particular
> ordering of year, month and day, and that it is not a function of
> locale. Personally, I wouldn't think dates are linguistic, since some
> languages use more than one format (Japan, and Japanese for example).

Well, clearly dates can include a linguistic aspect given that date
strings may include spelled-out names for months or days of the week.
But again I think in terms of usage scenarios and what the function of
"language" IDs is versus that of "locale" IDs. I don't think there are
usage scenarios in which it would make sense to declare the date format
of a document. Date formats specify how automated processes convert a
date/time data value in some machine representation into a string. Once
it's in a document as a string, then it simply is what it is, and no
matter what the document's metadata may say about it, the only way to be
certain how to interpret something like "3/4/05" is to determine how it
got generated. We don't declare date formats of content; what we *do* do
is pass a parameter into an API that controls how a numeric date/time
value will be converted into a string. Again, this is relevant for a
"locale" ID (the thing we pass through APIs to control the operational
mode of some process), not a "language" ID (the metadata element we use
to declare linguistic properties of content).
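
The "3/4/05" point can be sketched in a few lines of Python. This is
only an illustration: the per-locale field orders below are assumed for
the example, and format_short_date is a hypothetical helper, not any
real library's API.

```python
from datetime import date

# Assumed short-date field orders per locale ID (illustrative only).
SHORT_DATE_ORDER = {
    "en_US": ("month", "day", "year"),
    "en_GB": ("day", "month", "year"),
}


def format_short_date(value, locale_id):
    """Convert a date value to a string; the locale ID picks the order."""
    fields = {
        "month": str(value.month),
        "day": str(value.day),
        "year": f"{value.year % 100:02d}",
    }
    return "/".join(fields[name] for name in SHORT_DATE_ORDER[locale_id])


us_string = format_short_date(date(2005, 3, 4), "en_US")  # 4 March 2005
gb_string = format_short_date(date(2005, 4, 3), "en_GB")  # 3 April 2005
```

Two different date values produce the identical string "3/4/05". Once
the string is in a document, only knowledge of the parameter that was
passed at generation time tells you which date was meant.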

Now, it may be that the information provided by a language ID is enough
to indicate a specific culture and, thus, a particular date format. For
instance, "en-US" as a language tag is intended to indicate that content
is in English using American vocabulary and spelling, but it may also be
adequate for use as a locale parameter in an API to obtain m/d/yy date
formatting. There may be many cases in which the qualifiers needed to
distinguish a locale are the same as those that would be used to declare
linguistic properties of content. I consider that coincidental, however:
there is still a conceptual distinction between a "language" ID and a
"locale" ID.

> Is it possible to identify and list which attributes are appropriate
> considerations for association by language, and perhaps some of the
> ones that might be mistaken for attributes but are not. Perhaps such a
> list will help establish the appropriateness of which tag to use.

For language tags, I'd say that the things we generally want to declare
as linguistic properties of content may include:

- language
- sub-language variety
- writing system
- spelling conventions

One borderline item is typographic conventions, but it's my impression
that the above items are generally sufficient to determine these. (By
typographic conventions I mean things like Serbian Cyrillic using
different glyphs from Russian for the italic of certain letters, not
things like the content being formatted in a sans serif face with
historic ligatures.)

Some things we specifically should not declare within a language tag
(IMO) are number formats, date/time formats, default currency symbols,
calendrical systems, sort orders and character encodings. (Of course,
there are many other things.)

For locales, an identifier needs to uniquely identify, and in principle
a pair of locales may differ in any one parameter. That means that any
of the above items may need to be part of a locale ID. E.g. consider the
default English/US locale settings as a point of reference, and let me
use "en_US" to refer to that; any of the following are possible distinct
locales:

en_US but with "," as the decimal separator and no other separator
en_US but with 24-hour time
en_US but with times formatted as "hhmmss" (no delimiters)
en_US but with "yyyy-mm-dd" as a short date format
en_US but with the Hebrew calendrical system
en_US but with the euro symbol as the default currency symbol
en_US but with A4 as a default paper size
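
One way to picture the list above is as a base record with single-field
overrides. The following Python sketch is illustrative only: the
LocaleSettings class, its field names, and its default values are
assumptions made for this example, not a description of any actual
locale data format.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LocaleSettings:
    """Hypothetical bundle of locale parameters; defaults stand for en_US."""
    decimal_separator: str = "."
    grouping_separator: str = ","
    time_24_hour: bool = False
    short_date_pattern: str = "m/d/yy"
    calendar: str = "gregorian"
    currency_symbol: str = "$"
    paper_size: str = "Letter"


EN_US = LocaleSettings()

# Each variant differs from the base in exactly one parameter.
en_us_24h = replace(EN_US, time_24_hour=True)
en_us_iso_date = replace(EN_US, short_date_pattern="yyyy-mm-dd")
en_us_a4 = replace(EN_US, paper_size="A4")

# Frozen dataclasses compare and hash by value, so a set counts how many
# distinct locales there are.
distinct_locales = {EN_US, en_us_24h, en_us_iso_date, en_us_a4}
```

Since any single field can differ, an identifier that must distinguish
all of these needs some way to express each parameter.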

Of course, the open question is what the limits of "locale" are, a
question I know you're well acquainted with, having discussed it
together a few years ago on the locales list. (For others: I gave paper
size as an example that pushes the limits, but there are no established
conventional boundaries between familiar things like date formats and
oddities such as systems for measuring shoe sizes or even any kind of
user preference.)

Peter Constable