Language attributes- what are they?

Tex Texin tex at xencraft.com
Fri Dec 31 01:40:20 CET 2004


Peter,

Thanks very much for this. I think you make a good stab at an answer and I
suspect it will trigger some debate and perhaps some additional work for you to
defend, so I appreciate your taking it on.

I need to consider it in more detail, but here are a few quick comments:

1) Sort order is not always immediately evident. It depends on the data in the
set, the amount of data, and which sort orders are being considered as likely.
If I don't know whether a list is sorted in traditional vs. modern Spanish
order, or even French vs. German, I have to identify particular data items
that sort differently and note where they are or aren't placed. The order is
usually deducible, but not what I would call self-evident in all cases.
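
To illustrate, here is a minimal Python sketch under stated assumptions. The
two alphabets are simplified stand-ins for modern and traditional Spanish
order (traditional order treats "ch" and "ll" as single letters filed after
"c" and "l"), accents are ignored, and the names MODERN, TRADITIONAL and
make_sort_key are hypothetical, not taken from any library. It suggests that
the two orders only disagree when the data contains a distinguishing item.

```python
# Simplified alphabets (assumption): accented vowels are not handled.
MODERN = list("abcdefghijklmnñopqrstuvwxyz")
TRADITIONAL = (list("abc") + ["ch"] + list("defghijkl") + ["ll"]
               + list("mnñopqrstuvwxyz"))


def make_sort_key(alphabet):
    """Build a sort key that ranks each letter (or digraph) by its position."""
    rank = {letter: i for i, letter in enumerate(alphabet)}

    def key(word):
        word = word.lower()
        ranks, i = [], 0
        while i < len(word):
            pair = word[i:i + 2]
            if len(pair) == 2 and pair in rank:
                # Digraph counted as one letter (traditional order only).
                ranks.append(rank[pair])
                i += 2
            else:
                # Unknown characters sort after every known letter.
                ranks.append(rank.get(word[i], len(alphabet)))
                i += 1
        return ranks

    return key


modern_key = make_sort_key(MODERN)
traditional_key = make_sort_key(TRADITIONAL)

distinctive = ["luz", "cuna", "llama", "chico", "loco", "dama"]
ambiguous = ["dama", "amigo", "beso"]

print(sorted(distinctive, key=modern_key))
print(sorted(distinctive, key=traditional_key))
# The second list has no "ch" or "ll" words, so both orders agree on it.
print(sorted(ambiguous, key=modern_key) == sorted(ambiguous, key=traditional_key))
```

With the second list, nothing in the sorted result reveals which order was
used; only items such as "chico" or "llama" give it away.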

2) Given my limited knowledge of linguistics, "writing system" seems broad and
covers a lot of ground. I agree it belongs on the list, but at least for me it
includes a number of considerations that perhaps should be broken down further.
I would have lumped date order, for example, in with writing system.

3) Your test for locale vs. language, based on whether an API is needed, seems
suspect to me.
If I am given a document with 3/4/5 and told it is English US, or even if I
assume the origin, then it tells me (generally) how I should read (parse) it to
understand it. I don't think it is just an API issue, any more than knowing
whether "chat" means talk (English) or cat (French).
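
A minimal Python sketch of that reading step, under stated assumptions: the
table FIELD_ORDER_BY_TAG and the function read_short_date are hypothetical,
the mapping from a tag to a field order is my illustration rather than
anything a language tag is defined to carry, and two-digit years are assumed
to fall in the 2000s.

```python
from datetime import date

# Hypothetical table: the field order a reader might assume for each tag.
FIELD_ORDER_BY_TAG = {
    "en-US": "MDY",
    "en-GB": "DMY",
    "ja-JP": "YMD",
}


def read_short_date(text, tag, century=2000):
    """Interpret a slash-separated short date using the order assumed for tag."""
    order = FIELD_ORDER_BY_TAG[tag]
    numbers = [int(part) for part in text.split("/")]
    if len(numbers) != 3:
        raise ValueError(f"expected three fields, got {text!r}")
    fields = dict(zip(order, numbers))
    return date(century + fields["Y"], fields["M"], fields["D"])


for tag in FIELD_ORDER_BY_TAG:
    print(tag, read_short_date("3/4/5", tag).isoformat())
```

The same three numbers yield three different dates, so whatever the reader
knows (or assumes) about the origin of the text changes what it means.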

I'll give it some more thought. I would be glad for others' comments.
tex


Peter Constable wrote:
> 
> > From: Tex Texin [mailto:tex at xencraft.com]
> 
> > I am surprised that sort order would not be considered part of language...
> 
> Obviously sort order is something related to language. But I maintain
> that distinguishing sort orders is out of scope for language tags. The
> reason is that the purpose of a language tag is to declare linguistic
> properties of information objects, and there are no common usage
> scenarios in which we need to declare a sort order.
> 
> Consider this message. It (and lots more like it) doesn't have any
> sorted content, so there is no need to declare a sort order. Even if I
> were to insert a sorted list (e.g. able < baker < charlie < ... ), there
> wouldn't be any reason to declare that "any sorted lists contained
> herein are sorted in the following way" since the way they are sorted is
> self-evident.
> 
> The only situations I can think of in which a declaration of sort order
> would be a useful piece of metadata on an information object are:
> 
> - A static document comprising a sorted list. E.g. suppose we are
> serving up a Munich telephone directory, are offering two different sort
> orders to users, and maintain the directory in the two orders as two
> static files rather than a database. In that case, it would be
> appropriate to declare the sorted order of each file. But that's a
> pretty exceptional scenario, and I see no reason for us to encumber
> language tags with that.
> 
> - A resource used for collation of data.
> 
> In the latter case, the appropriate tagging is not with a "language" ID
> but with a "locale" ID. Here it's important to understand the
> distinction between "language" IDs and "locale" IDs. Whereas a
> "language" ID is primarily a metadata element used to declare properties
> of content, a "locale" ID is a parameter passed through an API to
> configure the operational mode of some process for various
> culture-dependent variables (some linguistic, some not). Most of the
> time when we're stating a sort order, it's in an API, specifying how
> strings should be compared; and just as it makes sense to label a
> resource controlling various processes like how date, number and time
> strings are formatted using a locale ID, so also it makes sense to label
> a resource used for collation with a locale ID.
> 
> So, I have come to believe that sorting should relate to identifiers on
> the basis of common usage scenarios, and that points to including
> sorting distinctions in locale IDs that are passed through APIs but not
> in language tags declaring properties of content.
> 
> > I have also been wanting to ask about date formats. I would have thought
> > that date formats are more a locale issue, but some people have insisted
> > that language determines date format, in particular ordering of year,
> > month and day, and that it is not a function of locale. Personally, I
> > wouldn't think dates are linguistic, since some languages use more than
> > one format (Japan, and Japanese for example).
> 
> Well, clearly dates can include a linguistic aspect given that date
> strings may include spelled-out names for months or days of the week.
> But again I think in terms of usage scenarios and what the function of
> "language" IDs is versus that of "locale" IDs. I don't think there are
> usage scenarios in which it would make sense to declare the date format
> of a document. Date formats specify how automated processes convert a
> date/time data value in some machine representation into a string. Once
> it's in a document as a string, then it simply is what it is, and no
> matter what the document's metadata may say about it, the only way to be
> certain how to interpret something like "3/4/05" is to determine how it
> got generated. We don't declare date formats of content; what we *do* do
> is pass a parameter into an API that controls how a numeric date/time
> value will be converted into a string. Again, this is relevant for a
> locale ID (the thing we pass through APIs to control the operational
> mode of some process), not a "language" (the metadata element we use to
> declare linguistic properties of content).
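
A minimal Python sketch of the formatting direction described above, with
assumptions labeled: SHORT_DATE_FIELDS and format_short_date are hypothetical
names, and the two field orders are illustrative rather than taken from any
real locale database. The locale ID is only a parameter to the formatter;
once the string exists, it no longer records which ID produced it.

```python
from datetime import date

# Hypothetical locale data: field order for the short date format.
SHORT_DATE_FIELDS = {
    "en_US": ("month", "day", "year"),
    "en_GB": ("day", "month", "year"),
}


def format_short_date(value, locale_id):
    """Convert a date value to a short date string for the given locale ID."""
    fields = {
        "month": str(value.month),
        "day": str(value.day),
        "year": f"{value.year % 100:02d}",
    }
    return "/".join(fields[name] for name in SHORT_DATE_FIELDS[locale_id])


us_string = format_short_date(date(2005, 3, 4), "en_US")
gb_string = format_short_date(date(2005, 4, 3), "en_GB")
# Two different date values, two different locale IDs, one identical string.
print(us_string, gb_string, us_string == gb_string)
```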
> 
> Now, it may be that the information provided by a language ID is enough
> to indicate a specific culture and, thus, a particular date format. For
> instance, "en-US" as a language tag is intended to indicate that content
> is in English using American vocabulary and spelling, but it may also be
> adequate for use as a locale parameter in an API to obtain mm/d/yy date
> formatting. There may be many cases in which the qualifiers needed to
> distinguish a locale are the same as those that would be used to declare
> linguistic properties of content. I consider that coincidental, however:
> there is still a conceptual distinction between a "language" ID and a
> "locale" ID.
> 
> > Is it possible to identify and list which attributes are appropriate
> > considerations for association by language, and perhaps some of the ones
> > that might be mistaken for attributes but are not. Perhaps such a list
> > will help establish the appropriateness of which tag to use.
> 
> For language tags, I'd say that the things we generally want to declare
> as linguistic properties of content may include:
> 
> - language
> - sub-language variety
> - writing system
> - spelling conventions
> 
> One borderline item is typographic conventions, but it's my impression
> that the above items are generally sufficient to determine these. (This
> is for things like e.g. that Serbian Cyrillic uses different glyphs from
> Russian for the italic of certain letters, not e.g. that the content is
> formatted using a sans serif face with historic ligatures.)
> 
> Some things we specifically should not declare within a language tag
> (IMO) are number formats, date/time formats, default currency symbols,
> calendrical systems, sort orders and character encodings. (Of course,
> there are many other things.)
> 
> For locales, an identifier needs to uniquely identify, and in principle
> a pair of locales may differ in any one parameter. That means that any
> of the above items may need to be part of a locale ID. E.g. consider the
> default English/US locale settings as a point of reference, and let me
> use "en_US" to refer to that; any of the following are possible distinct
> locales:
> 
> en_US but with "," as the decimal separator and no other separator
> en_US but with 24-hour time
> en_US but with times formatted as "hhmmss" (no delimiters)
> en_US but with "yyyy-mm-dd" as a short date format
> en_US but with the Hebrew calendrical system
> en_US but with the euro symbol as the default currency symbol
> en_US but with A4 as a default paper size
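
One way to picture this list is as a record of settings in which each variant
overrides part of a baseline. The Python sketch below is illustrative only:
LocaleSettings, its field names and its default values are assumptions, not
the contents of any real locale definition.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class LocaleSettings:
    # Baseline values are assumed for illustration, not authoritative data.
    decimal_separator: str = "."
    grouping_separator: str = ","
    time_pattern: str = "h:mm:ss (12-hour)"
    short_date_pattern: str = "m/d/yy"
    calendar: str = "gregorian"
    currency_symbol: str = "$"
    paper_size: str = "Letter"


EN_US = LocaleSettings()

# Each variant from the list above, expressed as overrides of the baseline.
VARIANTS = {
    "comma decimal, no grouping": replace(
        EN_US, decimal_separator=",", grouping_separator=""),
    "24-hour time": replace(EN_US, time_pattern="hh:mm:ss (24-hour)"),
    "undelimited time": replace(EN_US, time_pattern="hhmmss"),
    "ISO-style short date": replace(EN_US, short_date_pattern="yyyy-mm-dd"),
    "Hebrew calendar": replace(EN_US, calendar="hebrew"),
    "euro currency": replace(EN_US, currency_symbol="\u20ac"),
    "A4 paper": replace(EN_US, paper_size="A4"),
}


def differing_fields(a, b):
    """Return the names of the settings on which two locales disagree."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


for name, settings in VARIANTS.items():
    print(name, differing_fields(EN_US, settings))
```

Every variant compares unequal to the baseline and to every other variant,
so an identifier scheme would need a way to tell all of them apart.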
> 
> Of course, the open question is what the limits of "locale" are, a
> question I know you're well acquainted with (having discussed it
> together a few years ago on the locales list -- for others: I gave paper
> size as an example that pushes the limits, but there are no established
> conventional boundaries between familiar things like date formats and
> oddities such as systems for measuring shoe sizes or even any kind of
> user preference).
> 
> Peter Constable
> _______________________________________________
> Ietf-languages mailing list
> Ietf-languages at alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/ietf-languages

-- 
-------------------------------------------------------------
Tex Texin   cell: +1 781 789 1898   mailto:Tex at XenCraft.com
Xen Master                          http://www.i18nGuy.com
                         
XenCraft		            http://www.XenCraft.com
Making e-Business Work Around the World
-------------------------------------------------------------


