Here's what I have to say about that?

Mon May 26 10:55:00 CEST 2003

Jon Hanna wrote on 05/26/2003 06:57:33 AM:

> 1. The ability to concisely encode locale information in an
> architecture-neutral and somewhat human-readable way is needed by
> many applications.
> 2. RFC 3066 fulfils some of the requirements of such a mechanism,
> but not all of them.
>
> I think we have consensus on those two points.

Indeed, but the current suggestions are not an attempt to use RFC 3066 to
encode everything traditionally associated with "locale". It's an attempt
to encode things that are properly associated with "language". We need to
identify not only languages as they exist out in the wild, but also
*expression* of languages within electronic data. Expression of a given
language can have variant forms, and those variations are centrally
important in the ability to process data. For that reason, it is sensible
to group together the identity of both the language and which of its
various modes of expression is used.

> Where we don't have
> consensus is on how to proceed.

As Martin Duerst indicated, we're close enough.

> The majority opinion seems to favour altering the use of 3066 (some
> point to registrations like de-1901 as evidence that this isn't
> really altering 3066 at all).

I'd go further: sanctioned use of tags like "en-US" vs. "en-GB", where the
most significant difference is one of spelling, establishes that this isn't
altering the intent of 3066 at all.

> This majority is split on the details of this, in particular with
> respect to "default" scripts.

That is the only point on which there seems to be a significant difference
of opinion. For all but one of Mark's requests, it is not an issue.

> The minority (which I *think* seems to be myself with Michael being
> sympathetic to my position but not in complete agreement either) do
> not think 3066 can scale to this requirement without *considerable*
> alteration. To me the "default script" issue seems the result of
> pushing something further than it can naturally go

It is the result of trying to think of how best to extend the system in a
way that is consistent with the intent of current practice. For instance,
the suggestion of the notion of a default script was motivated by wondering
if the *vast* majority if not all current uses of "en" for textual data
imply not only the English language but also "as written with the common
Latin-based English alphabet". As a proposal, it may turn out there are
very good reasons why we should not adopt this notion of some language IDs
implying a default written form, but that's a separate issue to the
question of whether a single tag can be used to indicate both a language
and the particular written expression of that language.

> I maintain that script is orthogonal to language.

Language and country are just as much if not more orthogonal, but we
regularly combine those in a single tag.

> What's more I
> maintain that it is orthogonal *as a practical matter*. The
> necessity of discovering, expressing and storing script and language
> is orthogonal, with many applications only caring about one of
> those, and very different mechanisms can allow us to determine
> and/or guess at language and at script, making explicit
> identification of little importance in some cases.

Here, there are very good counterarguments. For current implementations,
the need to express or discover language only, apart from variation in
written expression, is limited since most IT implementations are processing
textual language data. Also, just as different mechanisms can identify
language alone vs. script alone on inspecting a sequence <body
xml:lang="az" xml:script="latn">, so also different mechanisms can make the
same determinations from a sequence <body xml:lang="az-latn"> (or, for that
matter, from a sequence <body xml:lang="en"> if it is decided to assume a
set of documented defaults).
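
To illustrate the point, here is a minimal Python sketch of both
approaches. Everything in it is an assumption made for illustration: the
function names, the DEFAULT_SCRIPT table, and the rule that a four-letter
alphabetic subtag is a script code are not taken from RFC 3066 or from
any existing implementation.

```python
# Hypothetical table of documented defaults; illustrative only.
DEFAULT_SCRIPT = {"en": "latn"}


def from_separate(lang, script=None):
    """Language and script carried in two attributes (xml:lang, xml:script)."""
    lang = lang.lower()
    if script:
        return lang, script.lower()
    return lang, DEFAULT_SCRIPT.get(lang)


def from_single(tag):
    """Language and script carried in one tag such as 'az-latn'."""
    subtags = tag.lower().split("-")
    lang = subtags[0]
    # Assumption: a four-letter alphabetic subtag after the first is a script.
    script = next(
        (s for s in subtags[1:] if len(s) == 4 and s.isalpha()), None
    )
    return lang, script or DEFAULT_SCRIPT.get(lang)
```

Either function yields the same (language, script) pair for the same
information, so a consumer that cares about only one of the two values
can simply ignore the other.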

Moreover, it is very much a practical matter that existing implementations
are designed to process a single tag related to language and the form of
its written expression. I agree that it would be possible to create
distinct metadata categories for language and the form of written
expression thereof, but that would require creating and deploying revised
versions of many implementations. It is not clear that any potential gains
from having distinct metadata categories would outweigh the cost of
development and the risk that deployment of new systems would not be
widespread, not to mention the problems that would arise in dealing with
large volumes of existing data and software implementations that do not use
separate metadata categories but, generally, already have the identity of
written expression implied or explicitly indicated in the single existing
metadata category.

> I want the three things to be trivial:
>
> 1. Comparing languages,
> 2. Comparing scripts,
> 3. Comparing language and script combinations.
>
> I see these three as things people are going to need. Is there any
> dispute about this?

The need for the third is far greater than the need for the first two. We
need to have systems that make the common things easy, and other things
possible. The way in which we're expanding on already-existing practice for
RFC 3066 does that, and at far less cost than would be involved in creating
new systems using two distinct metadata categories.

> I further foresee a need for other information about stuff that goes
> hand-in-hand with the concept of "locale" (a problematic word, but
> I'll forgo spending 10 paragraphs debating what it means, and for
> now define it as "a representation of conventions used when
> rendering data for human consumption, when or parsing human-readable
> documents").

Which has been discussed here and elsewhere, and I think there is complete
agreement that there are many things of this sort that we do *not* want RFC
3066 extended to accommodate.

> Someone mentioned the use of 3066 to infer human readable date
> formats, currencies and currency symbol usage, etc. I think we
> generally agreed that this wasn't a terribly good idea, but we have
> to accept that in the absence of other mechanisms people are going
> to do exactly that, just like people are now using 3066 to infer
> script information (though at least script is more tightly bound to
> language).

Inferring it is one thing; extending the RFC to overtly indicate such
things is quite another. Some people are going to make inferences of this
sort simply upon encountering characters from particular blocks of Unicode,
or domain names ending with certain country IDs. None of these assumptions
are safe in general.

> My back-of-an-envelope strawman is to define a new locale specifier
> in which the locale of this email is "en-IE.latn".

> 2. While not backward-compatible with 3066, 3066 is forwards
> compatible with it.

I gather you're suggesting that the first portion in your proposed tag up to
"." is taken from RFC3066. I'm curious: what is the practical difference
between "en-IE" and "en" -- what should my software do differently?

I have the following problem with this proposal: you are suggesting a
system that uses an extended RFC 3066 to cover written expression, but
also other things related to locale, things unrelated to identity of a
language or its written expression. There are many protocols that reference
RFC 3066 for usage scenarios in which identification of written form of a
language (as well as identification of the language) is both appropriate
and needed, but in which identification of other "locale"-related
parameters is not. Using RFC 3066 to indicate both language and written
form of language does that with little or no additional effort, while your
proposal does not.

You might suggest, then, three systems: RFC 3066, the system for complete
"locale" specification, and an intermediate one for the combination of
language & script that would use tags such as "az-AZ.latn"; or perhaps that
you're proposing only this latter kind of system, and not one for complete
"locale" specification. At this point, I would make a few observations:

- A tag such as "az-AZ.latn" is not all that different from "az-AZ-latn".

- The mechanisms you wanted for easily determining language, script, and
language&script that would work on a tag such as "en-IE.latn" are only
trivially different from mechanisms that would make such determination from
tags such as "en-IE-latn" or "en-latn-IE".

- One of Peter Edberg's suggestions for extending RFC 3066 did include a
distinct delimiter in the syntax to separate the purely-language portion
from the written-expression portion, and it seems clear that such
conventions could be handled as an extension to RFC 3066 rather than
creating a new, distinct system (and, for reasons I gave above, should be).

- Some have argued here that the sequencing of elements within a tag should
put script before country, i.e. "en-latn-IE" (or if you contend the need
for distinct delimiters, "en.latn.IE" or some such) since differences in
orthographic system / script are far more important than are differences in
spelling conventions or vocabulary and other dialect distinctions.
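
As a rough sketch of how small the difference between these syntaxes is
for software, the following Python splits a tag on either delimiter and
classifies subtags by shape. The function names and the classification
rule (four letters means script, two letters means country) are
assumptions made for illustration, not anything defined by RFC 3066.

```python
import re


def parse_tag(tag):
    """Return (language, country, script) for any of the syntaxes above."""
    subtags = re.split(r"[-.]", tag.lower())
    language, country, script = subtags[0], None, None
    for s in subtags[1:]:
        # Assumption: subtag length alone distinguishes script from country.
        if len(s) == 4 and s.isalpha() and script is None:
            script = s
        elif len(s) == 2 and s.isalpha() and country is None:
            country = s
    return language, country, script


def same_language(a, b):
    return parse_tag(a)[0] == parse_tag(b)[0]


def same_script(a, b):
    return parse_tag(a)[2] == parse_tag(b)[2]


def same_language_and_script(a, b):
    return same_language(a, b) and same_script(a, b)
```

Under these assumptions "en-IE.latn", "en-IE-latn" and "en-latn-IE" all
parse to the same three components, and the three comparisons (language,
script, language and script together) are each a one-line function.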

> 4. Allows for the separation of responsibility - the management of
> language tags would not necessarily be done by the same people as
> the management of script tags, or any other features added by the
> extensibility mechanism provided. This is of importance both as a
> matter of scalability and also because some people might simply only
> find some of those matters to be interesting.

On the contrary, I think we'd probably find largely the same group of
people involved in both.

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485