Question on ISO-639:1988

Lee Gillam l.gillam at eim.surrey.ac.uk
Fri Jun 4 12:29:41 CEST 2004


Peter,

The main thrust of your comments here seems to be:
	1. Show where and how Linguasphere is used, and what its benefits are.
	2. Lack of confidence in proposing Linguasphere to IETF, due to its size
and the lack of a satisfactory answer to 1.

Identifying these main points will certainly help focus the necessary efforts
in appropriate directions, and thank you for taking the time to make them.
I'm relatively new to this language coding business, so it will take me quite
some time to catch up.

> there have been comments made on several occasions that (i) would seem to 
> suggest the BSI project *will* become an ISO standard, which I consider
> inappropriate to suggest when it is not even yet at the stage of NWIP,
> and 

I think enthusiasm has perhaps been a motivating factor here, and having
only an initial exposure to the standardisation process may lead to wrong
assumptions. I note, however, that this has been acknowledged to some
extent, and hopefully we can move on from it.

> (ii) suggest that it is the only long-term solution to clear needs
> when no needs analysis has been presented.

Out of interest, is there a document of the needs analysis for 639-3
readily available? Perhaps this would be a good starting point for
producing something of a similar/related nature. Google gave me nothing
for:
"needs analysis" ethnologue ISO
"needs analysis" ethnologue IETF

> - Is RFC 3066 (or a proposed successor) adequate to cover information
> organization for linguistic anthropology?

A slightly contrived example might be: I have a recorded collection of
Middle Chulym speech, which for the sake of argument has the tag "myluhc".
On the one hand, I wish to provide a description of it in English. On the
other hand, I'd like to provide a description in French, but perhaps I'm
not very good at French (true), so I'll leave a placeholder that might be
treated as a comment. Now, I don't like providing XML-type examples for
human readability, but suppose that for interchange I've created some
format containing a fragment like:

<resource speech_lang_tag="myluhc">
	<store lang="en">
		<description xml:lang="en">Lots of words captured....</description>
	</store>
	<store lang="fr">
		<description xml:lang="en">Sorry, don't know French well</description>
	</store>
</resource>

As I said, contrived, but perhaps this can help us propose an answer to
the question.

> - If not, *should* it be?

I hope the answer can come from the example above, guided by the
discussion below.

> To use an absurd illustration, we could extend RFC 3066 so that it was
> adequate for purposes of complete subject cataloguing in public
> libraries, encompassing all the categories of the Dewey Decimal System,
> but the possibility doesn't mean it's a good idea. We need to evaluate
> what are the kinds of properties needing to be attributed to information
> objects for what purposes, and what kinds of costs are involved
> *overall* to implementation.

If DDC / UDC can be related to 3066, I'm sure it would be useful to
understand the relationship. This doesn't mean either necessarily has to
support the other: independence means they can both be used. UDC should
still exist, preferably as a standard, so that there is control over its
evolution. A future 3066 may or may not find a use for it. The only
question concerns the evaluation: 1 person, 1 group, 10 people, 1000
people? 1 country? How extensive should it be?

> To use a non-absurd analogy, we *could* encode every distinct attested
> Han ideograph in Unicode, but it has been decided that that level of
> granularity in this particular standard would not be in the best
> interest overall. It would be possible to create a coding for the
> distinct ideographs along with mapping to Unicode (and I believe such
> codings do exist), but nobody (that I know of) suggests that such a
> coding should supplant Unicode in any application contexts that
> currently use Unicode.

Some have seen the need to do work that is not at the core of the
standard. If there are tags for 1000 languages, then we can have up to
1000 * 1000 names for the languages - possibly more if there are variants
of a language name. There could be a system over these 1000 codes to
which people can add names, but making this the core of a standard would
not be wise. If for my purposes I present the name "Keith", but wherever
I use "Keith" I'm actually referring to "en", and I use this code whenever
I exchange data, I should be OK.
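
To make the "Keith" point concrete, here is a minimal sketch in Python
(the names and the mapping are invented for illustration; the only point
intended is that codes, not names, travel in interchange):

	# Local display names map onto canonical codes; only the codes
	# ever appear in exchanged data.
	DISPLAY_NAME_TO_CODE = {
	    "Keith": "en",      # my private name for English
	    "English": "en",
	    "anglais": "en",    # a French name for the same code
	}

	def code_for_exchange(display_name):
	    # Resolve a locally-used name to the code used in interchange.
	    return DISPLAY_NAME_TO_CODE[display_name]

	assert code_for_exchange("Keith") == "en"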

> I certainly believe that there are users such as anthropological
> linguists that need to catalogue information at a fine level of
> granularity, even down to speaker x usage context; but even Linguasphere
> does not suggest that *that* level of granularity be coded. Clearly, it
> is a legitimate question to ask what level of granularity is practical
> and appropriate for coding.

Certainly. But certain combinations may be easier to pre-specify, while
others have to be composed as needed. There's a perfectly good ISO
standard for dates/times. How to combine language and date is an
interesting question. We have "English, Middle (1100-1500)" with "enm"
already. Does this imply "en" is anything from 1500 onwards? Does "en"
include "enm"? For some uses, perhaps these codes are sufficient by
themselves. For others, perhaps not.
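
As a toy illustration of that ambiguity, one possible reading in Python
(a sketch only: the cut-off years are my interpretation of the code
names, and "ang" for Old English is assumed from ISO 639-2; none of this
is defined by the standards):

	# Hypothetical: pick an English code by year, reading the
	# "English, Middle (1100-1500)" gloss literally.
	def english_code_for_year(year):
	    if year < 1100:
	        return "ang"   # Old English, assuming the 639-2 code
	    if year < 1500:
	        return "enm"   # Middle English (1100-1500)
	    return "en"        # anything from 1500 onwards?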

> In general, I have no problem with coding at the kind of level of
> granularity used in Linguasphere, though I do see some potential
> practical pitfalls. One has to do with the enormity of the research
> required to support comprehensive classification at this level of
> granularity: with Linguasphere not having been widely available, it
> hasn't been extensively reviewed. The second has to do with establishing
> a basis for classification: it's difficult to decide when to code
> distinct *language* identities, but there are at least several guiding
> criteria; but for sub-language distinctions, there are potentially any
> number of distinctions that can be made, down to the level of idiolect
> (even utterance!), and it's not obvious how sub-language classifications
> should be defined. 

And, of course, we can't easily pigeon-hole a living language - we
collect "representative" samples and hope to get some sense of the
thing. By the time we measure it, it has moved. Something akin to
"quantum linguistics", where the act of measuring moves the thing
measured?

I believe the Linguasphere forum has been set up for precisely this
reason: because language is fluid, new discoveries or changes to extant
theories should be discussed, and where possible improvements in coverage
made. New research in languages is bound to produce shifts.

> Even if we can consider such issues to have been adequately addressed,
> we still face the question of whether the comprehensive coding based on
> that classification meets a widely felt need and does so in a coherent
> way. I concede that anthropological linguists may be interested in
> classification at that level of granularity, but there are many other
> questions: 

Does UDC meet a widely felt need? I'm sure this is not an easy thing to
quantify. I think the following questions could be asked of any system
set up for any classificatory task.

> Does the one classification meet the needs of all anthropological
> linguists, or are different ones interested in different
> classifications?

Probably the latter, although this becomes an impediment to interoperability
unless there is some understanding of the combinations. If the former were
possible, better interoperability should ensue. At least we would hope that
to be the case.

> Does the need extend beyond this particular sector? If yes, does the
> same classification work for all sectors?

Probably, but at different levels for different requirements. One
classification, no matter how extensive, is never going to work everywhere.
(Private use extensions?)

> Do any of them need a comprehensive coding set at that level of
> granularity, covering tens of thousands of categories?

Again I would refer back to UDC.

Peter, on a lighter note, anybody noting that you work for Microsoft may
well ask about functionality creep in MS products - the maxim about 90%
of users only using 10% of the functions. I couldn't possibly...

> If yes to all the above, what is the appropriate standards-setting body
> to establish a coding, and what are appropriate application contexts in
> which to reference that coding?

I don't think we can easily give a yes/no answer to the above questions;
there's too much ground in between. The appropriate standards body would
seem to be the one already looking after language codes. Defining
"appropriate" contexts may be tricky; providing examples of uses should
be easier.

> In the latter regard, I am particularly uncomfortable with proposing
> such a large code set get used in RFC 3066 contexts as it's asking
> industry at large to adopt and support something that goes a level of
> magnitude beyond what the vast majority are presently interested or
> capable of dealing with -- you don't do it until there's adequate time
> for many stakeholders to evaluate the costs and benefits and decide
> whether or not to buy in.
> 

Do stakeholders have to buy in to other private use extensions? I'm not
sure I understand the intention of these in terms of implementation.

   For example: Users who wished to utilize SIL Ethnologue for
   identification might agree to exchange tags such as
   'az-Arab-x-AZE-derbend'. This example contains two extension subtags.
   The first is "AZE" and the second is "derbend".
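
A minimal sketch of how such a tag splits, in Python (assuming only a
well-formed tag and the 'x' singleton convention that introduces
private-use subtags; no validation against any registry is attempted):

	# Split an RFC 3066-style tag at the "x" singleton. Tags are
	# case-insensitive, so the comparison is lowercased, but the
	# original case is kept in the output for readability.
	def split_private_use(tag):
	    subtags = tag.split("-")
	    lowered = [s.lower() for s in subtags]
	    if "x" in lowered:
	        i = lowered.index("x")
	        return subtags[:i], subtags[i + 1:]
	    return subtags, []

	public, private = split_private_use("az-Arab-x-AZE-derbend")
	# public  -> ['az', 'Arab']
	# private -> ['AZE', 'derbend']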

> Certainly. But the user community for protocols like RFC 3066 represent
> a *wide* variety of sectors -- all, to be exact. Each sector's needs
> must be met, but they don't necessarily all have to be met using one
> protocol for everything that happens to look familiar.

> At one time, alpha-4 was
> suggested for ISO 639-3, but that was abandoned when it became clear
> that doing so would create obstacles to implementation with no
> significant benefit. 

What were the obstacles in this case, and how do they relate to the 
above private use extensions? 

> Certainly Linguasphere has possible benefits to
> offer from using alpha-4 rather than alpha-3 that ISO 639-3 would not,
> but those potential benefits still have to be shown to be of enough
> interest to overcome any obstacles to implementation.

Ok.

> > Indeed both the system and the identifier could be
> > noted in such a combination (e.g. sil-ttr).
> 
> I floated that idea a few years back, but there wasn't really interest
> in multiple systems, let alone the infrastructure needed to support it
> all.

Some ideas take time to germinate. We didn't have much "Semantic Web"
a few years back... We have "xml:lang"; this wouldn't seem to be too
distant.
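
For what it's worth, the combination itself would be trivial to handle;
a sketch in Python (the "sil" prefix and the tag layout come only from
the quoted example and are hypothetical, not anything registered):

	# Split a hypothetical system-plus-identifier tag such as
	# "sil-ttr" into its system prefix and its identifier.
	def split_system_tag(tag):
	    system, _, identifier = tag.partition("-")
	    return system, identifier

	assert split_system_tag("sil-ttr") == ("sil", "ttr")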

> I consider this to be a *non* requirement for coding systems, and in
> fact an impossibility. (Ironically, it has been mainly David Dalby who
> has suggested to me that greater consideration should be given to
> mnemonicity in ISO 639-3 than I have thought necessary.)

As I understand things, there have been prior discussions about how
certain codes MUST be mnemonic. I sit firmly on the *non* side, but would
then probably incur the wrath of those who want them. It's possibly easier
to "cater for" this particular discussion by making the codes mnemonic
than to keep worrying about it. That doesn't mean we *have* to like it...


