Question on ISO-639:1988
petercon at microsoft.com
Thu Jun 3 21:06:35 CEST 2004
> From: ietf-languages-bounces at alvestrand.no [mailto:ietf-languages-
> bounces at alvestrand.no] On Behalf Of Lee Gillam
> it is fascinating to me that a paper
> can provoke such discussion
Actually, it was not the paper that evoked *my* reaction; rather, it was
comments made on several occasions that (i) would seem to suggest that
the BSI project *will* become an ISO standard, which I consider
inappropriate to suggest when it is not yet even at the NWIP stage,
and (ii) suggest that it is the only long-term solution to clear needs
when no needs analysis has been presented.
> As a final comment here, ISO and BSI workflows are long, laborious,
> awkward to describe. Although work is ongoing in various quarters,
> no guarantee that ANY of this work will be accepted yet by either
Your comments are completely in concert with my concerns.
Up to now, I've limited my comments on this list to basically the above.
Having met and interacted with you on a couple of occasions, though, I
know you'd be interested in and up to working through some analytical
issues. So, I'll go deeper to raise some things for your consideration.
This will be a bit lengthy, and I don't necessarily expect a response.
> and whether
> it could be used to cover information organisation for linguistic
> anthropology, e.g. studies of speech.
There are two separate questions that need to be asked, not just one:
- Is RFC 3066 (or a proposed successor) adequate to cover information
organization for linguistic anthropology?
- If not, *should* it be?
To use an absurd illustration, we could extend RFC 3066 so that it was
adequate for purposes of complete subject cataloguing in public
libraries, encompassing all the categories of the Dewey Decimal System,
but that possibility doesn't mean it's a good idea. We need to evaluate
what kinds of properties need to be attributed to information objects,
for what purposes, and what kinds of costs are involved *overall* in
implementation.
To use a non-absurd analogy, we *could* encode every distinct attested
Han ideograph in Unicode, but it has been decided that that level of
granularity in this particular standard would not be in the best
interest overall. It would be possible to create a coding for the
distinct ideographs along with mapping to Unicode (and I believe such
codings do exist), but nobody (that I know of) suggests that such a
coding should supplant Unicode in any application contexts that
currently use Unicode.
I certainly believe that there are users such as anthropological
linguists that need to catalogue information at a fine level of
granularity, even down to speaker x usage context; but even Linguasphere
does not suggest that *that* level of granularity be coded. Clearly, it
is a legitimate question to ask what level of granularity is practical
and appropriate for coding.
In general, I have no problem with coding at the kind of level of
granularity used in Linguasphere, though I do see some potential
practical pitfalls. One has to do with the enormous scale of the
research required to support comprehensive classification at this level
of granularity: with Linguasphere not having been widely available, it
hasn't been extensively reviewed. The second has to do with establishing
a basis for classification: it's difficult to decide when to code
distinct *language* identities, but there are at least several guiding
criteria; but for sub-language distinctions, there are potentially any
number of distinctions that can be made, down to the level of idiolect
(even utterance!), and it's not obvious how sub-language classifications
should be defined.
Those two issues can intersect. So, for instance, when I look at the
classifications in the 1st edn related to Northern Thai, with which I
have a little familiarity, I find the classification to be not
particularly insightful: 10 classes at the lowest level:
- "'archaic Buddhist literary' yuan" (Yuan script)
- "modern literary" lanna (Thai script)
- six varieties corresponding to six provinces, plus a separate
distinction for the Hot district of Chiang Mai province
Listing six provinces really isn't insightful IMO as I suspect dialect
distinctions have more to do with geography, and the geography doesn't
coincide with the provincial boundaries. Nor is the treatment of written
forms all that insightful, for a couple of reasons: for N. Thai written
in Thai script, a significant orthographic issue has to do with tonal
representation and the significant tone-dialect distinction between
Chiang Rai / Lampang and Chiang Mai / Lamphun; and for Lanna script,
there's no reflection of the relationships between written N. Thai,
Khuen and Lue, not to mention how these might relate to Shan or Isaan.
I could raise other issues related to classification of Tai "Daic"
varieties, but this is enough to illustrate the point at hand: when I
evaluate that one portion of the classification, questions are raised in
my mind as to the basis for classification (what were the criteria for
distinguishing or relating different spoken or written varieties, and
were they appropriate criteria?) and the completeness of the research.
Even if we can consider such issues to have been adequately addressed,
we still face the question of whether the comprehensive coding based on
that classification meets a widely felt need and does so in a coherent
way. I concede that anthropological linguists may be interested in
classification at that level of granularity, but there are many other
questions to consider:
- Does the one classification meet the needs of all anthropological
linguists, or are different ones interested in different
classifications?
- Does the need extend beyond this particular sector? If yes, does the
same classification work for all sectors?
- Do any of them need a comprehensive coding set at that level of
granularity, covering tens of thousands of categories?
- If yes to all the above, what is the appropriate standards-setting
body to establish a coding, and what are appropriate application
contexts in which to reference that coding?
In the latter regard, I am particularly uncomfortable with proposing
that such a large code set be used in RFC 3066 contexts, as it asks
industry at large to adopt and support something an order of magnitude
beyond what the vast majority are presently interested in or capable of
dealing with -- you don't do that until there's adequate time for many
stakeholders to evaluate the costs and benefits and decide whether or
not to buy in.
> While not trying to provide a history lesson,
> the origins of certain technologies can be contrasted with their
> business use.
Certainly. But the user community for protocols like RFC 3066
represents a *wide* variety of sectors -- all of them, to be exact.
Each sector's needs
must be met, but they don't necessarily all have to be met using one
protocol for everything that happens to look familiar.
> The "Languages of London"
> ulti_cap.html) provides a specific
> example of a (non-IT) use-case
Since that web site does not, so far as I can detect, make use of
highly granular classifications, I don't see how that statement can be
evaluated. Even if the LofL site provided detailed information on which
linguistic varieties are spoken by which populations in which
geographic distributions in the environs of London, there's no use for a
detailed coding system unless it's used to reference individual
resources. So, suppose you start to reference web pages about
businesses, community groups, events, etc. within the environs of London
wrt language varieties, how many people are going to want to classify at
that level of granularity, how many are going to want to make use of
that level of granularity, and will the axes of distinction they're
interested in all coincide?
It's a lot of machinery, and all I've heard so far is a handful of
people involved in building the machinery saying it's a good thing; I
haven't yet heard anybody else wanting to implement or use it saying,
Wow, that meets a big need we've been concerned about. If there was such
a response from the LREC meeting in Lisbon, then please do provide
details.
> If, however, systems can cope with use of e.g. "art-lojban", alpha 4
> surely, trivial?
In one sense, it would be trivial to replace xml:lang with some
completely new attribute, xml:ling (say) using some different coding
scheme than RFC 3066. In the same sense, it would be trivial to replace
XML with another spec, YML, that differed (say) only in using some other
pair of delimiters in place of "<>". (I'm sure you could produce a
complete set of first-draft docs in a matter of hours.) But obviously,
there is another sense in which these would be anything but trivial. At
one time, alpha-4 was
suggested for ISO 639-3, but that was abandoned when it became clear
that doing so would create obstacles to implementation with no
significant benefit. Certainly Linguasphere has possible benefits to
offer from using alpha-4 rather than alpha-3 that ISO 639-3 would not,
but those potential benefits still have to be shown to be of enough
interest to overcome any obstacles to implementation.
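The "trivial in one sense" point can be made concrete. RFC 3066's
syntax allows any subtag of one to eight letters or digits, so a tag
like "art-lojban" -- or a hypothetical alpha-4 subtag -- is already
well-formed; the real cost lies in registries and deployed software,
not the grammar. A minimal sketch of a purely syntactic check (my own
illustration, not code from any standard):

```python
import re

# RFC 3066 ABNF: Language-Tag = Primary-subtag *( "-" Subtag )
#   Primary-subtag = 1*8ALPHA ; Subtag = 1*8(ALPHA / DIGIT)
TAG_RE = re.compile(r"^[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*$")

def is_wellformed_3066(tag: str) -> bool:
    """Check only the *syntax* of an RFC 3066 language tag.

    Whether the tag is actually registered or meaningful
    (e.g. "art-lojban" is a registered tag) is a separate
    registry lookup not modeled here.
    """
    return TAG_RE.match(tag) is not None

# A hypothetical alpha-4 subtag is syntactically indistinguishable
# from any other 1-8 character subtag -- which is why the change is
# "trivial" only in this narrow syntactic sense.
print(is_wellformed_3066("art-lojban"))       # True
print(is_wellformed_3066("en-GB"))            # True
print(is_wellformed_3066("toolongsubtag-x"))  # False: >8 chars
```

The syntactic layer accepts alpha-4 today; everything hard about
adopting Linguasphere codes lives above that layer.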
> Indeed both the system and the identifier could be
> noted in such a combination (e.g. sil-ttr).
I floated that idea a few years back, but there wasn't really interest
in multiple systems, let alone the infrastructure needed to support it.
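For what the floated idea would have looked like mechanically: a
composite identifier such as "sil-ttr" names both the coding system
and the code within it. A sketch, with the caveat that this scheme was
never adopted and the "registry" of systems below is entirely invented
for illustration:

```python
# Hypothetical only: the system-prefixed scheme (e.g. "sil-ttr")
# was floated but never standardized; these system names are
# invented examples, not a real registry.
KNOWN_SYSTEMS = {"sil", "lsp"}  # say, SIL/Ethnologue, Linguasphere

def split_system_tag(tag: str) -> tuple[str, str]:
    """Split a composite tag into (coding system, identifier)."""
    system, sep, code = tag.partition("-")
    if not sep or system.lower() not in KNOWN_SYSTEMS:
        raise ValueError(f"unknown coding system in {tag!r}")
    return system.lower(), code.lower()

print(split_system_tag("sil-ttr"))  # ('sil', 'ttr')
```

The parsing is the easy part; as the text notes, the missing piece was
any infrastructure (registries, mappings, resolution rules) behind the
system prefixes.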
> I admit to finding the
> requirement for mnemonic labels slightly odd
I consider this to be a *non* requirement for coding systems, and in
fact an impossibility. (Ironically, it has been mainly David Dalby who
has suggested to me that greater consideration should be given to
mnemonicity in ISO 639-3 than I have thought necessary.)