Question on ISO-639:1988

Thu Jun 3 15:46:39 CEST 2004

Dear list members,

As a newcomer to this list (so only considering mails from this month), 
I'm finding some of the comments quite interesting. 

As a co-author of the paper being discussed, which I believe to be the
4-page paper from the main scientific track at the recent Language Resources
and Evaluation Conference (LREC), held in Lisbon at the end of May, not
the 8-page paper presented at a workshop, it is fascinating to me that a paper 
can provoke such discussion. Some academic papers are read by their
authors and very few others (if that!). Such discussion is excellent and 
further debate can only help to shape, structure and assist in the
required efforts to promote understanding of ALL these initiatives.

I would pose one caveat however - it's difficult to deal with all the
issues being raised in the scope of a 4-page, or even 8-page paper.
Some statements have to remain so generic as to be almost vacuous.
This is unfortunate, but unavoidable.

I don't know whether I've seen the (proposed? modified? adopted?)
latest RFC 3066 (2001 seems to be the common one I can find), and whether
it could be used to cover information organisation for linguistic
anthropology, e.g. studies of speech. Jeremy Carroll's identification 
of the need for use-cases, and how they would assist specific communities
of users is one that should be taken up - perhaps by all parties involved in
the definition of 639? Towards some kind of roadmap for adoption
or the like?

To take a slight aside, the 10 main classes, 100 divisions and 1000 
sections of the Dewey Decimal System do not specifically hint at the 
complexity of the Universal Decimal Classification, in which Biblical 
Hebrew is listed with code 811.411.16'02, and Biblical Aramaic is listed 
with code 811.411.171.1'02. Although a specific researcher is more likely 
to only be interested in certain locations of works classified by this 
system, and is almost certainly not going to organise his bookcase
according to it, the classification of all works within a library remains 
a useful component in its storage. This does not prevent those with other
requirements from using different classification systems based on other
principles (with other presumptive "Universal" titles), e.g. the Lenoch 
Universal Classification, which may be better understood for that purpose. 
UDC and LUC are both broad and highly granular, although UDC probably
wins on number of classifiers.

The difference between an occasional need for additional granularity, 
versus any requirement for a system that can be used for all works, is
clear. Works that can be organised (shall we say classified) within the
context of some system can be related to other elements within that
system, whether or not that was the intention. For purposes such as
resource (or what some decide to call "knowledge") discovery, this can be
useful. The more axes along which information can be classified, even
using orthogonal systems of classifiers, the more it can be related to
other elements across systems and potentially support research questions
which have yet to be formulated. Obviously, the principle of least work
applies, so for example I would not use most of 639-1 for small ad hoc
resources. However, if in doing so mappings to other systems of
classifiers are available, others who did not originally use -1 may be
able to make use of it also, and I of their data, for mutual benefit.
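
To make the point about mappings concrete, here is a minimal sketch in
Python. The three alpha-2/alpha-3 pairs are real ISO 639-1 and 639-2/T
codes; the resource names, the two collections and the function name are
hypothetical, invented only for illustration.

```python
# Crosswalk from ISO 639-1 alpha-2 codes to ISO 639-2/T alpha-3 codes.
# Only three real pairs are listed; a full table would come from the
# registration authorities.
ALPHA2_TO_ALPHA3 = {"en": "eng", "fr": "fra", "de": "deu"}

# Hypothetical resources: one collection tagged with alpha-2 codes,
# the other with alpha-3 codes.
MY_RESOURCES = [("interview-01", "fr"), ("wordlist-07", "de")]
THEIR_RESOURCES = [("corpus-a", "fra"), ("corpus-b", "eng")]


def related_resources(mine, theirs, crosswalk):
    """Pair up resources from both collections that share a language."""
    by_alpha3 = {}
    for name, alpha3 in theirs:
        by_alpha3.setdefault(alpha3, []).append(name)
    pairs = []
    for name, alpha2 in mine:
        alpha3 = crosswalk.get(alpha2)  # None when no mapping is known
        for other in by_alpha3.get(alpha3, []):
            pairs.append((name, other))
    return pairs


print(related_resources(MY_RESOURCES, THEIR_RESOURCES, ALPHA2_TO_ALPHA3))
```

Neither collection has to change its own identifiers; only the crosswalk
has to exist for the two to become mutually discoverable.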

On a slightly different subject, the Internet began militarily (ARPA),
and the Web did not start because business needed it but because a few 
physics researchers wanted to share information. HTML was produced as a 
means to do this, but others prefer PostScript and PDF to provide
consistency in presentation. While not trying to provide a history lesson,
the origins of certain technologies can be contrasted with their
business use. Grid computing does not yet appear to have moved from
academia to industry in any serious way because the business needs have
not been established, although a landmark publication comprising a review
of various technologies appeared in 1999 based on work of that decade - 
and who would have thought that ring-tones for telephones could ever 
be used as a means of making profits? There are a lot of
institutes collecting speech resources, and should a few decide to
collect some for, say, Middle Chulym
(http://www.ironboundfilms.com/timesoflondon.html), they may wish to
have a specific identifier, or perhaps more than one, that can help others
use this also. If ISO identifiers can support such activities, perhaps
eventually business may make use of them (after all our computers
become Grid-desktops). The "Languages of London"
(http://www.linguasphere.org/multi_cap.html) provides a specific
example of a (non-IT) use-case, although it may have implications for IT
provision of specific types of information to these communities. 

A point for IT systems is that they can support whatever their implementers
want them to. Ideally, an IT system should not have to worry about
issues of classification and should be able to query a system of 
classification to discover (infer) relations to other resources classified
within the same, or perhaps a similar, system. Any individual distinction
may or may not be possible, depending on the construction of the system - if
we only include colours and sizes of shirts, we cannot classify by sleeve
length, but through various inferences we may be able to bring such
classifiers together that indirectly enable this. Whether the system can 
be modified to encompass this is an issue of whether it can be supported,
financially or otherwise. The promise of intelligent systems is that the
classification can be automated. Some aspects of automatic classification 
are already possible, and increased automation and personalization of
service provision will doubtless increase the number of classifiers used
for information of all types. Complexity is an issue for human use, but
better technologies may reduce the perceived complexity enough to allow
management of 25000, 70000, 450000 or even more identifiers.

Of course, for legacy systems, if ISO 639-1 were to be frozen in its
current state, this would leave a number of unused alpha2s that could 
be adopted by individuals or organisations to refer to identifiers elsewhere 
- e.g. reference to alpha3s. Interoperability then requires some additional,
but not impossible, translation from system to, say, interchange format. 
If, however, systems can cope with use of e.g. "art-lojban", alpha 4 is, 
surely, trivial? Indeed both the system and the identifier could be
noted in such a combination (e.g. sil-ttr). I admit to finding the
requirement for mnemonic labels slightly odd - international transport
functions well with Toronto airport having a YYZ tag. This difference
between having an identifier and having a specifically constructed
identifier shows when one considers that the 450 or so alpha3s could be
catered for as alpha2s if this requirement were removed. Of course, this
would remove the "big language-small language" distinction, but is it
a crucial distinction?
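
The arithmetic behind that claim, and the "system plus identifier"
combination, can be sketched in a few lines of Python. This assumes the
26-letter basic Latin alphabet and takes the figure of 450 from the
paragraph above; the function name is hypothetical, and "sil-ttr" is
only the combination suggested above, not a registered tag.

```python
import string

LETTERS = string.ascii_lowercase  # assumption: 26 basic Latin letters

alpha2_capacity = len(LETTERS) ** 2  # 26 * 26 = 676 possible alpha2s
alpha3_capacity = len(LETTERS) ** 3  # 26 * 26 * 26 = 17576 possible alpha3s
ALPHA3_IN_USE = 450  # approximate figure quoted in the text

# True when every alpha3 in use could be given an arbitrary alpha2 instead.
fits_in_alpha2 = ALPHA3_IN_USE <= alpha2_capacity


def split_tag(tag):
    """Split 'prefix-identifier' at the first hyphen into its two parts."""
    prefix, _, identifier = tag.partition("-")
    return prefix, identifier


print(alpha2_capacity, alpha3_capacity, fits_in_alpha2)
print(split_tag("art-lojban"))
print(split_tag("sil-ttr"))
```

The identifier length after the hyphen plays no part in the split, which
is the sense in which alpha 4 adds nothing new for a system that already
handles "art-lojban".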

As a final comment here, ISO and BSI workflows are long, laborious, and 
awkward to describe. Although work is ongoing in various quarters, there is 
no guarantee that ANY of this work will be accepted yet by either 
organisation, by which I mean through national member body ballots.
Proposals for new standards are not necessarily "new work item proposals" 
under the ISO process. There are several necessary iterations for the
(ISO) proposals of -3, -4 and -5, and should the (ISO) proposal for -6 be
accepted this will also have to undergo such a process. The implication
of course is that none of these items should be spoken of as ISO standards
until they are published. Use of e.g. ISO/NP, ISO/WD, ISO/CD and so on
should prevent further misunderstanding so long as somebody explains the
workflow. Eventually, those newer to ISO will also "speak" 
ISO. After 4 years of involvement, I'm still discovering how.

And as a slight aside, I suspect France or Germany will win Euro 2004,
with Italy and Portugal in the semi-finals also. It would be nice if 
England won, but current form suggests they'll be home before the postcards.

Best Regards, and apologies for the length and number of issues I've attempted
to discuss in this posting.