Question on ISO-639:1988

Thu Jun 3 20:30:15 CEST 2004

> RFC numbers, unlike ISO ones, are replaced absolutely when new versions are
> issued.  The predecessor of RFC 3066 was RFC 1766, and its successor will
> get a number in the 3300s or higher, depending on when it is actually issued.

Thanks. Another example, if one were needed, of different systems of
identifiers and how they are used in different (standards) processes.

> > this would leave a number of unused alpha2s that could be adopted by
> > individuals or organisations to refer to identifiers elsewhere
> 
> A very dangerous practice.  Few modern databases are so restrictive that
> they can use only 639-1 for technical reasons.  The community has
> painfully learned that unassigned codes of any sort (outside dedicated
> private-use areas) should be left severely alone.

Wholehearted agreement here. The point was that certain types of legacy
systems (extremely old ones, which probably means more than 5-10 years old)
could be catered for by "devious" programmers, provided the correct alarm
bells were also wired in.

> It does provide a certain degree of robustness against coding errors.
> I once (with shame I confess it) tagged about 30,000 Japanese resources
> as "jp" rather than the correct "ja" ("jp" of course being the ISO 3166-1
> code for Japan), but at least "jp" doesn't mean, say, "Buginese".

But I guess in neither case could a system retrieve it as it would be
expected to?
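
To make the retrieval question concrete, here is a minimal sketch in
Python. The code tables are tiny hand-picked subsets, and the names
check_language_tag and retrieve are hypothetical, chosen only for this
illustration. It shows that an exact-match lookup for "ja" misses anything
tagged "jp", but that the mistake can at least be recognised as a country
code rather than silently read as some other language.

```python
# Illustrative subsets only; a real system would load the full code lists.
ISO_639_1 = {"en": "English", "ga": "Irish", "cy": "Welsh", "ja": "Japanese"}
ISO_3166_1 = {"JP": "Japan", "IE": "Ireland", "GB": "United Kingdom"}


def check_language_tag(tag):
    """Classify a two-letter tag: valid, likely country-code mix-up, or unknown."""
    code = tag.lower()
    if code in ISO_639_1:
        return "ok"
    if code.upper() in ISO_3166_1:
        return "country-code"
    return "unknown"


def retrieve(resources, wanted):
    """Exact-match retrieval: titles whose tag equals the wanted language code."""
    return [title for title, tag in resources if tag == wanted]


resources = [("Resource A", "ja"), ("Resource B", "jp")]
print(retrieve(resources, "ja"))   # the mis-tagged resource is not found
print(check_language_tag("jp"))    # but the bad tag can be flagged for repair
```

So the mis-tagged resources are indeed not retrieved, but they can be
found and repaired in bulk, which would be harder if "jp" had been a
valid code for another language.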

> (Who knows how often our luggage destined for Oakland, California (OAK)
> is sent to Oamaru, N.Z. (OAM) instead, to be retrieved only after
> ruinous delays?)

I would hope it is much less than 50%. But are the identifiers at fault,
or is it the application of an incorrect assumption that leads to such
a result? If these tags were, instead, a 10-digit number, completely
unrelated to telephone numbers or some such, would it prevent such 
possible occurrences? I am thinking, of course, of a longer identifier
than this - a URI, perhaps to an element in an "ontology", that describes
the tag and its various familial associations. Perhaps supporting other
types of relationships also.

Perhaps classifying language using language is in itself a flawed endeavour?
A dictionary is only useful if you can interpret its contents.

> > [snip] Euro 2004 [snip]
> Until today, I (ignorant Yank) didn't even know what it was.

The simple explanation is, it's a football (soccer) event that occurs
every 4 years and which, despite only 16 teams participating, England
have never won. You're probably better off not knowing.

In relation to the discussion below, although I cannot (yet) provide
a convincing argument in direct response, I would like to mention
a further paper: http://www.linguasphere.com/doc/lrec_workshop2004.pdf
This discusses the composition of the linguasphere scale (bottom p4,
top p5) and an example of use (p6). If 4 pages generate such discussion,
I look forward to what comes out of 8!

An 8 layer hierarchy with 26 possible elements per layer can be used to
produce a very large system (2 followed by 11 zeros approximately).
26 has only been chosen because of the alphabet - if China had defined
such standards to begin with, how many possible elements might there be?
If up to 26 can be considered to form a "section", "group", "cluster",
"family" or whatever it may be referred to as, at minimum there is
a systematic approach to this. Also, perhaps (although not yet encountered),
an "empty" grouping can be used to subsume up to 676? Perhaps a need
here to reserve a given letter?
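
The arithmetic behind these figures can be checked in a few lines of
Python. This is only a sketch of the counting argument (8 layers, 26
letters per layer); the variable names are mine and it says nothing about
how the codes are actually allocated.

```python
LAYERS = 8    # depth of the hierarchy
FANOUT = 26   # one letter of the alphabet per element in a layer

total = FANOUT ** LAYERS
print(f"{total:,}")     # 208,827,064,576
print(f"{total:.2e}")   # 2.09e+11, i.e. roughly 2 followed by 11 zeros

# An "empty" grouping that skips one layer spans two layers of fanout.
print(FANOUT ** 2)      # 676
```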

My note about asking the system to do the work would, ideally, allow
for these types of inferences (across to siblings, up to parents,
grandparents, to cousins, and so forth). I would make reference again
to the oft-used (whether correctly or not) word "ontology", which might 
support such inference. Through various programmatic means of course.
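
As a rough illustration of one such programmatic means, the sketch below
assumes a key is simply a string with one character per layer, so that a
parent is the key minus its last character. The keys shown are made up
and are not real Linguasphere codes; the function names are hypothetical.

```python
def common_depth(a, b):
    """Number of leading layers two keys share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def ancestors(key):
    """Parent, grandparent, ... of a key, nearest first."""
    return [key[:i] for i in range(len(key) - 1, 0, -1)]


def relationship(a, b):
    """Name the relation between two keys of equal depth."""
    if len(a) != len(b):
        raise ValueError("keys must have the same depth")
    up = len(a) - common_depth(a, b)  # layers to climb to a shared ancestor
    return {0: "same", 1: "sibling", 2: "cousin"}.get(up, f"related {up} layers up")


print(ancestors("abcd"))             # ['abc', 'ab', 'a']
print(relationship("abcd", "abce"))  # sibling
print(relationship("abcd", "abxy"))  # cousin
```

A real ontology would carry more than one kind of relationship; this only
covers the single tree that the key itself encodes.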

I also agree that it would be good to support, at least, questions of both
geographical (if there is a north side of a South Circular Road in Dublin)
and linguistic types. Perhaps this, and other types of questions, may be
supportable by other, additional and complementary approaches. The more
there is to be supported, the more identifiers are going to be needed.

The point, for me at least, is that if you wish to, you can elect to
use or not use specific systems of identifiers for whatever reason.
The same as you can choose a certain mobile telephone, a certain email
package and so forth. Luckily, thanks to certain protocols, such items
can communicate in certain ways with each other, but that does not prevent
the additional functions from being unusable by certain people - e.g.
not being able to receive photos on all phones. If there are protocols
for being able to *roam* across systems of identifiers, if it is
necessary for some group, the question to answer is: can such a
need be supported internationally by bootstrapping an existing highly
specified resource? The follow-up question is: can we support 
international research efforts, where requirements are perhaps less
well specified, by attempting this also?

What should be returned in response to an empty query is for system
designers to decide - I would not personally make a presumption of
what was closest, but offer the ability to generalize and/or traverse
in various directions, or to provide other short-cuts to whatever
type of relationship a certain system may require.
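
A minimal sketch of that idea in Python, again assuming made-up string
keys with one character per layer and a hypothetical function name,
generalize. Rather than picking a single "closest" answer, it offers what
is available at each successively wider level and leaves the choice to
the caller.

```python
def generalize(requested, available):
    """Yield (prefix, matches) pairs, widening the search one layer at a time."""
    seen = set()
    for depth in range(len(requested), -1, -1):
        prefix = requested[:depth]
        matches = sorted(
            k for k in available if k.startswith(prefix) and k not in seen
        )
        seen.update(matches)
        if matches:
            yield prefix, matches


available = {"abce", "abxy", "qrst"}   # keys for which content exists
for prefix, matches in generalize("abcd", available):
    print(repr(prefix), matches)
# 'abc' ['abce']
# 'ab' ['abxy']
# '' ['qrst']
```

Nothing here decides that the nearest level is the best answer; a system
could equally rank the levels by some other relationship.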

And indeed the road ahead is not an easy one.

> The worst problem I see with the Linguasphere identifiers is the extreme
> difficulty of relating the more general to the less general, as must be
> done if requests are to be appropriately satisfied.  It may make sense
> to assign distinct 4-letter codes to such linguistic entities as:
> 
> 	English
> 	Hiberno-English
> 	Hiberno-English, spoken
> 	Hiberno-English, spoken in Dublin
> 	Hiberno-English, spoken in Dublin on the North Circular Road
> 	Hiberno-English, spoken in Dublin on the North Circular Road (south side)
> 
> but a supplier of information that has content tagged with the last
> code will not be able to reply to a request for simply "English" unless
> it grasps this particular branch of the entire system (which leads up
> to "Germanic" and "Indo-European" at higher levels, if I understand
> correctly).
> 
> In order to do this, it must have the Linguasphere key (hierarchical
> identifier) corresponding to the 4-letter code, but this is (a) unstable
> and (b) brittle, with its fixed maximum hierarchical depth of 8 and its
> limited fanout of 10 to 26 siblings at each level.
> 
> In addition, any such hierarchical system that implements only one
> hierarchy (a mixture of geographical and phylogenetic information,
> and as far as the 2-digit value that forms the first two tree levels,
> very ingeniously designed) will often produce the wrong answer.  Thus,
> if Irish information is requested and none is forthcoming, it is almost
> certainly going to be better to return English (agreeing in the first
> digit of the hierarchical code only) than Welsh (agreeing in the first
> two digits).
> 
> In short (and while I am not judging the system in full, not having seen
> it in full), I very much suspect that for IT purposes the game will not
> be worth the candle.  I wish it were.
> 
> -- 
> "Take two turkeys, one goose, four              John Cowan
> cabbages, but no duck, and mix them             http://www.ccil.org/~cowan
> together. After one taste, you'll duck          jcowan at reutershealth.com
> soup the rest of your life."                    http://www.reutershealth.com
>         --Groucho