Question on ISO-639:1988

jcowan at reutershealth.com jcowan at reutershealth.com
Thu Jun 3 21:24:21 CEST 2004


Lee Gillam scripsit:

> > I once (with shame I confess it) tagged about 30,000 Japanese resources
> > as "jp" rather than the correct "ja" ("jp" of course being the ISO 3166-1
> > code for Japan), but at least "jp" doesn't mean, say, "Buginese".
> 
> But I guess in neither case could a system retrieve it as it would be
> expected to?

There's a great difference between false negatives and false positives.  If
a general enquirer had asked my system "Got any Japanese?" it would have
answered "No".  But at least if asked "Got any Buginese?" it would not have
answered "Yes" and then sent irrelevant and unreadable Japanese!
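
A minimal sketch of that distinction, in Python.  The catalogue contents and
the lookup function are hypothetical, invented for illustration; they are not
the actual system.

```python
# Hypothetical catalogue: resources keyed by the language tag they were given.
# "jp" is the mistaken tag (an ISO 3166-1 country code, not a language code).
catalogue = {"jp": ["resource-1", "resource-2"]}

def lookup(tag):
    """Return the resources filed under a language tag (empty list if none)."""
    return catalogue.get(tag, [])

# False negative: the Japanese resources exist, but not under the correct "ja".
japanese = lookup("ja")

# No false positive: asking for Buginese ("bug" in ISO 639-2) returns nothing,
# because the mistaken tag "jp" does not name any other language.
buginese = lookup("bug")

print(japanese, buginese)
```

Had the resources been filed under a tag that validly names some other
language, the second lookup is the one that would have gone wrong.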

In fact, there were no general enquirers, because those who retrieved the
resources did so under contract, and I simply explained to them the
encoding error.

> > (Who knows how often our luggage destined for Oakland, California (OAK)
> > is sent to Oamaru, N.Z. (OAM) instead, to be retrieved only after
> > ruinous delays?)
> 
> I would hope it is much less than 50%. But, are the identifiers at fault,
> or is it the application of an incorrect assumption that might lead to such
> a result?

Well, I chose two rather mnemonic IATA airport codes, thus partially undermining
my own point.  There is a school of thought that says "*Never* choose codes
that are mnemonic in any way, because whatever Real World name they are attached
to will inevitably change out from under you; it's better to have arbitrary
tags that must be mapped."  To these folks I like to point out that the city
name "Roma" has been stable for the last 2756 years, rather more than anyone
has a right to expect of any nomenclature whatsoever.

> Perhaps classifying language using language is in itself a flawed endeavour?

Flawed but not useless.  We cannot in the end trap meanings perfectly in the
net of words, but we needs must try.

> a further paper: http://www.linguasphere.com/doc/lrec_workshop2004.pdf

I read this paper already, thanks.

> An 8 layer hierarchy with 26 possible elements per layer can be used to
> produce a very large system (2 followed by 11 zeros approximately).

But only if everything falls just right, and it does not.  The IP addresses
used on the Internet have a maximum extent of 2^32 = about 4 billion codes,
but we are running out of them even though we do not have 4 billion computers
in the world, because they are assigned in a hierarchical manner and some
parts of the system are much sparser than others.  The successor protocol
allows 2^128 codes precisely so that it can be extremely sparse throughout
in hopes that it will not run over.
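
The arithmetic can be checked with a few lines of Python.  The figure of 5
used slots per layer is purely an assumption for illustration, not a measured
property of any real classification.

```python
# Theoretical capacity of an 8-layer hierarchy with 26 elements per layer.
layers = 8
branching = 26
full = branching ** layers           # 208,827,064,576, about 2 x 10^11

# Assumed fill (illustrative only): 5 of the 26 slots in use at every layer.
used_per_layer = 5
sparse = used_per_layer ** layers    # 390,625 codes actually reachable

print(full, sparse, sparse / full)

# The address spaces mentioned above, for comparison.
print(2 ** 32)     # 4,294,967,296
print(2 ** 128)
```

Under that assumption fewer than two codes in a million are ever reachable,
which is why the nominal capacity says little about how soon a hierarchical
scheme fills up.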

> If up to 26 can be considered to form a "section", "group", "cluster",
> "family" or whatever it may be referred to as, at minimum there is
> a systematic approach to this. Also, perhaps (although not yet encountered),
> an "empty" grouping can be used to subsume up to 676? Perhaps a need
> here to reserve a given letter?

Possibly.  But what really shoots down these hierarchical keys is that they
are unstable, and intentionally so -- they are meant to change as the state of
knowledge about the (chosen) hierarchy changes.  Yet without them, the
rich 4-letter identifier codes are almost uninterpretable.
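
A rough sketch of the instability problem, in Python.  The letters and the
key-building function are invented for illustration and do not reproduce any
real coding scheme.

```python
# Hypothetical scheme: a language's key is the path of letters from the root
# of the classification tree down to the language itself.
def hierarchical_key(path):
    return "".join(path)

# The same language before and after scholars move it to another group.
before = hierarchical_key(["a", "c", "f"])
after = hierarchical_key(["b", "d", "f"])

# A resource tagged under the old classification keeps the old key.
stored_tag = before

# The stored tag no longer matches the language's current key.
print(stored_tag == after)
```

Every resource tagged before the reclassification now carries a key that the
current hierarchy no longer assigns to that language.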

> I also agree that it would be good to support, at least, questions of both
> geographical (if there is a north side of a South Circular Road in Dublin)

The North and South Circular Roads considered jointly do form a circle more or
less, though each of them is only semi-circular.  So they do have north and
south sides.

> And indeed the road ahead is not an easy one.

Amen.  Though hopefully it is not circular.

-- 
"Well, I'm back."  --Sam        John Cowan <jcowan at reutershealth.com>

