What's the plan for ISO 639-3 and RFC 3066 ter?

Tue Aug 17 14:14:04 CEST 2004

Addison Phillips [wM] scripsit:

> It seems odd to have a bunch of language collection codes that don't
> bother to collect up the languages... 

I'd say "Of course" if we hadn't had just that in 639-2 for years.

> No, lmn is a synonym for oc-lmn and since, according to matching from
> 3066 onwards, you cannot trim parts of the requested range, (range)
> 'lmn' doesn't match (tag) 'oc'. The thing that does the falling back is
> the tag on the content being selected. 

Sorry, yes, of course: brain fart on my part.  Don't ask for lmn unless
you mean it, for the same reason that you don't ask for zh-yue unless
you mean it.  Of course, there is nothing to prevent the requester from
falling back externally to the standard algorithm:  I can ask for en-us
(so as not to have to see those pesky British spellings) and then fall
back to en myself when I get, er, nada.
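
To make the matching rule concrete, here is a minimal sketch in Python.
It assumes the RFC 3066 prefix rule (a range matches a tag if it equals
the tag, or is a prefix of it ending at a "-" boundary) and leaves out
the "*" wildcard.  The function names and the sample data are mine, not
anything from the draft.

```python
def matches(lang_range, tag):
    """RFC 3066-style match: the range equals the tag, or is a prefix of
    the tag that ends at a '-' boundary. Comparison is case-insensitive."""
    r, t = lang_range.lower(), tag.lower()
    return t == r or t.startswith(r + "-")


def select(lang_range, tagged_content):
    """Return the items whose tag matches the range. The range is never
    trimmed here; that is the point of the rule."""
    return [item for tag, item in tagged_content if matches(lang_range, tag)]


def select_with_requester_fallback(lang_range, tagged_content):
    """Hypothetical requester-side fallback, outside the standard
    algorithm: drop the last subtag of the range and retry until
    something matches or nothing is left."""
    parts = lang_range.split("-")
    while parts:
        hits = select("-".join(parts), tagged_content)
        if hits:
            return hits
        parts.pop()
    return []


# Illustrative content only.
content = [("oc", "Occitan page"),
           ("en", "generic English"),
           ("en-GB", "British English")]

print(select("lmn", content))    # [] : range 'lmn' does not match tag 'oc'
print(select("en-us", content))  # [] : nothing tagged en-us
print(select_with_requester_fallback("en-us", content))
```

Note that once the requester falls back to en, the en-GB item comes
along too, since the range en matches every tag that begins with "en-".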

> Martin Dürst has pointed out that this is the reverse of locale
> fallback systems in that with language tags you want to specify the
> *least* specific tag you'll accept, generally returning more content
> the less specific you are. 

Very true.  Perhaps this paragraph belongs in the draft as informative
text.

> Right. So the problem is to engineer the most reasonable mechanism
> for getting to the right results, regardless of what tomfoolery ISO
> 639-x gets up to. Ideally we pick a mechanism that is most likely to
> work with the various ISO 639 parts, hoping against hope not to have
> to change it later.

Indeed.  After all, though, the Good Guys have pretty much gotten control
of the WG at this point; I'm not worried that ISO will conspicuously do
the wrong thing.  I'm more worried about timing.

When 639-3 goes through, we can reasonably lay down the law in 3066ter
about all the thousands of individual language codes that will become
available, more or less as follows:  if it's part of a macro-language,
it's an alias; if not, it isn't.  So iii (Sichuan Yi) becomes a valid
language tag, and so does yue (Cantonese), but the latter is an alias
for zh-yue (which it just so happens is also a grandfathered form
that will now become productive, so we can have (zh-)yue-Latn and
(zh-)yue-taishanese and what not), and so does ccx (Northern Zhuang),
which will be an alias for za-ccx, not grandfathered.
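
As a rough sketch of that rule (Python; the table holds only the
examples above and is not a real registry, and expand_alias is a
hypothetical name), the alias expansion might look like this:

```python
# Illustrative only: individual codes that belong to a macro-language,
# mapped to that macro-language's code. iii is absent on purpose.
MACRO_PARENT = {"yue": "zh", "ccx": "za"}


def expand_alias(tag):
    """If the first subtag is a member of a macro-language, treat the tag
    as an alias and prepend the macro-language code; otherwise return the
    tag unchanged."""
    subtags = tag.split("-")
    parent = MACRO_PARENT.get(subtags[0].lower())
    if parent is None:
        return tag
    return "-".join([parent] + subtags)


for t in ("iii", "yue", "yue-Latn", "ccx", "zh-yue"):
    print(t, "->", expand_alias(t))
```

A tag that already starts with the macro-language code, such as zh-yue,
passes through untouched because zh itself is not in the table.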

But then comes 639-5, and we have to decide whether to take the same
approach to collective codes.  Now a collective code is more or less
a fallback position.  Knowing that something is written in nai (a North
American Indian language) is tolerable for cataloguers, but it does little
or nothing for people who want to actually use the resources, because you
can't predict whether it'll be useful for you or not -- nobody can read
and understand all the nai languages.  (Remember that the distinction
between a macro-language and a collective is that a macro-language
is seen as a single entity sometimes and as a group of fairly closely
related entities at other times, whereas a collective is purely a group.)

The editor of 639-5 has a serious decision to make: to create a minimalist
list of collectives, more or less just the ones grandfathered in 639-2;
or to actually try to make something more or less coherent and useful by
itself.  And then he or she will run smack into the problem that language
collections are highly theory-laden entities, and they change quite
frequently with extensions in knowledge or even just scholarly fashion.
Battles royal rage over whether Nostratic means anything, and if so,
exactly what.  Ethnologue's classification is a fairly conservative one,
but there will be real trouble if the coding world is asked to just
swallow it whole (whereas swallowing the language list whole isn't that
big a deal, considering that the main issues are probably of the form
"Language or dialect?" which are known to be not fully soluble anyhow).

Now let's take a fairly non-controversial potential collective like the
Germanic languages.  Anything more than a minimalist 639-5 is probably
going to encode this.  We certainly do not want to say that en is now
going to be an alias for *gmc-eng, where *gmc is a hypothetical code for
"Germanic languages".  And yet we have, or may have, legacy data encoded
faute de mieux using a collective from 639-2, so we can't just banish all
thought of collective codes.  I suspect the best we can do is to allow,
while discouraging, the use of collective codes in 3066ter (or 3066quater,
depending on timing), and to disallow the use of collective-individual
code pairs like *gmc-eng altogether.
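
A validator following that policy could be sketched as below (Python).
Everything here is an assumption for illustration: the two code sets
are tiny stand-ins, gmc is the hypothetical code from above, and the
three result labels are made up.

```python
# Stand-in code sets; a real implementation would load these from the
# relevant ISO 639 parts.
COLLECTIVE = {"nai", "gmc"}              # gmc is hypothetical
INDIVIDUAL = {"en", "eng", "iii", "yue"}


def classify(tag):
    """Return 'invalid' for a collective-individual pair, 'discouraged'
    for any other tag led by a collective code, and 'ok' otherwise."""
    subtags = [s.lower() for s in tag.split("-")]
    if subtags[0] in COLLECTIVE:
        if len(subtags) > 1 and subtags[1] in INDIVIDUAL:
            return "invalid"
        return "discouraged"
    return "ok"


for t in ("en", "nai", "gmc-eng"):
    print(t, "->", classify(t))
```

Legacy data tagged with a bare collective still gets through, just with
a warning, while en never has to become an alias for anything.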

Sorry to go on at such length, but I think the issues are going to have
to be fully aired, if not fully decided, so we can make sure that 3066bis
is preadapted to what's coming.

> I suspect that 'range as alias' is the most likely
> mechanism. It is certainly more flexible (a superset) of the other
> solutions that leap to mind. But it also doesn't solve all problems
> (witness the matching example).

Nothing does.

-- 
Mark Twain on Cecil Rhodes:                     John Cowan
"I admire him, I freely admit it,               http://www.ccil.org/~cowan
 and when his time comes I shall                http://www.reutershealth.com
 buy a piece of the rope for a keepsake."       jcowan at reutershealth.com