ID for language-invariant strings

Fri Mar 14 19:16:08 CET 2008

> From: John Cowan [mailto:cowan at ccil.org]

> > For instance, suppose I need to apply language tags to each of the data
> > elements in the main ISO 639-3 code table. For data in columns like
> > the 639-3 ID, clearly "zxx" applies: the alpha-3 identifiers have no
> > linguistic content.
>
> Except when they do:  the tag "yue" is simply the Mandarin name for
> Cantonese.

No, I very much disagree: language identifiers/tags are non-linguistic strings. In the particular case of "yue", it *incidentally* happens to take the same form as one of the linguistic names for the concept being referenced; but across the board the alpha-3 IDs in ISO 639 are non-linguistic, symbolic IDs.

> > But what about the reference names? "zxx" would be a decidedly bad
> > choice for that column, IMO, since every single data element is
> > definitely linguistic in nature.
>
> Linguistic in origin, but not in purpose: the names mostly look like
> English, and many of them are in fact English in origin; but they are
> not there because they are English, but (once again) by fiat of the RA.
> So these names too are, according to my argument, non-linguistic: they are
> in essence arbitrary tokens that happen to be mnemonic for anglophones.

No, they are *not* completely arbitrary tokens. We would never, for instance, assign a reference name of "LANG QX13PB6". We want them to be linguistic in nature because we want humans consulting the code table to be able to recognize the intended semantics directly or, failing that, to cross-reference the names to other documentation that indicates the intended semantics.

> To re-use an example I have given elsewhere:  "if time-of-day is equal to
> 1200, then move money to account" is English, if somewhat stilted English.
> In its context of use, though, it is Cobol, and should be tagged "zxx",
> not "en".

Yes, I see the point you're making. And the Cobol example has a lot of similarity to the application scenarios I've described. But I'm not fully convinced. In particular, programming languages are definitionally out of scope for any ISO 639 ID other than "zxx", and Cobol is pretty exceptional in its similarity to a human language.

Peter

>
> > I don't know why people are so averse to new special-purpose code
> > elements when there is a reasonable need. It's not like there are a lot
> > of different special-case semantics that are needed in language-tagging
> > application scenarios; I think the set is very small, perhaps even
> > that this is the only important gap. I am *far* more concerned about
> > overloading tags with distinct, orthogonal semantics for particular
> > application scenarios ("und" means X in this application but Y in that
> > application): *that* can lead to serious trouble.
>
> The answer, in a word, is creeping featurism.  Saying "the set is very
> small" is sheer unfounded speculation: every time someone tries to do
> something new, another wannabe tag appears.
>
> >          Note: for application scenarios in which an identifier
> >          string is unambiguously non-linguistic in nature, "zxx"
> >          should be used rather than "zxn".
>
> I think the examples above would tend to shake the notion that
> "unambiguously non-linguistic" is a meaningful expression.
>
> --
> That you can cover for the plentiful            John Cowan
> and often gaping errors, misconstruals,         http://www.ccil.org/~cowan
> and disinformation in your posts                cowan at ccil.org
> through sheer volume -- that is another
> misconception.  --Mike to Peter