ID for language-invariant strings

Fri Mar 14 16:49:31 CET 2008

> From: ietf-languages-bounces at alvestrand.no [mailto:ietf-languages-
> bounces at alvestrand.no] On Behalf Of Doug Ewell
> Sent: Thursday, March 13, 2008 11:16 PM
> To: ietf-languages at iana.org
> Subject: Re: ID for language-invariant strings

> ["zxx" is] a "less bad" fit than the other choices:
>
> zxx - content is not linguistic in nature
> und - content is in an undetermined language
> mis - content is in an otherwise uncoded language
> i-default - content is in a default, fallback language intelligible to
> anglophones
>
> I agree that inventing a new code element/subtag for this situation
> would be undesirable.

If it's less bad, I still think it's kind of bad.

For instance, suppose I need to apply language tags to each of the data elements in the main ISO 639-3 code table. For data in columns like the 639-3 ID, clearly "zxx" applies: the alpha-3 identifiers have no linguistic content. But what about the reference names? "zxx" would be a decidedly bad choice for that column, IMO, since every single data element is definitely linguistic in nature.
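
To make the column example concrete, here is a minimal Python sketch. The column names and the sample row are hypothetical, made up purely for illustration; the tag meanings are the ones listed in the quoted message. It only shows where the gap sits: the ID column has an obvious tag, the reference-name column does not.

```python
# Meanings of the existing special-purpose tags, as listed in the quoted message.
SPECIAL_TAGS = {
    "zxx": "content is not linguistic in nature",
    "und": "content is in an undetermined language",
    "mis": "content is in an otherwise uncoded language",
    "i-default": "default, fallback language intelligible to anglophones",
}

# Hypothetical column names for the ISO 639-3 code table.
# None marks a column for which none of the existing tags is a good fit.
COLUMN_TAGS = {
    "id": "zxx",       # alpha-3 identifiers have no linguistic content
    "ref_name": None,  # names are linguistic in nature, so "zxx" is a bad choice
}


def tag_row(row):
    """Pair each cell of a row with the tag chosen for its column."""
    return {col: (value, COLUMN_TAGS[col]) for col, value in row.items()}


# Hypothetical sample row, not real table data.
tagged = tag_row({"id": "abc", "ref_name": "Example Name"})
```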

I don't know why people are so averse to new special-purpose code elements when there is a reasonable need. It's not like there are a lot of different special-case semantics that are needed in language-tagging application scenarios; I think the set is very small, perhaps even that this is the only important gap. I am *far* more concerned about overloading tags with distinct, orthogonal semantics for particular application scenarios ("und" means X in this application but Y in that application): *that* can lead to serious trouble.

As I think about this, I'm inclined to propose a new special-purpose ID "zxn" in ISO 639:

ID: zxn
Reference name: language-neutral content
Comment: This ID is provided primarily for application scenarios
         in which a language identifier must be declared for
         content that may be linguistic in nature but that is
         used as a language-neutral identifier to reference or
         index other information objects.

         Uses of this code element do not make any declaration
         regarding the actual language of a given data element
         or whether a given data element is, in fact,
         linguistic in nature.

         Note: for application scenarios in which an identifier
         string is unambiguously non-linguistic in nature, "zxx"
         should be used rather than "zxn".

         For example, in a database of coding elements for
         cultural objects that includes for each such object a
         code element such as an alpha-3 string (e.g., "abc")
         and a reference name (e.g., "PIANO", "GUQIN"), the
         language identifier applied to the code element
         should be "zxx", but "zxn" may be applied to the
         reference names.

         Applications may also use "zxn" for content that is
         linguistic in nature but that is represented in a
         language-neutral form. For example, the concept 'ten'
         is linguistic in nature but can be expressed in the
         language-neutral form "10". Such use of "zxn" should
         be considered only for application scenarios that
         have a particular need; this usage is not recommended
         in general. For instance, if a software application
         needs to segment the strings in a document into items
         that get passed to various language-specific processes
         and it must apply a language identifier to language-
         neutral content such as numbers represented as digits,
         then "zxn" may be used within that application; but it
         is not expected that content authors would apply "zxn"
         to numbers in their documents in general.
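
As a rough illustration of the segmentation scenario in the last paragraph of the comment, here is a minimal Python sketch. The function name and the choice to treat only runs of digits as language-neutral are assumptions made for the example, and "zxn" is only the code element proposed in this message, not an assigned ISO 639 identifier.

```python
import re

# "zxn" is the code element proposed in this message, not an assigned code.
LANGUAGE_NEUTRAL = "zxn"


def segment(text, doc_lang):
    """Split text into (segment, tag) pairs.

    Runs of digits are tagged "zxn"; everything else keeps doc_lang.
    """
    pieces = []
    # With one capture group, re.split alternates non-matches (even indexes)
    # and matched digit runs (odd indexes).
    for index, part in enumerate(re.split(r"(\d+)", text)):
        if not part:
            continue  # skip empty strings at the edges
        tag = LANGUAGE_NEUTRAL if index % 2 == 1 else doc_lang
        pieces.append((part, tag))
    return pieces
```

A language-specific process would then be handed only the segments carrying its own tag, and the "zxn" segments would be routed to whatever language-independent handling the application has.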

Peter