ID for language-invariant strings
Peter Constable
petercon at microsoft.com
Fri Mar 14 16:49:31 CET 2008
> From: ietf-languages-bounces at alvestrand.no [mailto:ietf-languages-
> bounces at alvestrand.no] On Behalf Of Doug Ewell
> Sent: Thursday, March 13, 2008 11:16 PM
> To: ietf-languages at iana.org
> Subject: Re: ID for language-invariant strings
> ["zxx" is] a "less bad" fit than the other choices:
>
> zxx - content is not linguistic in nature
> und - content is in an undetermined language
> mis - content is in an otherwise uncoded language
> i-default - content is in a default, fallback language intelligible to
> anglophones
>
> I agree that inventing a new code element/subtag for this situation
> would be undesirable.
If it's less bad, I still think it's kind of bad.
For instance, suppose I need to apply language tags to each of the data elements in the main ISO 639-3 code table. For data in columns like the 639-3 ID, clearly "zxx" applies: the alpha-3 identifiers have no linguistic content. But what about the reference names? "zxx" would be a decidedly bad choice for that column, IMO, since every single data element is definitely linguistic in nature.
I don't know why people are so averse to new special-purpose code elements when there is a reasonable need. It's not like there are a lot of different special-case semantics that are needed in language-tagging application scenarios; I think the set is very small, perhaps even that this is the only important gap. I am *far* more concerned about overloading tags with distinct, orthogonal semantics for particular application scenarios ("und" means X in this application but Y in that application): *that* can lead to serious trouble.
As I think about this, I'm inclined to propose a new special-purpose ID "zxn" in ISO 639:
ID: zxn
Reference name: language-neutral content
Comment: This ID is provided primarily for application scenarios
in which a language identifier must be declared for
content that may be linguistic in nature but that is
used as a language-neutral identifier to reference or
index other information objects.
Uses of this code element do not make any declaration
regarding the actual language of a given data element
or of whether a given data element is, in fact,
linguistic in nature.
Note: for application scenarios in which an identifier
string is unambiguously non-linguistic in nature, "zxx"
should be used rather than "zxn".
For example, in a database of coding elements for
cultural objects that includes for each such object a
code element such as an alpha-3 string (e.g., "abc")
and a reference name (e.g., "PIANO", "GUQIN"), the
language identifier applied to the code element
should be "zxx", but "zxn" may be applied to the
reference names.
Applications may also use "zxn" for content that is
linguistic in nature but that is represented in a
language-neutral form. For example, the concept 'ten'
is linguistic in nature but can be expressed in the
language-neutral form "10". Such use of "zxn" should
be considered only for application scenarios that
have a particular need; this usage is not recommended
in general. For instance, if a software application
needs to segment the strings in a document into items
that get passed to various language-specific processes
and it must apply a language identifier to language-
neutral content such as numbers represented as digits,
then "zxn" may be used within that application; but it
is not expected that content authors would apply "zxn"
to numbers in their documents in general.
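To make the two scenarios in the draft comment concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: "zxn" is just the identifier proposed in this message, and the function names tag_record and segment are hypothetical, not part of any existing library or standard.

```python
import re

# "zxx" is the existing code quoted earlier in this thread.
# "zxn" is ONLY the identifier proposed in this message (an assumption here).
NON_LINGUISTIC = "zxx"
LANGUAGE_NEUTRAL = "zxn"


def tag_record(code, reference_name):
    """Tag one row of a code table (hypothetical helper).

    The alpha-3 code element is unambiguously non-linguistic, so it gets
    "zxx"; the reference name is used as a language-neutral identifier,
    so it gets the proposed "zxn".
    """
    return [(code, NON_LINGUISTIC), (reference_name, LANGUAGE_NEUTRAL)]


def segment(text, default_lang):
    """Split text into digit runs and non-digit runs (hypothetical helper).

    Digit runs are tagged with the proposed "zxn"; everything else keeps
    the language tag supplied by the caller.
    """
    segments = []
    for piece in re.findall(r"\d+|\D+", text):
        # str.isdecimal() matches the same characters as \d for str patterns.
        lang = LANGUAGE_NEUTRAL if piece.isdecimal() else default_lang
        segments.append((piece, lang))
    return segments


if __name__ == "__main__":
    print(tag_record("abc", "PIANO"))
    print(segment("I have 10 pianos", "en"))
```

The second function reflects the narrower usage: it is something an application might do internally when routing strings to language-specific processes, not something content authors would be expected to apply by hand.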
Peter