ID for language-invariant strings

Mon Mar 17 23:26:28 CET 2008

Karen: I suggested “no linguistic content” on the understanding that the audio and subtitle streams were all tagged separately, and that it would be an audio stream, not the film as a whole, that was declared “no linguistic content”.

Peter

From: Karen_Broome at spe.sony.com [mailto:Karen_Broome at spe.sony.com]
Sent: Monday, March 17, 2008 2:25 PM
To: Peter Constable
Cc: ietf-languages at iana.org
Subject: RE: ID for language-invariant strings

The "zxx" tag started with my query into how I should classify the "audio content" of a silent film in a system designed to serve non-silent films where a language code is required. Peter suggested "zxx = no linguistic content" and registered it.

I felt that it might be better to use the industry terminology "silent" and employ a free tag in the "Q" space of ISO 639-2. While there was "no linguistic content" on that audio channel, there was certainly a plot that could be determined from watching the film even if the title cards were removed (a "title card" is an interstitial used to display the text in a silent film). To describe our wonderful heritage of silent films as having no linguistic content just seemed a bit cruel. I was willing to go with "not applicable" but could not recommend the use of "zxx = no linguistic content" for this purpose.

When it was later suggested that "zxx" should be used to mark up code fragments appearing in a tutorial written in English, I was even more opposed to the "non-linguistic" semantic. I wasn't the only one who complained that code -- especially in the context of a technical tutorial -- is primarily meant to be read by humans, not machines. An assistive device such as a Braille screenreader would want to represent that text as language, not skip over it because it's non-linguistic in nature. Binary junk data is the only thing I can think of that is truly non-linguistic.

Any chance we could broaden the semantic of the "zxx" tag? I still think we did the wrong thing here and that a "not applicable" semantic is more appropriate for all the use cases mentioned.

        http://lists.w3.org/Archives/Public/www-international/2007AprJun/0187.html -- one previous post on the topic

Side note: I find the IETF archives very hard to search or I could have produced a better example. Am I missing a search interface somewhere? (Reply offlist.)

Regards,

Karen Broome

Peter Constable <petercon at microsoft.com> wrote on 03/14/2008 01:37:30 PM:

> If “zxx” were “not applicable”, I would not have any reservation
> about semantic overloading for the application scenarios I have in
> mind now. Funny, I really have no recollection of you suggesting
> that at that time. (Sorry.)
>
>
> Peter
>
> From: Karen_Broome at spe.sony.com [mailto:Karen_Broome at spe.sony.com]
> Sent: Friday, March 14, 2008 12:51 PM
> To: Peter Constable
> Cc: ietf-languages at iana.org
> Subject: RE: ID for language-invariant strings
>
>
> I can keep restating the point I've made from the beginning. The
> semantic for "zxx" should have been defined as "not applicable"
> which was the use case presented at the time it was created. Since
> it was not expressed in this way, now we need another tag, I think.
>
> Regards,
>
> Karen Broome
> Metadata Systems Designer
> Sony Pictures Entertainment
> 310.244.4384
>
> ietf-languages-bounces at alvestrand.no wrote on 03/14/2008 08:49:31 AM:
>
> > > From: ietf-languages-bounces at alvestrand.no [mailto:ietf-languages-
> > > bounces at alvestrand.no] On Behalf Of Doug Ewell
> > > Sent: Thursday, March 13, 2008 11:16 PM
> > > To: ietf-languages at iana.org
> > > Subject: Re: ID for language-invariant strings
> >
> > > ["zxx" is] a "less bad" fit than the other choices:
> > >
> > > zxx - content is not linguistic in nature
> > > und - content is in an undetermined language
> > > mis - content is in an otherwise uncoded language
> > > i-default - content is in a default, fallback language intelligible to
> > > anglophones
> > >
> > > I agree that inventing a new code element/subtag for this situation
> > > would be undesirable.
> >
> > If it's less bad, I still think it's kind of bad.
> >
> > For instance, suppose I need to apply language tags to each of the
> > data elements in the main ISO 639-3 code table. For data in columns
> > like the 639-3 ID, clearly "zxx" applies: the alpha-3 identifiers
> > have no linguistic content. But what about the reference names?
> > "zxx" would be a decidedly bad choice for that column, IMO, since
> > every single data element is definitely linguistic in nature.
> >
> > I don't know why people are so averse to new special-purpose code
> > elements when there is a reasonable need. It's not like there are a
> > lot of different special-case semantics that are needed in language-
> > tagging application scenarios; I think the set is very small,
> > perhaps even that this is the only important gap. I am *far* more
> > concerned about overloading tags with distinct, orthogonal semantics
> > for particular application scenarios ("und" means X in this
> > application but Y in that application): *that* can lead to serious trouble.
> >
> > As I think about this, I'm inclined to propose a new special-purpose
> > ID "zxn" in ISO 639:
> >
> > ID: zxn
> > Reference name: language-neutral content
> > Comment: This ID is provided primarily for application scenarios
> >          in which a language identifier must be declared for
> >          content that may be linguistic in nature but that is
> >          used as a language-neutral identifier to reference or
> >          index other information objects.
> >
> >          Uses of this code element do not make any declaration
> >          regarding the actual language of a given data element
> >          or of whether a given data element is, in fact,
> >          linguistic in nature.
> >
> >          Note: for application scenarios in which an identifier
> >          string is unambiguously non-linguistic in nature, "zxx"
> >          should be used rather than "zxn".
> >
> >          For example, in a database of coding elements for
> >          cultural objects that includes for each such object a
> >          code element such as an alpha-3 string (e.g., "abc")
> >          and a reference name (e.g., "PIANO", "GUQIN"), the
> >          language identifier applied to the code element
> >          should be "zxx", but "zxn" may be applied to the
> >          reference names.
> >
> >          Applications may also use "zxn" for content that is
> >          linguistic in nature but that is represented in a
> >          language-neutral form. For example, the concept 'ten'
> >          is linguistic in nature but can be expressed in the
> >          language-neutral form "10". Such use of "zxn" should
> >          be considered only for application scenarios that
> >          have a particular need; this usage is not recommended
> >          in general. For instance, if a software application
> >          needs to segment the strings in a document into items
> >          that get passed to various language-specific processes
> >          and it must apply a language identifier to language-
> >          neutral content such as numbers represented as digits,
> >          then "zxn" may be used within that application; but it
> >          is not expected that content authors would apply "zxn"
> >          to numbers in their documents in general.
> >
> >
> >
> > Peter
> > _______________________________________________
> > Ietf-languages mailing list
> > Ietf-languages at alvestrand.no
> > http://www.alvestrand.no/mailman/listinfo/ietf-languages
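
Editorial illustration (not part of the original messages): the segmentation scenario described in the quoted proposal could look roughly like the Python sketch below. It is only a sketch under stated assumptions. "zxn" is the identifier proposed in this thread, not a registered code element; the function name tag_segments and the ASCII-digits-only rule are hypothetical choices made for the example.

```python
import re

# "zxn" is the identifier PROPOSED in this thread ("language-neutral
# content"). It is not a registered code; it is used here for illustration.
PROPOSED_NEUTRAL = "zxn"


def tag_segments(text, doc_lang):
    """Split text into runs of ASCII digits and runs of everything else.

    Digit runs get the proposed language-neutral identifier; all other
    runs inherit the language declared for the document.
    Returns a list of (segment, language_id) pairs in document order.
    """
    segments = []
    for match in re.finditer(r"([0-9]+)|([^0-9]+)", text):
        digits, other = match.groups()
        if digits is not None:
            segments.append((digits, PROPOSED_NEUTRAL))
        else:
            segments.append((other, doc_lang))
    return segments


if __name__ == "__main__":
    for segment, lang in tag_segments("Chapter 10 of 12", "en"):
        print(repr(segment), lang)
```

This matches the narrow, application-internal use described in the proposal: the tags are assigned by the software for its own routing, not by a content author marking up numbers in a document.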