[Ltru] Re: "mis" update review request

John Cowan cowan at ccil.org
Sat Apr 21 06:45:32 CEST 2007


Mark Davis scripsit:

> I don't think the programming language fragment is really a boundary
> condition. Most code source nowadays is not just random hex; there is
> typically, not exceptionally, some real linguistic content.

That's a matter of definition: taggers must draw the line somewhere, and
ISO 639 by its nature gives no help.  Is this email in English?  Yes.
Is the OED written in English?  Unquestionably.  Is a mere wordlist
like /usr/share/dict/words in English?  Probably.  What about a shorter
list of English words which on inspection turns out to be the reserved
words of Cobol-68?  Perhaps.  What about the text of _Finnegans Wake_?
Scholars disagree, though several translations have been made.  What about
the source of TeX, which is a complex admixture of English prose (well
beyond line-by-line comments) and Pascal code fragments?  I don't know.

A document in English may be represented in a computer as text or as a
graphic; in either case "en" is applicable.  How large a contribution
to the graphic must letterforms make before "en" is reasonable?  Is a
photograph with some English text shown in the image (on a sign, for
example) a reasonable instance of "en"?  What about a subtitle embedded
in the image?  What about a copyright notice using the word "Copyright",
which is an English word but can be used, by international agreement,
in any copyright notice in place of the copyright symbol?  What if
"Copr." is used, which is also equivalent to the copyright symbol?

There is "en" and there is "cpe", which represents the collection of
pidgin and creole languages with an English lexifier.  639-3 provides 30
such languages, or 32 if you count Bislama and Tok Pisin, which have their
own 639-2 code elements.  However, in classifying spoken-word documents
(whether recordings or transcriptions) made where a creole language
is spoken, one generally finds that they contain an extraordinarily
complex mixture of the local variety of English and of the creole, with
the proportions and details of the mixture varying with the speaker, the
audience, the subject matter, and a host of other factors.  The behavior
of "en" and "sco" both in speech and in writing is essentially similar.
It is and always will be a matter of judgement whether to use "en",
"cpe" (or "sco"), or "mul" in such situations.

In short, "English" is not a bright-line notion.  Formally, all that
ISO 639 tells us is that if you have something that you believe to be
English, you may tag it "en", and a counterparty may reconstruct the fact
that you believe it to be English.  There is a large class of documents
where all will agree that the language is English and that the "en"
tag is appropriate, and another large class where all will agree that
it is inappropriate.  In between there are judgement calls.

> but based on the wording of the standards, I don't think we can expect
> zxx to apply to typical code source.  Yet, while there may be some
> embedded English, we don't want to call it "en" either.

Maybe we do, maybe we don't; it depends on our individual purposes.

> It looks to me like the best choice currently would be "und";

"und" is always a correct choice, as it is a formal expression of
ignorance or indifference or both.

> I think it might be useful to have a special tag for this just
> because it is a reasonably common case that is otherwise difficult
> to categorize.

I don't think you will make any progress with 639/RA on that one, nor
is it appropriate for BCP 47 to add tags for entirely new purposes, nor
do you seem to have much traction from this mailing list for doing so.
A private-use tag, perhaps based on zxx, perhaps not, is surely suitable
for Google's private purposes.
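
To make the private-use option concrete, here is a rough sketch of how
a consumer might pick the private-use part out of a tag.  The subtag
"srccode" is purely hypothetical (nobody has proposed it), and the
check covers only the shape of the private-use sequence in the BCP 47
grammar ("x" followed by subtags of 1 to 8 alphanumerics); it does not
validate the rest of the tag.

```python
import re

# Shape of a private-use sequence: "x" followed by one or more
# subtags of 1 to 8 alphanumeric characters each.
PRIVATE_USE = re.compile(r"x(-[a-z0-9]{1,8})+")

def private_use_part(tag):
    """Return the private-use sequence of a tag, or None if absent.

    Only the private-use portion is checked; whatever precedes it
    (for example "zxx") is assumed to be well-formed already.
    """
    lowered = tag.lower()
    if PRIVATE_USE.fullmatch(lowered):
        return lowered  # the whole tag is private use
    head, sep, tail = lowered.partition("-x-")
    if sep and PRIVATE_USE.fullmatch("x-" + tail):
        return "x-" + tail
    return None

# "srccode" is a made-up subtag used only for illustration.
print(private_use_part("zxx-x-srccode"))  # based on zxx
print(private_use_part("x-srccode"))      # not based on zxx
print(private_use_part("en"))             # no private-use part
```

Either form is a matter for private agreement between the tagger and
whoever consumes the tag; nothing here depends on 639/RA or on the
registry.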

> alternative would be to explicitly broaden the description of "zxx" to be
> "no linguistic content, or programming source code". That would be a
> compatible change to 4646bis, since it is a broadening.

It may or may not be a broadening, depending on whether you think
programming source code has linguistic content or not.

-- 
If you have ever wondered if you are in hell,         John Cowan
it has been said, then you are on a well-traveled     http://www.ccil.org/~cowan
road of spiritual inquiry.  If you are absolutely     cowan at ccil.org
sure you are in hell, however, then you must be
on the Cross Bronx Expressway.          --Alan Feuer, NYTimes, 2002-09-20

