New Last Call: 'Tags for Identifying Languages' to BCP

Mon Dec 13 04:40:20 CET 2004

>  Date: 2004-12-12 19:20
>  From: Mark Crispin <mrc at cac.washington.edu>
>  To: ietf-languages at alvestrand.no, ietf at ietf.org
>  
> On Sun, 12 Dec 2004, Bruce Lilly wrote:
> > If by international agreement, 'yz' becomes the designation
> > for that country, then it is rather silly to stick one's
> > fingers in one's ears and shout "NA-NA-NA-NA-NA I don't want
> > to hear you".
> 
> What is silly is saying that every language tag has to have a date/time 
> attribute associated with it so that computer software managing that text 
> knows the language of that text.

In the specific cases of the core Internet protocols that
I have mentioned, there *is* a date/time attribute in the
form of an RFC [2]822 Date field.  If we're talking about
some file stored on some machine, every OS that I know of
has a date/time stamp associated with that file.  If you
have something else in mind, a concrete description and/or
example might help.
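
To make the point about the Date field concrete, here is a
minimal sketch in Python. It assumes a message that carries
both an RFC [2]822 Date field and a Content-Language field.
The sample message, its field values, and the helper name
tag_with_date are invented for illustration and come from no
real message or specification.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical message; the header values are made up for this sketch.
RAW_MESSAGE = """\
Date: Sun, 12 Dec 2004 19:20:00 -0800
Content-Language: sr-CS
Subject: example

Body text.
"""


def tag_with_date(raw):
    """Return (language tag, sending date) taken from the message headers."""
    msg = message_from_string(raw)
    # The Date field supplies the date/time context for the tag.
    sent = parsedate_to_datetime(msg["Date"])
    return msg["Content-Language"], sent


tag, sent = tag_with_date(RAW_MESSAGE)
print(tag, sent.isoformat())
```

Software that reads the tag from such a message already has
the date at hand without anything extra being put inside the
tag itself.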

> It is a disaster for language identifiers to get recycled.  Something has 
> to make those identifiers unique.  Your notion will force the inclusion of 
> a date/time stamp in language tags, to restore the uniqueness that you are 
> so excruciatingly eager to abolish.

I'm not "eager to abolish" "uniqueness".  There never was
any guarantee that codes would never change. Both RFCs
1766 and 3066 specifically mention changes as a fact of
life.

> > Never
> > mind the shortcomings of that particular example; consider
> > "de-DE" -- does that mean Germany as it exists today, West
> > Germany as it existed 25 years ago, Germany as it existed
> > in the 1930s, the 1900s, ...?
> 
> For the 98% case, it does not matter at all.
> 
> But it does matter if, one day, "DE" becomes Denmark.

In either case, understanding precisely what geographical
area is referred to requires knowing the date to a greater
or lesser degree of accuracy.

> > As far as I can tell, the draft pretends that the meaning
> > of "CS" hasn't changed, and would in fact change the meaning
> > of the currently valid RFC 3066 language tag "sr-CS".
> 
> No, it restores the previous meaning of sr-CS.

But what of the current meaning under the current
standard (RFC 3066 + ISO 639 + ISO 3166)?  Surely
the draft would change the meaning of that valid
RFC 3066 language-tag.

> > It is very different; under the proposed draft, there is only
> > an English definition, somebody wishing to provide a French
> > definition finds that he has none and must resort to an
> > unofficial translation.
> 
> Why is the situation for French different from somebody wishing to 
> provide a Lower Slobbobian definition?

French is an official language used by the ISO in its
publications.  "Lower Slobbobian" is probably about as
meaningful as "BLURDYBOOP".

> > So where are the French definitions?
> 
> Ask a person who is bilingual in English and French to provide one.

That would lack the definitiveness that characterizes the
ISO lists.

> > Well, sure. But the name is an important thing by itself.
> > It is rather pointless to ask a user to indicate the
> > language of a piece of text by selecting from a list "AB, ACE,
> > ACH,..., ZHA, ZUL, ZUN" -- the user doesn't normally refer to
> > languages by codes. It's quite a different matter to ask the
> > user to select from "Abkhaze, Aceh, Acoli,..., Zhuang (Chuang),
> > Zoulou, Zuni".
> 
> Abkhaze, Aceh, Acoli,..., Zhuang (Chuang), Zoulou, and Zuni are not 
> language tags.  So what's your point?

They are the human-readable names corresponding to codes.
For interoperability, it is insufficient to label any and
all languages as "ZZ" with no definition of what "ZZ"
means. Moreover, it is necessary for two (or more) communicating 
parties to *agree* on the meaning of "ZZ"; that is done
by assigning the code "ZZ" to an agreed-upon name.  The
code "ZZ" is nothing more than shorthand for that agreed-upon
name.  If one produces some text in the BCP 18 sense of "text"
(spoken, written, signed, etc.), it is useful to indicate
the language of that text; languages are known to humans by
names of languages -- the codes are, as noted, merely
shorthand for those names.  Likewise, somebody presented
with some text may desire or need to know the language of
that text; informing that person that the language has code
"QZ" is unlikely to mean anything to most people -- only
the name corresponding to the shorthand code is likely to
be meaningful to persons other than those involved in
standardizing the codes.
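
As a rough sketch of what "shorthand for an agreed-upon name"
means in software, consider a lookup table keyed by code,
with one name per definition language. The table NAMES and
the function display_name are hypothetical. The French names
are the ones quoted above; the English names are given from
memory and should be checked against the published ISO list.

```python
# Illustrative excerpt only, not an authoritative registry.
NAMES = {
    "ab": {"en": "Abkhazian", "fr": "Abkhaze"},
    "ace": {"en": "Achinese", "fr": "Aceh"},
    "zul": {"en": "Zulu", "fr": "Zoulou"},
}


def display_name(code, ui_lang="en"):
    """Return the name for a code in ui_lang, or None if there is none."""
    entry = NAMES.get(code.lower())
    if entry is None:
        return None  # a code with no agreed-upon name conveys nothing
    return entry.get(ui_lang)


print(display_name("ZUL", "fr"))
print(display_name("QZ"))
```

A code absent from the table, or a table lacking names in the
reader's language, leaves nothing meaningful to show the user.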

> >> Note that RFC 3066 specifies a registry that does not include French
> >> language names. I suggest that this issue should be dropped.
> > Yes, the current IANA registry has that problem for
> > the non-ISO-based tags only. If the registry is to be
> > changed to subsume ISO codes as well, that defect should
> > be remedied.
> 
> Why is it a problem?  Why is it a defect?

Because it unnecessarily reduces by 50% the information
content currently available.

> > On the contrary, it is preposterous to suggest that codes
> > will be attached to text by magic
> 
> Here is where you are misled.  Many of these tags are embedded within the 
> text itself.  That text may long outlive its author in an archive.

Which is precisely why the code by itself is meaningless
without the associated language name.  If I write "blurfl
(lang=QZ)" in a hypothetical diary, that will be
incomprehensible unless the meaning of "QZ" is known.

You have not explained how the code came to be "embedded
within the text itself" -- surely the author didn't say
(or write, or sign) "this text is in language QZ"; most
likely the language was indicated by name, or by some proxy
representing the name (such as a locale).