draft-phillips-langtags-08, process, specifications, "stability", and extensions

Thu Jan 6 16:37:27 CET 2005

> At 06:44 -0800 2005-01-06, ned.freed at mrochek.com wrote:

> >No, what there has been is a lot of discussion of a real problem with no
> >apparent recognition of it as such by the draft authors.

> I have no idea what the problem is, unless we are back at a bunch of
> people who want these tags and subtags to do more than what they are
> supposed to do, which is to identify varieties of language.

The ability of language tags to precisely express language varieties is of great
interest to many communities. However, there are other communities that have
other interests in these tags. For example, there are many applications where
language tags are used as part of a content selection strategy. For such
applications the ability of these tags to express subtle distinctions in
languages is of secondary interest at best; the ability to match tags and
assess the relative value of a particular match is much more important.

I hope you can see, or can come to see, that there's more at stake here than
simply identifying things.

Perhaps it would help for me to describe an application of language
tags I have to deal with. The goal is to generate a message based on two
sets of information:

(1) An LDAP directory entry containing a number of attributes describing
    various things to put in the message, e.g. the author's name, the
    subject line, the message content, etc. LDAP supports tagged
    attribute variants, and one distinguished type of tag (in fact
    it is the ONLY form that's actually used in my experience) is
    a language tag. So the entry might have, say, an en variant, a
    fr-FR variant, and an fr-CA variant for the subject line.

(2) Some information about the message recipient. Exactly what's
    available varies, but can include:

    (1) One or more lists of preferred languages the recipient has
        provided.
    (2) A list of languages used in the message this message is replying
        to.
    (3) An explicit indication of the recipient's country of origin.
    (4) A domain name, possibly containing a top-level country code.

The subtask, then, is to select the right set of attributes from the LDAP
entry using whatever information is available about the recipient.

Of course varying amounts of effort can be brought to bear on this problem.
The options range from selecting some set of attributes at random and paying
no attention to any recipient information, through selecting based on exact
tag match, selecting based on longest match, and selecting with country
information taken into consideration along with language tag information, up
to table-driven schemes that allow for complex specification of matching
interactions.
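
As a rough illustration of the exact-match and longest-match options above,
here is a minimal Python sketch. The function name select_variant and the
sample data are hypothetical, made up for this example only; it is one
plausible way to do it, not a description of any actual product.

```python
def select_variant(available, preferred):
    """Pick a tag from `available` using longest-match selection.

    `available` holds the language tags present in the directory entry;
    `preferred` is the recipient's language list, most preferred first.
    Returns the chosen tag, or None if nothing matches.
    """
    # Language tags are case-insensitive, so compare lowercased forms.
    by_lower = {tag.lower(): tag for tag in available}
    for want in preferred:
        subtags = want.lower().split("-")
        # Try the full tag first (exact match), then drop trailing subtags.
        while subtags:
            candidate = "-".join(subtags)
            if candidate in by_lower:
                return by_lower[candidate]
            subtags.pop()
    return None


variants = ["en", "fr-FR", "fr-CA"]
print(select_variant(variants, ["fr-CA", "en"]))  # fr-CA
print(select_variant(variants, ["en-GB"]))        # en
print(select_variant(variants, ["de"]))           # None
```

Note that this sketch only truncates the recipient's tag; a bare "fr"
preference would not pick up fr-FR or fr-CA. Handling that case is exactly
where country information or a table would have to come in.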

It is also fair to say that anything beyond random selection is probably
acceptable to quite a few users. But as you know, there are people out there
who take this stuff very, very seriously, and who will favor products that
go the extra mile. Even so, the table-driven approach is more effort than
it's worth in most cases. Longest match, either with or without country
code considerations, is a very reasonable choice, however.

In any case, the recurring issue with country codes is simply this: Anyone who
wanted to process them as such, based their code on 3066, and didn't use
tables wrote code that treated the second subtag, and the second subtag only,
as a country code. The new specification changes this principle, and that
breaks implementations done in good faith to the old specification, in the
process potentially creating interop problems.
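
To make the breakage concrete, here is a minimal sketch of the kind of
table-free code described above. The function name is hypothetical, and the
tag zh-Hant-TW is an illustrative assumption about what a tag looks like when
something other than the country occupies the second position.

```python
def country_from_tag_3066(tag):
    """Return the country code the way a table-free RFC 3066 reader would.

    Only the second subtag is examined; it is treated as a country code
    when it is exactly two letters long. Returns None otherwise.
    """
    subtags = tag.split("-")
    if len(subtags) >= 2 and len(subtags[1]) == 2 and subtags[1].isalpha():
        return subtags[1].upper()
    return None


print(country_from_tag_3066("fr-CA"))       # CA
print(country_from_tag_3066("zh-Hant-TW"))  # None: the country is now third
```

Code like this does not fail loudly on the second tag; it silently loses the
country information it was written to find.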

Additionally, arguments that such implementations are not much in evidence, let
alone in abundance, do not pass muster. The Internet is a very big place and
RFCs travel we know not where. As the author of any number of RFCs I am
contacted, not on a daily basis but close, by people with questions from all
over the world implementing all sorts of wild stuff. (It doesn't help that my
email address is the only remaining valid one in several of the documents.) I
knew none of these people prior to their contacting me, and none of them
participate directly in the IETF, but they are writing code based on RFCs
nevertheless. We
do this community a huge disservice when we change things for no good reason,
and it really isn't possible for us to assess the amount of damage we cause
when we do this sort of thing.

> Almost everything that comes through to this list I have been
> deleting unread for months, as it has nothing to do with my function
> here.

Well, perhaps it doesn't matter to you, but I believe registrations should take
into account the need to perform matching operations. The way the various sign
languages were registered, for example, is somewhat problematic for matching
algorithms. In retrospect I think it was a mistake and the failure to consider
the matching issue as part of the registration was at least part of the cause.

> It goes on, and on, and on, and on. And now it's getting nasty
> and personal.

Indeed it is. I have to say I do not at all appreciate being told that my
comments are "noise" and that my concerns are "overblown". (I'm using myself as
an example here only; I'm in no way claiming I'm the only one who has been
subjected to such characterizations, nor am I claiming I've suffered the worst
abuse.) The temptation to reply in kind has been very strong; I hope I have not
given in to it.

> Perhaps we should NOT revise the damn thing at all, and stick with
> the current RFC.

Well, to be perfectly blunt, that would be a fine outcome as far as I'm
concerned. However, I happen to believe that a balance needs to be struck
between the desire for accurate, structured language labels and the need to be
able to easily process those labels in a backwards-compatible manner. I do
believe that 3066 hit a sweet spot in this regard; however, I'm receptive to
changes to increase the structure and/or expressiveness of these labels as long
as we don't have to sacrifice backwards compatibility or processability along
the way.

				Ned