Language tags in IPP (was: Re: [Suppress-Script] Initial list of 300 languages)

Mon Mar 13 19:46:00 CET 2006

> I'll go a bit further; the notion that a specification should
> communicate charset mapping by using language tags instead of charset
> tags is bizarre, and doomed to fail. Encountering a script tag is the
> least of the problems to expect.

On the contrary, it is commonly done in a variety of applications and works
surprisingly well in practice.

Ira has been talking about IPP. Let me talk about email. Email isn't an
end-to-end protocol and hence has no ability to negotiate anything on the fly. There
are many cases where preferred language information is available but charset
information is not. Finally, "just use UTF-8" is an absolute and complete
nonstarter for a huge number of people, our wishes to the contrary
notwithstanding. (This is gradually changing, but Unicode ubiquity is still a
very long way away.)

As a specific example, suppose I'm generating an over quota message to a local
user. I have templates for this message composed in a variety of languages and
stored in UTF-8. I look up the user, find their preferred language, and use it
both to select the appropriate template and, via an ancillary table, to select
the charset to translate the UTF-8 to.
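
A rough sketch of that lookup in Python, purely for illustration. The table
entries, template text, and function names are all hypothetical and are not
taken from any actual product; the fallback to UTF-8 when nothing matches is
likewise an assumption.

```python
# Hypothetical language -> charset table; the entries are illustrative only.
LANGUAGE_CHARSETS = {
    "ja": "iso-2022-jp",
    "ru": "koi8-r",
    "fr": "iso-8859-1",
    "en": "us-ascii",
}

# Hypothetical over-quota templates, stored as UTF-8 bytes.
TEMPLATES = {
    "en": "Your mailbox is over quota.".encode("utf-8"),
    "fr": "Votre boîte aux lettres dépasse son quota.".encode("utf-8"),
}

DEFAULT_LANGUAGE = "en"
FALLBACK_CHARSET = "utf-8"


def lookup_by_tag(table, language, default):
    """Look up a language tag, dropping trailing subtags until one matches."""
    tag = language.lower()
    while tag:
        if tag in table:
            return table[tag]
        # "fr-ca" -> "fr" -> "" (rpartition yields "" when no "-" is left)
        tag = tag.rpartition("-")[0]
    return default


def render_over_quota(preferred_language):
    """Return (charset, body bytes) for a user's preferred language."""
    template = lookup_by_tag(TEMPLATES, preferred_language,
                             TEMPLATES[DEFAULT_LANGUAGE])
    charset = lookup_by_tag(LANGUAGE_CHARSETS, preferred_language,
                            FALLBACK_CHARSET)
    text = template.decode("utf-8")
    try:
        return charset, text.encode(charset)
    except UnicodeEncodeError:
        # The chosen charset cannot represent the template; keep UTF-8.
        return FALLBACK_CHARSET, text.encode(FALLBACK_CHARSET)
```

Calling render_over_quota("fr-CA") picks the French template and the charset
listed for "fr", since no entry exists for the longer tag. The user only ever
supplies the language; the charset never appears in the directory.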

So why wasn't the preferred charset stored in the directory along with the
language? Simple: Because users make these choices, and most users don't know
enough about their own charset usage to make an appropriate selection. They are
able to select the language they want to receive messages in, however.

An obvious tweak is to store the messages in some charset other than UTF-8. But
this amounts to hardcoding the charset choice into the template - you're still
deriving the choice of charset from the language.

The email system I work on is a combination of several different products that
originated from different companies. I believe there used to be three different
pieces of code that did the language->charset mapping, all done at different
times by different implementors working for different companies. (I eliminated
one of these during a code cleanup a few years back.) The fact that different
groups independently arrived at the same outcome speaks very strongly to the
utility of this technique.

As to whether this works well in practice, we have lots of international
customers and while they find lots of things in the i18n realm to complain
about, this isn't one of them, at least not since I made the mapping table
configurable and extensible. (All of the original variants were hard-coded.)
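
To illustrate what "configurable and extensible" could look like, here is a
minimal Python sketch in which site-supplied entries override or extend a
built-in table. The one-pair-per-line file format, the built-in entries, and
the function name are assumptions made up for this example, not a description
of the actual product.

```python
import codecs

# Hypothetical built-in defaults; a site can override or extend them.
BUILTIN_CHARSETS = {
    "ja": "iso-2022-jp",
    "ru": "koi8-r",
}


def load_charset_table(config_text, builtin=BUILTIN_CHARSETS):
    """Merge 'language charset' lines from config_text over the defaults."""
    table = dict(builtin)
    for lineno, raw in enumerate(config_text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected 'language charset'")
        language, charset = fields
        codecs.lookup(charset)  # raises LookupError if the charset is unknown
        table[language.lower()] = charset.lower()
    return table
```

A site that prefers a different Cyrillic charset, or that needs a language the
defaults lack, would then add a line such as "ru windows-1251" or
"zh-Hant big5" rather than waiting for a code change.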

> Let me be clear; I think making the Suppress-Script data as accurate as
> possible is all to the good. But the sky won't fall if Suppress-Script
> is not set; we already clearly state that script and country information
> shouldn't be included unless it is necessary. I frankly don't see that
> people will add scripts to tags unless they are really necessary, in the
> small handfuls of cases like zh-Hant or az-Latn.

A lot of people think "more specific is better". I see lots of "en-us"
floating around in contexts where the "-us" is totally unnecessary and may in
fact be inappropriate.

More generally, I learned long ago that betting on people behaving in a
particular way when it comes to these sorts of choices is an extremely risky
proposition. The best designs try to accommodate a wide range of possible
behavior and don't depend on what we might hope is the likely outcome. That's
why getting this suppress-script stuff nailed down is so important, and why Ira
is absolutely right when he says that 3066bis is in jeopardy if this effort
fails.

				Ned