Language tags in IPP (was: Re: [Suppress-Script] Initial list of 300 languages)

Mark Davis mark.davis at icu-project.org
Tue Mar 14 02:46:40 CET 2006


First, I'm certainly not against the use of heuristics where there is no 
other available information. But they remain only heuristics; even if 
they work "surprisingly well", they are going to fail sometimes. So any 
system is far better off communicating the information explicitly than 
leaving it to a guess. Even with pretty good heuristics, charset 
detection is never going to be 100% accurate, especially with short 
documents, even when a language (or locale) tag is available. But this 
is really getting off topic, and diverting from the real issue at hand:

The original issue was the assertion that disaster would befall us if a 
Suppress-Script were missing. Ira still hasn't provided a convincing, 
complete scenario in which some data is suddenly tagged "zu-Latn-AQ" 
that was formerly only tagged "zh-Latn", and thus causes his printer to 
fail.

Mark

Ned Freed wrote:
>> I'll go a bit further; the notion that a specification should
>> communicate charset mapping by using language tags instead of charset
>> tags is bizarre, and doomed to fail. Encountering a script tag is the
>> least of the problems to expect.
>
> On the contrary, it is commonly done in a variety of applications and
> works surprisingly well in practice.
>
> Ira has been talking about IPP. Let me talk about email. Email isn't an
> end to end protocol and hence has no ability to negotiate anything on the
> fly. There are many cases where preferred language information is
> available but charset information is not. Finally, "just use UTF-8" is an
> absolute and complete nonstarter for a huge number of people, our wishes
> to the contrary notwithstanding. (This is gradually changing, but Unicode
> ubiquity is still a very long way away.)
>
> As a specific example, suppose I'm generating an over quota message to a
> local user. I have templates for this message composed in a variety of
> languages and stored in UTF-8. I look up the user, find their preferred
> language, and use it both to select the appropriate template and, via an
> ancillary table, to select the appropriate charset to translate the
> UTF-8 to.
>
> So why wasn't the preferred charset stored in the directory along with
> the language? Simple: Because users make these choices, and most users
> don't know enough about their own charset usage to make an appropriate
> selection. They are able to select the language they want to receive
> messages in, however.
>
> An obvious tweak is to store the messages in some charset other than
> UTF-8. But this amounts to hardcoding the charset choice into the
> template - you're still deriving the choice of charset from the language.
>
> The email system I work on is a combination of several different products
> that originated from different companies. I believe there used to be
> three different pieces of code that did the language->charset mapping,
> all done at different times by different implementors working for
> different companies. (I eliminated one of these during a code cleanup a
> few years back.) The fact that different groups have arrived at the same
> outcome speaks very strongly as to the utility of this technique.
>
> As to whether this works well in practice, we have lots of international
> customers and while they find lots of things in the i18n realm to
> complain about, this isn't one of them, at least not since I made the
> mapping table configurable and extensible. (All of the original variants
> were hard-coded.)
>
>> Let me be clear; I think making the Suppress-Script data as accurate as
>> possible is all to the good. But the sky won't fall if Suppress-Script
>> is not set; we already clearly state that script and country information
>> shouldn't be included unless it is necessary. I frankly don't see that
>> people will add scripts to tags unless they are really necessary, in the
>> small handfuls of cases like zh-Hant or az-Latn.
>
> A lot of people think "more specific is better". I see lots of "en-us"
> floating around in contexts where the "-us" is totally unnecessary and
> may in fact be inappropriate.
>
> More generally, I learned long ago that betting on people behaving in a
> particular way when it comes to these sorts of choices is an extremely
> risky proposition. The best designs try to accommodate a wide range of
> possible behavior and don't depend on what we might hope is the likely
> outcome. That's why getting this suppress-script stuff nailed down is so
> important, and why Ira is absolutely right when he says that 3066bis is
> in jeopardy if this effort fails.
>
>                 Ned
> _______________________________________________
> Ietf-languages mailing list
> Ietf-languages at alvestrand.no
> http://www.alvestrand.no/mailman/listinfo/ietf-languages
>
>
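
For readers following the thread, the language->charset lookup Ned
describes in the quoted text might be sketched roughly as below. This is
only an illustration under stated assumptions: the table entries, the
names charset_for and encode_message, and the fallback of truncating
subtags from the right are all hypothetical and are not taken from any
product or specification mentioned here.

```python
# Hypothetical, configurable language -> charset table (illustrative entries only).
DEFAULT_CHARSET = "utf-8"
LANGUAGE_CHARSETS = {
    "en": "iso-8859-1",
    "fr": "iso-8859-1",
    "ru": "koi8-r",
    "ja": "iso-2022-jp",
    "zh-hant": "big5",
    "zh-hans": "gb2312",
}


def charset_for(tag, table=LANGUAGE_CHARSETS, default=DEFAULT_CHARSET):
    """Pick a charset for a language tag, dropping trailing subtags until one matches."""
    subtags = tag.lower().replace("_", "-").split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in table:
            return table[candidate]
        subtags.pop()  # e.g. "en-latn-us" -> "en-latn" -> "en"
    return default


def encode_message(template_text, tag, table=LANGUAGE_CHARSETS):
    """Encode a template (held as Unicode text) in the charset chosen for the tag.

    Falls back to UTF-8 if the text cannot be represented in that charset.
    """
    charset = charset_for(tag, table)
    try:
        return template_text.encode(charset), charset
    except UnicodeEncodeError:
        return template_text.encode(DEFAULT_CHARSET), DEFAULT_CHARSET
```

In this sketch a tag carrying an unexpected script subtag, such as
"en-Latn-US", still resolves to the entry for "en" through truncation,
and a tag with no matching entry at all falls back to UTF-8. Passing a
different table argument stands in for the "configurable and extensible"
mapping Ned mentions.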

