Language tags in IPP (was: Re: [Suppress-Script] Initial list of
dewell at adelphia.net
Mon Mar 13 01:13:51 CET 2006
Ira McDonald <imcdonald at sharplabs dot com> wrote:
> For six years, almost all network printers have supported
> IETF IPP/1.1 (RFC 2911), which DOES externally tag the
> language of print data streams - the CUPS spooler that is
> now ubiquitous in Linux distributions, standard in MacOS,
> and common in commercial UNIX distributions uses IPP for
> the print protocol.
I took a look at RFC 2911. Wow, 224 pages. And to think people
criticized the initial-registry draft because the list was 106 pages
long and we wanted it to be an RFC.
From a cursory reading (I didn't have time to slurp in the whole thing),
it looks like IPP uses the language tag to determine the language in
which internal attributes are expressed and status messages are to be
It also seems to have some interrelationship with the character set of
the print job, which seems wrong to me; figuring out which character
repertoires are necessary for which natural languages is a decidedly
non-trivial effort (ask Michael, who has done this work for the European
> If you send your print job in Unicode (UTF-8 or UTF-16) to
> your laser printer _and_ the printer has sufficient fonts
> installed (for the necessary scripts), bad things won't
> happen. But if your print data is in a legacy charset
> (like almost all existing documents in the world), then
> bad things will begin to happen when unsupported 'script'
> subtags are infixed in language tags.
Again, I'm not a fan of the idea of determining character repertoires on
the basis of natural language. And I'm disappointed if, today in 2006,
it is still safe to assume that "almost all existing documents in the
world" are not in UTF-8 or another Unicode character encoding.
Section 4.1.2.3, item 2.b of RFC 2911 says:
"the Associated Natural-Language parts match if the shorter of the two
meets the syntactic requirements of RFC 1766 [RFC1766] and matches byte
for byte with the longer. For example, 'en' matches 'en', 'en-us' and
'en-gb', but matches neither 'fr' nor 'e'."
In other words, "en" will match "en-Latn-US", but "en-US" will not. So
the script subtag will not break matching for all language tags after
all; it breaks it only in cases where both tags contain a region.
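To make the rule concrete, here is a rough Python sketch of the matching as I read the quoted text. It is not taken from any IPP implementation. The function name natural_language_match is hypothetical, and two points are my own assumptions: that tags are compared case-insensitively, and that the prefix must end on a subtag boundary (which is how I explain 'e' not matching 'en').

```python
import re

# RFC 1766 shape: hyphen-separated subtags of 1 to 8 ASCII letters each.
_RFC1766 = re.compile(r"^[A-Za-z]{1,8}(-[A-Za-z]{1,8})*$")


def natural_language_match(a: str, b: str) -> bool:
    """Prefix-match two language tags, per my reading of RFC 2911 item 2.b."""
    # Assumption: fold case before comparing.
    shorter, longer = sorted((a.lower(), b.lower()), key=len)
    # The shorter tag must itself be syntactically valid.
    if not _RFC1766.match(shorter):
        return False
    # Assumption: the shorter tag must end on a subtag boundary of the
    # longer one, so "e" is not treated as a prefix of "en".
    return longer == shorter or longer.startswith(shorter + "-")
```

Under those assumptions, natural_language_match("en", "en-Latn-US") is true, while natural_language_match("en-US", "en-Latn-US") is false, because "en-us-" is not a prefix of "en-latn-us".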
This strongly suggests to me that when we are considering adding
Suppress-Script values for up to 300 languages, we should focus
primarily on those languages that are most likely to be used with a
region subtag, and spend much less time worrying about the rest.
For example, it seems improbable to me that Hawaiian ("haw") exhibits
substantially different usage in different regions, such that tags of
the form "haw-US" or "haw-UM" would be likely to occur. That means --
as easy and uncontroversial as it would be -- we need not spend time
worrying about adding a Suppress-Script of "Latn" for Hawaiian. It
would be more productive to focus our attention on a language like
Santali, which is spoken in multiple regions, and for which a "default"
script assignment is not obvious.
Fullerton, California, USA