Language tags in IPP (was: Re: [Suppress-Script] Initial list of 300 languages)

Mon Mar 13 20:20:48 CET 2006

Hi Doug, Peter, Mark, et al.,

Actually RFC 2911 (IPP/1.1 Model and Semantics) is only one
of more than twenty-five IETF and IEEE-ISTO PWG standards that
now define IPP.  For the complete set, visit:

  http://www.pwg.org/ipp/index.html

In turn, the CIP4 JDF production printing job ticket standard
(supported in the last two major versions of Adobe Acrobat and
all high-end production printers), the Bluetooth Print Profile,
UPnPv1 and UPnPv2, and dozens of other printing standards
normatively reference these IPP documents for print semantics.

RFC 2911 didn't define the Document object and therefore
only defined a language tag AND charset tag for the external
attributes (the metadata with the print job).

IEEE-ISTO PWG 5100.5 "IPP Document Object" (October 2003)
defined 'document-natural-language' and 'document-charset'
metadata attributes to tag the _content_ of the print data.
Because newer documents transferred in a Unicode charset
(e.g., UTF-8) don't allow any meaningful guesses about
character repertoire from the charset tag, most printers
and application software for word processing have been
guessing repertoire based on the language tag.

Thus, IEEE-ISTO PWG 5101.2 "RepertoireSupported Element"
(February 2004) was created (including an IPP protocol
binding) to explicitly tag the character repertoire
needed to render a given print datastream correctly.

W3C XHTML-Print (developed by IEEE-ISTO PWG) only allows
the use of Unicode, and is being widely implemented in
PDAs, cellphones, etc.  Font foundries collaborated with
IEEE-ISTO PWG on this problem and developed PWG 5101.2.

But some "legacy" charsets share this problem with Unicode
(e.g., the Chinese and Japanese standard charsets).  So
guessing the repertoire from the charset, biased by the
language tag, remains the common method in printers and
word processing software.

Most newer print data language (PDL) specifications and
profiles (such as ISO and IEEE profiles of PDF) do NOT
allow ANY embedded fonts (because of portability and
long-term archive problems), so printing software will
increasingly depend on the EXTERNAL metadata (not in
the print datastream) to make rendering decisions.
This is perfectly appropriate behaviour.

PostScript internal language tags are actually rare in
usage, by the way.

Now - let's drop this whole topic, because teaching this
list how printing actually works under the covers is not 
feasible in a string of email notes.

Cheers,
- Ira

Ira McDonald (Musician / Software Architect)
Blue Roof Music / High North Inc
PO Box 221  Grand Marais, MI  49839
phone: +1-906-494-2434
email: imcdonald at sharplabs.com

> -----Original Message-----
> From: ietf-languages-bounces at alvestrand.no
> [mailto:ietf-languages-bounces at alvestrand.no]On Behalf Of Doug Ewell
> Sent: Sunday, March 12, 2006 7:14 PM
> To: ietf-languages at iana.org
> Subject: Language tags in IPP (was: Re: [Suppress-Script] Initial list
> of 300 languages)
> 
> 
> Ira McDonald <imcdonald at sharplabs dot com> wrote:
> 
> > For six years, almost all network printers have supported
> > IETF IPP/1.1 (RFC 2911), which DOES externally tag the
> > language of print data streams - the CUPS spooler that is
> > now ubiquitous in Linux distributions, standard in MacOS,
> > and common in commercial UNIX distributions uses IPP for
> > the print protocol.
> 
> I took a look at RFC 2911.  Wow, 224 pages.  And to think people 
> criticized the initial-registry draft because the list was 106 pages 
> long and we wanted it to be an RFC.
> 
> From a cursory reading (I didn't have time to slurp in the whole thing),
> it looks like IPP uses the language tag to determine the language in
> which internal attributes are expressed and status messages are to be
> issued.
> 
> It also seems to have some interrelationship with the character set of
> the print job, which seems wrong to me; figuring out which character
> repertoires are necessary for which natural languages is a decidedly
> non-trivial effort (ask Michael, who has done this work for the European
> languages).
> 
> > If you send your print job in Unicode (UTF-8 or UTF-16) to
> > your laser printer _and_ the printer has sufficient fonts
> > installed (for the necessary scripts), bad things won't
> > happen.  But if your print data is in a legacy charset
> > (like almost all existing documents in the world), then
> > bad things will begin to happen when unsupported 'script'
> > subtags are infixed in language tags.
> 
> Again, I'm not a fan of the idea of determining character repertoires on
> the basis of natural language.  And I'm disappointed if, today in 2006,
> it is still safe to assume that "almost all existing documents in the
> world" are not in UTF-8 or another Unicode character encoding.
> 
> Section 4.1.2.3, item 2.b of RFC 2911 says:
> 
> "the Associated Natural-Language parts match if the shorter of the two
> meets the syntactic requirements of RFC 1766 [RFC1766] and matches byte
> for byte with the longer.  For example, 'en' matches 'en', 'en-us' and
> 'en-gb', but matches neither 'fr' nor 'e'."
> 
> In other words, "en" will match "en-Latn-US", but "en-US" will not.  So
> the script subtag will not cause all language tags to break after all,
> only in cases where both contain a region.
> 
> This strongly suggests to me that when we are considering adding 
> Suppress-Script values for up to 300 languages, we should focus 
> primarily on those languages that are most likely to be used with a 
> region subtag, and spend much less time worrying about the rest.
> 
> For example, it seems improbable to me that Hawaiian ("haw") exhibits
> substantially different usage in different regions, such that tags of
> the form "haw-US" or "haw-UM" would be likely to occur.  That means --
> as easy and uncontroversial as it would be -- we need not spend time
> worrying about adding a Suppress-Script of "Latn" for Hawaiian.  It
> would be more productive to focus our attention on a language like
> Santali, which is spoken in multiple regions, and for which a "default"
> script assignment is not obvious.
> 
> --
> Doug Ewell
> Fullerton, California, USA
> http://users.adelphia.net/~dewell/ 
> 
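
The matching rule that Doug quotes from RFC 2911 section 4.1.2.3,
item 2.b can be sketched in Python roughly as follows.  This is only
an illustration under stated assumptions, not code from any IPP
implementation: the function name is hypothetical, the RFC 1766
syntax check is simplified to "two-letter, 'i' or 'x' primary tag
plus alphabetic subtags of 1 to 8 letters", tags are folded to
lowercase before comparison, and the prefix match is only accepted
at a hyphen boundary (the quoted text says just "byte for byte").

```python
import re

# Simplified RFC 1766 shape (an assumption of this sketch): a primary tag
# that is two letters, "i", or "x", followed by zero or more
# hyphen-separated subtags of 1 to 8 letters each.
_RFC1766_SHAPE = re.compile(r"(?:[a-z]{2}|i|x)(?:-[a-z]{1,8})*")


def natural_language_parts_match(a: str, b: str) -> bool:
    """Hypothetical helper: prefix-match two natural-language tags."""
    # Case folding is an assumption; the quoted rule compares bytes.
    a, b = a.lower(), b.lower()
    shorter, longer = sorted((a, b), key=len)

    # The shorter tag must itself look like a valid language tag.
    if not _RFC1766_SHAPE.fullmatch(shorter):
        return False

    # Equal tags match; otherwise the longer tag must continue with a
    # new subtag right after the shorter one (assumed hyphen boundary).
    return shorter == longer or longer.startswith(shorter + "-")
```

With this sketch, "en" matches "en", "en-us", "en-gb" and
"en-Latn-US", while "en" against "fr" or "e" fails, and "en-US"
against "en-Latn-US" fails because the script subtag sits between
the language and the region.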