Language tags in IPP (was: Re: [Suppress-Script] Initial list of 300 languages)

Mon Mar 13 01:13:51 CET 2006

Ira McDonald <imcdonald at sharplabs dot com> wrote:

> For six years, almost all network printers have supported
> IETF IPP/1.1 (RFC 2911), which DOES externally tag the
> language of print data streams - the CUPS spooler that is
> now ubiquitous in Linux distributions, standard in MacOS,
> and common in commercial UNIX distributions uses IPP for
> the print protocol.

I took a look at RFC 2911.  Wow, 224 pages.  And to think people 
criticized the initial-registry draft because the list was 106 pages 
long and we wanted it to be an RFC.

From a cursory reading (I didn't have time to slurp in the whole thing), 
it looks like IPP uses the language tag to determine the language in 
which internal attributes are expressed and status messages are to be 
issued.

It also appears to tie the language tag to the character set of the 
print job, which seems wrong to me; figuring out which character 
repertoires are necessary for which natural languages is a decidedly 
non-trivial effort (ask Michael, who has done this work for the European 
languages).

> If you send your print job in Unicode (UTF-8 or UTF-16) to
> your laser printer _and_ the printer has sufficient fonts
> installed (for the necessary scripts), bad things won't
> happen.  But if your print data is in a legacy charset
> (like almost all existing documents in the world), then
> bad things will begin to happen when unsupported 'script'
> subtags are infixed in language tags.

Again, I'm not a fan of the idea of determining character repertoires on 
the basis of natural language.  And I'm disappointed if, today in 2006, 
it is still safe to assume that "almost all existing documents in the 
world" are not in UTF-8 or another Unicode character encoding.

Section 4.1.2.3, item 2.b of RFC 2911 says:

"the Associated Natural-Language parts match if the shorter of the two 
meets the syntactic requirements of RFC 1766 [RFC1766] and matches byte 
for byte with the longer.  For example, 'en' matches 'en', 'en-us' and 
'en-gb', but matches neither 'fr' nor 'e'."

In other words, "en" will match "en-Latn-US", but "en-US" will not.  So 
the script subtag will not break matching for all language tags after 
all; it breaks only in cases where both tags contain a region.
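
A minimal sketch of that matching rule, as I read the quoted text, may 
make the point concrete.  The function names are hypothetical, and three 
things are assumptions rather than anything RFC 2911 spells out here: 
the RFC 1766 syntax check is reduced to "1-8 letters per subtag, primary 
subtag at least two letters" (enough to reject 'e'), the comparison is 
done on lowercased tags, and the shorter tag must end on a subtag 
boundary of the longer one.

```python
import re

# Simplified stand-in for the RFC 1766 syntax check (an assumption):
# each subtag is 1-8 ASCII letters, and the primary subtag has at
# least two letters, so that 'e' is rejected as in the quoted example.
_SUBTAG = re.compile(r"^[A-Za-z]{1,8}$")


def is_plausible_tag(tag):
    """Return True if tag passes the simplified syntax check."""
    parts = tag.split("-")
    return len(parts[0]) >= 2 and all(_SUBTAG.match(p) for p in parts)


def natural_language_match(a, b):
    """Prefix-match two language tags in the style of the quoted rule."""
    # Lowercasing before comparing is an assumption of this sketch.
    shorter, longer = sorted((a.lower(), b.lower()), key=len)
    if not is_plausible_tag(shorter):
        return False
    if shorter == longer:
        return True
    # Require the shorter tag to end at a subtag boundary (assumption).
    return longer.startswith(shorter + "-")


for pair in [("en", "en-us"), ("en", "fr"), ("en", "e"),
             ("en", "en-Latn-US"), ("en-US", "en-Latn-US")]:
    print(pair, natural_language_match(*pair))
```

Under these assumptions, only the last pair is affected by the infixed 
script subtag: "en-US" is not a prefix of "en-Latn-US".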

This strongly suggests to me that when we are considering adding 
Suppress-Script values for up to 300 languages, we should focus 
primarily on those languages that are most likely to be used with a 
region subtag, and spend much less time worrying about the rest.

For example, it seems improbable to me that Hawaiian ("haw") exhibits 
substantially different usage in different regions, such that tags of 
the form "haw-US" or "haw-UM" would be likely to occur.  That means --  
as easy and uncontroversial as it would be -- we need not spend time 
worrying about adding a Suppress-Script of "Latn" for Hawaiian.  It 
would be more productive to focus our attention on a language like 
Santali, which is spoken in multiple regions, and for which a "default" 
script assignment is not obvious.

--
Doug Ewell
Fullerton, California, USA
http://users.adelphia.net/~dewell/