Language Registrations needed for i-unknown and i-mixed

Wed Jan 21 04:02:04 CET 2004

My next two messages will contain language registrations for "i-mixed"
and "i-unknown".
These registrations are intended to address a real problem that I'm
currently having in complying with existing protocol specifications.
Some explanation will help illuminate the situation:
	What we do at http://weblogs.PubSub.com/ is generate custom,
synthetic RSS feeds. We scan about 100K feeds continuously and let
people "subscribe" to items in those feeds. (Thus, if you want to know
every time "(RSS OR ATOM) AND (BLOG OR FEED)" is mentioned in an RSS
feed, we can help you.) When we find a match, we insert it into a
custom RSS file being maintained for the subscriber. (In the future,
we'll support other kinds of "delivery": email, SOAP, XML-RPC, etc.)
Given that the collection of feeds we scan is multi-lingual, the
result is that we generate RSS feeds that contain items written in
multiple languages.
	The issue with our feeds is that we don't put <language> tags
in them. These tags are defined as optional in RSS V2.0, but there is
no question that having them improves the utility of a feed
significantly, and some people consider their absence to constitute a
"broken feed." Worse, some programs simply default to "English" when
they don't see a language tag, and this can result in unexpected
behavior.
	Our dilemma is that RSS appears to have been defined with the
assumption that all items in a feed would share a common language.
This is usually a good assumption when RSS is being used to syndicate
the content of a blog being maintained by a single person; however, it
doesn't work well when the feed is composed of items sourced from
thousands of other feeds. Ideally, we would have a <language> tag on
items -- not a single tag for the whole RSS file. Unfortunately, RSS
V2.0 -- like many other protocols -- doesn't define item-level
<language> tags... Now, clearly, we could define some new namespace
and create an item-level <language> tag of our own like
"<ps:language>". The difficulty with doing so is that this private tag
wouldn't achieve much more than wasting bandwidth since no known news
aggregator knows what to do with it. This is the case, of course, with
many "extensions" to XML formats... They work within small groups, but
are simply noise when the scope of usage expands since no one supports
them. Even if we did define a new tag for use in the XML-based RSS
format, we would still be faced with much the same problem when we
start sending content via email. The best we could do in that
situation is to define a "private use" language code like "x-unknown"
or "x-mixed"; however, this doesn't promise to be very useful.
	It has been suggested that we should do a scan of the
generated feed and determine what language is most commonly used in
the various items that have been collected. However, I don't think
this gets us to any place useful. The problem is that while this might
mean that the channel-level language tag is right for many items, it
will still be wrong for many other items. Also, this means that the
<language> for one of our RSS channels could be changing from minute
to minute as content of one language or another ebbs and flows into
the generated feed.
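	To make the objection concrete, here is a minimal Python sketch
of the "most common language" approach. The function name and the
sample item languages are hypothetical, made up purely for
illustration; this is not a description of any real implementation.

```python
from collections import Counter

def majority_language(item_languages):
    """Return the most common language code among the items, or None if empty."""
    if not item_languages:
        return None
    return Counter(item_languages).most_common(1)[0][0]

# Hypothetical languages of the items currently in one generated feed.
feed = ["en", "en", "fr"]
before = majority_language(feed)   # English is in the majority

# Two French items arrive and the channel-level answer flips.
feed += ["fr", "fr"]
after = majority_language(feed)    # French is now in the majority

# Items whose language disagrees with the new channel-level value.
mislabeled = sum(1 for lang in feed if lang != after)
```

Even in this tiny case the channel-level value changes as soon as two
items arrive, and two of the five items end up carrying the wrong
language.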
	Our interface allows people to create subscriptions that
restrict the content that is scanned for them to only those items that
are marked as being in some specific language. We should probably insert
<language> tags into such single-language feeds, but we are then still
left with the issue of what we should do for subscriptions that
specify "any language" as the content source...
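	One possible labeling policy can be sketched in Python as
follows. This is only a sketch under stated assumptions: the function
name, its parameters, and the exact rules are hypothetical, and the
"i-mixed" and "i-unknown" values are the proposed tags named at the top
of this message, not tags that are already registered.

```python
def channel_language(subscription_language, item_languages):
    """Pick a channel-level <language> value for a generated feed.

    subscription_language: the language code the subscriber restricted
        the subscription to, or None for "any language".
    item_languages: per-item language codes, with None where the
        language of an item is not known.
    """
    # Single-language subscription: the restriction itself is the answer.
    if subscription_language is not None:
        return subscription_language

    known = {lang for lang in item_languages if lang is not None}

    # Every item has a known language and they all agree.
    if len(known) == 1 and None not in item_languages:
        return next(iter(known))

    # Nothing is known about any item (assumption: use the proposed tag).
    if not known:
        return "i-unknown"

    # Anything else is treated as a mixture (assumption: proposed tag).
    return "i-mixed"
```

For example, channel_language(None, ["en", "fr", "ja"]) gives "i-mixed",
while channel_language("de", ["de", "de"]) simply gives "de".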
	In order to address the issue of "any language" subscriptions,
etc., I'm requesting that we be able to use "i-unknown" and/or
"i-mixed" when appropriate.
	Alternative solutions would be welcomed.

		bob wyman