RSS

I’ve been developing web applications built on the RSS platform for just under two years now and I gotta say - sometimes it is painful. For a platform whose acronym means “Really Simple Syndication” and is built upon the idea of easily publishing and consuming content - it is surprisingly difficult to work on. Unfortunately there is no one person or company to go to and say, fix this.

The problem lies in the hands of many - from publishing platform providers (a la TypePad, WordPress, Blogger) to the content providers (you and me). We’ve been working on an Iowa blog network, Iowa Blogs (yes, that’s a sneak peek), for a little while now, along with some other communities and RSS-based applications we are developing. Here are some of the most common problems with using syndication platforms as platforms for web applications (web 3.0?).

—-

1. Feeds Are Invalid
For the latest version of Iowa Blogs we are offering ‘Featured’ content to accompany the directory that we have already established. So, as acting editor and developer, it was my job to filter through the directory of Iowa Bloggers to find the blogs to feature - based on many criteria. For each ‘community’ I created an OPML file on which the application would be built.

Thinking nothing of it, I threw the OPML file at my web application, only to find that it broke it. After hours and hours of debugging my code trying to find the problem, I found it - an invalid RSS feed in the OPML file. My solution? Remove it from the OPML file. I have to say I was surprised by the number of feeds that were invalid. In 2003 Mark Pilgrim noted that around 10% of feeds are malformed. I bet in the last four years that number has at least doubled.

My advice to you? Please, please, please make sure your syndication (RSS, Atom, RDF) feed is valid. Go get your feed a checkup at FeedValidator. Many things can cause an invalid feed - especially the tool you use to post. For you techies - make sure to use properly encoded HTML, valid XML entities, and the right character encoding (UTF-8).
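For the techies on the consuming side, here is a minimal sketch of how an aggregator might keep one malformed feed from breaking everything else. It uses only the Python standard library. The function name `split_well_formed` and the sample feeds are made up for illustration. Note that being well-formed XML is a much weaker test than what FeedValidator checks, so treat this as a first line of defense only.

```python
import xml.etree.ElementTree as ET


def split_well_formed(feeds):
    """Split {url: xml_text} into (parsed, broken) without raising.

    `parsed` maps url -> root element; `broken` maps url -> error message.
    """
    parsed, broken = {}, {}
    for url, xml_text in feeds.items():
        try:
            parsed[url] = ET.fromstring(xml_text)
        except ET.ParseError as exc:
            # One bad feed should not take down the whole OPML list.
            broken[url] = str(exc)
    return parsed, broken


# Hypothetical sample data: the second feed has an unescaped ampersand.
SAMPLE_FEEDS = {
    "http://example.com/good.xml":
        "<rss version='2.0'><channel><title>Good</title></channel></rss>",
    "http://example.com/bad.xml":
        "<rss version='2.0'><channel><title>Salt & Pepper</title></channel></rss>",
}

parsed, broken = split_well_formed(SAMPLE_FEEDS)
print("usable:", sorted(parsed))
print("skipped:", sorted(broken))
```

Collecting the broken feeds instead of silently dropping them also gives you a list of publishers to nudge.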

2. Date Standard
Part of the ‘hacking’ I’ve done is mashing up feeds in a ‘News River’ fashion, which requires ordering items chronologically. Sounds easy, right? That’s what I thought until my code, again, kept breaking. This is a problem Dan York discovered (not that no one else had) while working with Yahoo! Pipes. In his post, ‘Yahoo!Pipes and its dating problem… (and a failure of RSS standardization)’, he noticed what he called (and I agree) a fundamental problem.

If you dig down into the actual RSS feed, you’ll see the fundamental problem faced by Yahoo (or anyone else trying to mash up different RSS feeds). Here is the date associated with an entry from Disruptive Telephony, a TypePad blog:

pubDate 2007-03-05T14:37:34-05:00

Here’s the date from an entry from Voice of VOIPSA, a WordPress blog:

pubDate Mon, 05 Mar 2007 16:14:52 +0000

Here’s the date from an entry from my LiveJournal account:

pubDate 2007-03-01T00:00:00-06:00

Here’s the date from a RSS feed item from Twitter:

pubDate Mon, 05 Mar 2007 19:48:48 +0000

Here’s the date from a RSS feed entry from Blue Box: The VoIP Security Podcast, also a TypePad blog:

pubDate Thu, 22 Feb 2007 22:39:48 -0600

If you’re non-technical, here’s the problem - the dates are in different formats. Some are RFC 822 style (Mon, 05 Mar 2007 16:14:52 +0000) and others are ISO 8601 style (2007-03-05T14:37:34-05:00). Not only that, there’s no standard way of denoting the date within the RSS feed. Some use pubDate, some use dc:date, etc. So when mashing up feeds from different platforms (WordPress, TypePad, Blogger), which in turn come in differing formats (RSS, Atom, RDF), we start to have a problem.
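One way to ‘code around’ this is to try both date styles and convert everything to UTC before sorting. The sketch below uses the dates quoted above and only the Python standard library. The function name `parse_feed_date` is my own, and treating a date with no offset as UTC is an assumption, not something any of the specs promise.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_feed_date(text):
    """Parse an RFC 822 or ISO 8601 date string into an aware UTC datetime."""
    text = text.strip()
    try:
        # RFC 822 style, e.g. "Mon, 05 Mar 2007 16:14:52 +0000"
        parsed = parsedate_to_datetime(text)
    except (TypeError, ValueError):
        # ISO 8601 style, e.g. "2007-03-05T14:37:34-05:00".
        # Older Pythons do not accept a trailing "Z", so rewrite it.
        if text.endswith("Z"):
            text = text[:-1] + "+00:00"
        parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # Assumption: a date without an offset is treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


# The five dates quoted above, keyed by where they came from.
QUOTED_DATES = {
    "Disruptive Telephony": "2007-03-05T14:37:34-05:00",
    "Voice of VOIPSA": "Mon, 05 Mar 2007 16:14:52 +0000",
    "LiveJournal": "2007-03-01T00:00:00-06:00",
    "Twitter": "Mon, 05 Mar 2007 19:48:48 +0000",
    "Blue Box": "Thu, 22 Feb 2007 22:39:48 -0600",
}

# Newest first, the way a news river would show them.
river = sorted(QUOTED_DATES, key=lambda name: parse_feed_date(QUOTED_DATES[name]),
               reverse=True)
print(river)
```

This only covers the two styles shown here. Real feeds contain plenty of dates that match neither, so a production aggregator would still need a fallback for the ones that fail to parse.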

3. Dead Feeds
In some cases when a blog is discontinued (I think Blogger is the culprit), the request for the feed no longer responds. At least give a response saying there is no content instead of making our applications break. Needless to say, it’s another issue I had to ‘code around’. For you platform providers - please use the appropriate HTTP status codes (a 410 Gone for a feed that has been removed, for example) to let us know what happened to the item we are requesting.
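To show why the status codes matter, here is a rough sketch of how an aggregator could decide what to do with a feed based on the response it gets. The function name `classify_feed_status` and the action labels are hypothetical, and the policy (for example, retrying on a 404 but unsubscribing on a 410) is just one reasonable choice.

```python
from http import HTTPStatus


def classify_feed_status(status):
    """Map an HTTP status code to an action for a hypothetical aggregator."""
    if status == HTTPStatus.OK:
        return "parse"
    if status == HTTPStatus.NOT_MODIFIED:
        # Nothing new since the last fetch; reuse what we already have.
        return "use-cache"
    if status in (HTTPStatus.MOVED_PERMANENTLY, HTTPStatus.PERMANENT_REDIRECT):
        # The feed has a new home; remember the new URL.
        return "update-url"
    if status == HTTPStatus.GONE:
        # The publisher says the feed is gone for good.
        return "unsubscribe"
    if status == HTTPStatus.NOT_FOUND or 500 <= status < 600:
        # Might be temporary, so try again on the next run.
        return "retry-later"
    return "skip"


for code in (200, 304, 301, 410, 404, 503):
    print(code, classify_feed_status(code))
```

None of this works if the server never answers at all, which is exactly the complaint above.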

4. Where’s the content?
As with the dates discussed above, different platforms and different specifications use varying ways to denote the actual content of a post. The content can be found in either the content:encoded tag or the description tag. Some feeds have one or the other and some have both. Arve Bersvendsen wrote a post a few years ago describing the various ways content exists within a syndicated feed - just on the RSS platform. To take it even further, once you find the content it is often invalid markup.

My preference is to have both description and content. Description should contain a 200-300 character summary of the post while content should contain the entire post.
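Until everyone does that, the consumer has to go looking. Here is a minimal sketch that prefers content:encoded and falls back to description. It assumes RSS 2.0 items (where description has no namespace) and the usual namespace URI for the content module. The helper name `item_body` and the sample feed are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the RSS content module (the "content:" prefix).
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"


def item_body(item):
    """Return the fullest text available for an RSS 2.0 <item> element."""
    encoded = item.findtext(f"{{{CONTENT_NS}}}encoded")
    if encoded and encoded.strip():
        return encoded.strip()
    # Fall back to the summary; an item may have neither.
    return (item.findtext("description") or "").strip()


# Hypothetical feed with all three situations.
SAMPLE = """<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
<item><title>Both</title><description>Short summary</description>
<content:encoded><![CDATA[<p>Full post</p>]]></content:encoded></item>
<item><title>Summary only</title><description>Only a description</description></item>
<item><title>Neither</title></item>
</channel></rss>"""

bodies = [item_body(item) for item in ET.fromstring(SAMPLE).iter("item")]
print(bodies)
```

What comes back may still be invalid markup, so it should be cleaned before it is shown on a page.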

5. Haven’t I Seen this Before?
We’ve all seen duplicate posts in feeds, which sucks for someone developing an aggregator. This can happen for many, many reasons, but there are ways you, the publisher, can help solve it. If you have to re-post, please use the same title. Also, each format has a way to let the aggregator know an entry is distinct. In RSS 1.0 it is the rdf:about attribute; in RSS 2.0 it is the guid; in Atom it is the id.
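On the aggregator side, those identifiers make duplicate filtering fairly simple. The sketch below assumes entries have already been parsed into plain dicts. The key names (`guid`, `id`, `rdf:about`, `title`) and the function names are my own choices for illustration, and falling back to the title when no identifier exists is a guess that can misfire when two different posts share a title.

```python
def entry_key(entry):
    """Pick the most reliable identity available for a parsed entry (a dict)."""
    for field in ("guid", "id", "rdf:about"):
        value = (entry.get(field) or "").strip()
        if value:
            return ("id", value)
    # No identifier at all: fall back to a normalized title.
    title = " ".join((entry.get("title") or "").lower().split())
    return ("title", title)


def dedupe(entries):
    """Keep the first occurrence of each entry, preserving order."""
    seen = set()
    unique = []
    for entry in entries:
        key = entry_key(entry)
        if key in seen:
            continue
        seen.add(key)
        unique.append(entry)
    return unique


# Hypothetical parsed entries.
ENTRIES = [
    {"guid": "http://example.com/?p=1", "title": "Hello"},
    {"guid": "http://example.com/?p=1", "title": "Hello (updated)"},
    {"id": "tag:example.com,2007:2", "title": "Atom post"},
    {"title": "No id here"},
    {"title": "  no ID here "},
]

unique_titles = [entry["title"] for entry in dedupe(ENTRIES)]
print(unique_titles)
```

The second entry is dropped even though its title changed, which is the whole point of a stable guid.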

6. Timeout
Another thing I hate is sitting and waiting. What am I waiting for? Your feed to load. This is especially an issue when generating a news river of several mashed-up feeds. If the request for your feed times out, my application breaks - another thing I get to ‘code around’. You can help solve this by taking the proper steps to cache your feeds. Or just let FeedBurner do the work.
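For what it’s worth, this is roughly what ‘coding around’ a slow feed looks like on my end: set a timeout, and if the request fails, fall back to the last good copy. It is a minimal sketch using the Python standard library. The function name `fetch_feed`, the in-memory `cache` dict, and the five-second default are assumptions for illustration, and the `opener` parameter exists only so the function can be exercised without a network.

```python
import urllib.request


def fetch_feed(url, cache, timeout=5.0, opener=urllib.request.urlopen):
    """Fetch a feed body, falling back to the last good copy on failure.

    `cache` is a plain dict mapping url -> bytes of the last successful fetch.
    Returns None when the fetch fails and nothing is cached.
    """
    try:
        with opener(url, timeout=timeout) as response:
            body = response.read()
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return cache.get(url)
    cache[url] = body
    return body
```

A news river built from many feeds would call this once per feed and skip any that return None, so one slow publisher cannot stall the whole page.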

—-

There are several inherent problems with using RSS as a platform and we’ve just begun to discover them. These are things we all need to work on to help make the platform grow and maintain its viability as the next generation of content syndication.
