Sunday Afternoon Thoughts on the Design of RSS Aggregators
by William Grosso
Related link: http://www.bloglines.com
The other day, at the Emerging Technology SIG, Doug Cutting gave a talk on Lucene and Nutch. Before the talk, Doug casually mentioned that he used a server-based RSS aggregator.
Similarly, in the responses to my blog entry on RSS Aggregators, someone mentioned that they use Bloglines.
This is interesting to me. In my mind (probably guided by the intuition that a "web browser is a client"), RSS Aggregators were naturally client-side. By which I mean, my first inclination was that RSS Aggregators naturally run on the end-user's machine, rather than on a centralized server farm. There are counterexamples, though. For example,
Bloglines is an RSS Aggregator that runs out there somewhere and returns your results as a web page (and, by the way, Scott Rosenberg likes Bloglines).
Which led me to spend some time pondering: what's the boundary line between a "standalone" application and a "server-based" application? That is, when should an application live entirely on an end-user's machine, and when should it live on a server and be accessed through a client program? (This distinction gets hazier in the case of RSS Aggregators, which are, in a loose sense, web clients anyway.)
The classic reasons for making an application a server-based application are:
- Application Load. The application has some memory or CPU requirement that makes end-user machines unsuitable. For example, an application that briefly requires 1 GB of memory for efficient processing of an intermediate data structure.
- Resource Sharing. The application enables users to effectively amortize the cost of some computational resource. For example, Google amortizes the cost of spidering and indexing.
- Data Sharing. Many users, or applications, are using the same data set. In addition to search engines (sharing the index), this is the classic database-driven application. In addition, things like an authentication server ("single signon") live here.
- Connectivity requirements. The app has to be there on a 24 x 7 basis, or some simulation thereof. E-mail servers shouldn't go offline (as end-user machines often do).
- Manageability (it's often easier to manage a data center whose configuration you control than it is to repeatedly deploy complex functionality on thousands of desktops).
- Accessibility. It's easier to access your information if it's stored in a central repository. It's easier to access an application if it's running on a server.
- Security. If some information needs to have access restricted, it's easier to manage that control centrally.
The classic reasons for making an application stand-alone are:
- Responsiveness. A local application has the potential for a better user experience. Any time you insert round-trips to a server, you add the potential for the user to wonder "What's it doing?"
- Application load. While the individual client might not need a lot of resources, the overhead of serving many clients can overwhelm a server-based design.
- Sheer performance. Some applications (read: games and complicated spreadsheets) are simply infeasible in a server-based model. This is actually a combination of the first two, but I think it deserves its own bullet point.
- Personal information. It's difficult to store large amounts of personal context on a server. If an application truly benefits from a large amount of personal context, then it's probably a standalone application.
- Security. The user might have qualms about storing personal data somewhere remote. In addition, a single security hole can compromise many people in one exploit.
- Standalone aspects. What if the machine isn't connected to the server? If someone is going to be intermittently connected, or in low-bandwidth situations, standalone might be the way to go.
Of course, I'm blurring the lines and ignoring fat clients that do more than provide a better GUI (e.g., ones which slide some "server" functionality over to the client). It's a simple list. And there's nothing in here about P2P applications, or about the ways in which the faster release cycles engendered by web-based applications can be a significant competitive advantage. But I still think it captures a lot of the considerations, and so I'd like to ask:
Did I miss anything in these lists?
Let's start by making the easy comparisons. From the end-user's perspective, the standalone approach has the following advantages:
- A richer user interface (although note that Tim Bray doesn't think this is obvious).
- Better performance on small feed sets. There's a caveat here: I've only played around with small feed sets on current applications (approximately 100 feeds) where the feeds get updated frequently. If you have a lot of feeds which are infrequently updated, the bandwidth spent re-fetching unchanged feeds might be significant (unless people are starting to use Last-Modified again, which would be nice).
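The Last-Modified trick mentioned above is just HTTP's conditional GET. As a sketch (not any particular aggregator's actual code): save the Last-Modified value from the previous fetch, send it back as If-Modified-Since, and a well-behaved server answers an unchanged feed with a tiny 304 response instead of the full document.

```python
# Sketch of polling a feed with a conditional GET, using only the
# standard library. The feed URL in any usage is a placeholder.
import urllib.request
import urllib.error

def conditional_request(url, last_modified=None):
    """Build a GET request that asks for the feed only if it changed."""
    req = urllib.request.Request(url)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req

def fetch_if_changed(url, last_modified=None):
    """Return (body, last_modified); body is None when nothing changed."""
    req = conditional_request(url, last_modified)
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read(), resp.headers.get("Last-Modified", last_modified)
    except urllib.error.HTTPError as err:
        if err.code == 304:  # 304 Not Modified: the feed is unchanged
            return None, last_modified
        raise
```

For a hundred feeds polled every half hour, most of which haven't changed, this is the difference between re-downloading everything and exchanging a few headers.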
From the developer's perspective, the standalone approach has the following advantages:
- No need to worry about scalability concerns.
- No need to create and administer a server farm.
- Better support from IDEs and other development tools.
From the end-user's perspective, the server-based approach has the following advantages:
- Location and OS transparency. You can use it from anywhere (or, at least, from any PC. There's not a lot of "use it from the cellphone" going on yet).
- Ability to use a customized browser (for example, one with advanced pop-up blocking, tabbed browsing, or searching functionality). Similarly, integration into the user's standard browser (ability to bookmark an article for later) seems like an advantage.
From the developer's perspective, the server-based approach has the following advantages:
- No need to worry about deployment of complex applications to uncontrolled environments.
- Ability to use large, server-side libraries and pieces of functionality.
- Fast release cycles. The ability to quickly modify and update code.
Applying the Server-Based / Standalone Bullet Points
With that out of the way, let's talk horse-racing. Given that you can build an RSS aggregator that's server-based or standalone, how do they compete with each other? How will they evolve?
How do you, as the designer of Bloglines, make your application compelling? Well, you want to build something that is a classic server-based application (because you're server-based, and it makes sense to leverage that). You want to add features that require resource sharing, data sharing, or connectivity (you've already got the accessibility thing nailed).
What do those look like? You might think connectivity's a nice one. If you can stay up 24 x 7, and you can cache RSS feeds, then people can find out about blogs which are currently off-line, but have changed. The problem is: this assumes the feed indicated a change, but then the site went off-line. And if a user is interested enough to wonder whether a feed changed, they might want to be able to fetch the article. Which means this isn't that big an advantage (the feed, or the site, being down is pretty much a bummer, unless your aggregator's going to cache a lot of data for people).
Data sharing? Well, there's potential here, in that the RSS feeds are fetched much less often from the source sites. This is a very good thing for authors with low-capacity servers and interesting weblogs. But it's not so compelling for the end-user, unless we run into a scenario where a significant percentage of weblogs are up but responding slowly. Or, in a scenario which is perhaps more likely, a significant percentage of weblogs decide to give higher priority to server-based RSS aggregators, on the theory that doing so will decrease their overall load.
Resource sharing? Here's where the server-based designs have a chance to shine. Bloglines has features like Top Blogs, Blog Recommendations, and the ability to subscribe to a search which are hard to imagine incorporating into a standalone design.
I think these resource sharing functions are the compelling advantage bloglines has. The interesting thing is, of course, that other applications which aren't RSS Aggregators (like Feedster) also offer some of them.
How about the other side? How do you, as the designer of FeedDemon, make your application compelling? Well, you want to build something that is a classic standalone application (because you're standalone). You want to add features that require significant application load or personal information, or that enable you to run even when you're not connected to the net (you've already got the performance thing nailed).
The last of these is the easiest-- it probably means building a local database and having a "fetch my web" feature for offline RSS browsing. Given that even the FeedDemon help is on-line right now (the help system sends you to online help pages), it would appear that this isn't a priority (in spite of the "work offline" button, which seems to simply prevent FeedDemon from attempting to talk to the world).
"Fetch my web" seems nice even when you're on-line too. Wouldn't it be great to improve the performance of the web by having a predictive cache? Of course it would. And by subscribing to feeds, I'm telling the web browser exactly how to build the cache. The software gets simpler, and better.
In slogan form: UI is Better than AI.
How about significant load or significant personal information? What could you add to an RSS Aggregator that would make it more useful along these lines? Well, the obvious thing is memory: suppose the RSS Aggregator not only knew about your feeds, it knew which articles you had fetched over time, and was somehow taking advantage of that big database of information. Suppose you could search the database for old blog articles (though, in a shameless personal plug, I'll point out that you can do this for Bloglines by incorporating the toolbar I helped build into your web browser)?
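To make the "memory" idea concrete, here is a toy sketch of indexing fetched articles locally for later search. Every name here is invented, and a real aggregator would use a proper full-text engine rather than a word-to-ids map; this is just the shape of the feature.

```python
# Toy sketch of an aggregator's "memory": a tiny inverted index over
# the text of every article the user has fetched.
from collections import defaultdict

class ArticleMemory:
    def __init__(self):
        self._index = defaultdict(set)  # word -> set of article ids
        self._articles = {}             # article id -> full text

    def remember(self, article_id, text):
        """Store an article and index every word in it."""
        self._articles[article_id] = text
        for word in text.lower().split():
            self._index[word].add(article_id)

    def search(self, word):
        """Return the ids of all remembered articles containing `word`."""
        return sorted(self._index.get(word.lower(), set()))
```

Note that this only works well standalone: the index grows with everything you've ever read, which is exactly the kind of deep personal context that's awkward to keep on someone else's server.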
Another point, which isn't necessarily client or server based, is that applications are platforms. By building a server-based application, and relying on a web browser for your client, you are doing two things: you are limiting the extensions that third parties can make to your application to browser-based plugins AND you are enabling the existing browser-based plugins to augment your application.
On the other hand, if you built a robust plug-in architecture into your standalone aggregator, it's possible that you could harness an intermediate-to-long-term competitive advantage: as RSS grows in importance, and we all believe it will, people will want to customize their RSS experience (on the other hand, you have to support a developer community. Uuugh).
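As a purely hypothetical sketch of what such a plug-in hook might look like (this is not FeedDemon's actual extension model; all names are invented): third-party code registers a callback, and the aggregator calls every registered plugin as each new article arrives.

```python
# Minimal sketch of a plug-in hook for a standalone aggregator.
class Aggregator:
    def __init__(self):
        self._plugins = []

    def register(self, plugin):
        """A plugin is any callable taking (feed_url, article_title)."""
        self._plugins.append(plugin)

    def on_new_article(self, feed_url, title):
        """Dispatch a newly fetched article to every registered plugin."""
        for plugin in self._plugins:
            plugin(feed_url, title)
```

A plugin that flags articles mentioning a keyword, or one that forwards titles to a Palm device, would then be a few lines registered against this hook rather than a fork of the aggregator.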
Are RSS aggregators naturally standalone or server-based? And where do P2P, the Semantic Web, and Worse is Better fit into all of this?
server or desktop-based aggregators
I'm a heavy Bloglines user. The trump card for me, to which I think you only obliquely referred, is the fact of life that I regularly use multiple machines to check RSS feeds (three regular desktops and two regular laptops). By definition, one of the things an aggregator needs to do is maintain persistent state across sessions (in order to know what's new vs. what I've already seen).
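The persistent state this commenter describes boils down to a set of already-seen item identifiers. A minimal sketch, assuming feed items carry stable GUIDs; keeping this one table on a server (or in any single shared store) is exactly what lets several machines agree on what's new:

```python
# Sketch of an aggregator's read-state store: a single table of the
# item GUIDs that have already been shown to the user.
import sqlite3

def open_state(path=":memory:"):
    """Open (or create) the read-state database."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS seen (guid TEXT PRIMARY KEY)")
    return db

def new_items(db, guids):
    """Return only the GUIDs not seen before, and mark them as seen."""
    fresh = [g for g in guids
             if db.execute("SELECT 1 FROM seen WHERE guid = ?",
                           (g,)).fetchone() is None]
    db.executemany("INSERT INTO seen (guid) VALUES (?)",
                   [(g,) for g in fresh])
    db.commit()
    return fresh
```

Run standalone, this file lives on one machine and the state fragments across your desktops and laptops; run server-side, every client sees the same answer to "what's new."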
Aggregate the Aggregators?
Good article - one area you missed (but commented on) is the need for RSS synchronization between multiple devices.
I frequently move from computer to computer, platform to platform, Palm to Phone to PC...
Server Side Syndication using RSS and Drupal
I use Drupal as my aggregator.
Yes I have Bloglines and I have tried Client Side applications, but they do not do for me what I want them to do.
I have 100 feeds running automatically in the background of my site Langemarks Café (http://www.langemark.com). They provide the raw material - the information "nuggets" - that I use to write my blog - which in turn is syndicated in RSS - in a categorized format - one RSS feed for each topic.
This is the way I envision much of the future use of syndication. You find feeds, AND you share them.
Anyone can subscribe to my blog, or to any combination of categories from my blog AND anyone can subscribe to the feeds that I have - AND to any combination of these feeds.
I could not do this using a client side aggregator - because my feeds would not be available to others or directly to my blog, and they would only be updated when my computer was on.
I could not do this by using a central aggregator service like Bloglines - though it is an excellent service - because it would not be "mine" and I would not have the control of it that I have now.
I acknowledge that there are a number of questions left unanswered. I do not have the answers to all of them. What about copyright? What about bandwidth usage?
But right now it really works for me.