Spidering Hacks
By Kevin Hemenway, Tara Calishain
October 2003

HACK
#69
Aggregating RSS and Posting Changes
With the proliferation of individual and group weblogs, it's typical for one person to post in multiple places. Thanks to RSS syndication, you can easily aggregate all your disparate posts into one weblog.

You might have heard of RSS. It's an XML format that's commonly used to syndicate headlines and content between sites. It's also used in specialty software programs called headline aggregators or readers. Many popular weblog software packages, including Movable Type (http://www.movabletype.org) and Blogger (http://www.blogger.com), offer RSS feeds. So too do some of the content management systems—Slashcode (http://slashcode.com), PHPNuke (http://phpnuke.org), Zope (http://www.zope.org), and the like—that run some of the more popular tech news sites.

If you produce content for various people, you might find your writing and commentary scattered all over the place. Or, say you have a group of friends and all of you want to aggregate your postings into a single place without abandoning your individual efforts. This hack is a personal spider just for you; it aggregates entries from multiple RSS feeds and posts those new entries to a Movable Type blog.

Running the Hack

To run the code, you'll need a Movable Type weblog. At the very least, you need the username, password, XML-RPC URL for Movable Type, and the blog ID (normally 1 if you have only one). Here's an example of connecting to Kevin's Movable Type installation to show a list of categories to post to (the --showcategories switch is, strangely enough, showing the categories):

% perl myrssmerger.pl -s http://disobey.com/cgi-bin/mt/mt-xmlrpc.cgi [RETURN]
-u morbus -p HAAHAHAH -b 1 --showcategories

The output looks like this:

----------------------------------------------------------------------
 The following blog categories are available:

 1: Disobey Stuff
 2: The Idiot Box
 3: CHIApet
 4: Friends O' Disobey
 5: Stalkers O' Morbus
 6: Morbus Shoots, Jesus Saves
 7: El Casho Disappearo
 8: TechnOccult
 9: Potpourri
 10: Collected Nonsensicals

Category ID's can be used for --catid or -c.
----------------------------------------------------------------------

If you have no categories, you'll be told as such. When you're actually posting to the blog, you can choose to post into a category or not; if you want to post into Disobey Stuff, use either -c 1 or --catid 1 when you run the program. If you want no category, specify no category.

Let's take a look at a few examples of how to use the script. Say Kevin wants to aggregate all the data from all the places he publishes information. Every night he'll use cron to run the script for various RSS feeds. Here's an example:

% perl myrssmerger.pl --server [RETURN]
http://disobey.com/cgi-bin/mt/mt-xmlrpc.cgi [RETURN]
--username morbus --password HAAHAHAH --blogid 1 --catid 1 [RETURN]
http://gamegrene.com/index.xml

In this case, he's saying, "Every night, check the Gamegrene RSS files for entries posted today. If you see any, post them to Disobey Stuff" (which is the first category, referenced with the --catid 1 switch). He can then run the script again, only for a different RSS feed with a different category switch, and so on. Let's take a look at the output of the Gamegrene example:

----------------------------------------------------------------------
Downloading RSS feed at http://gamegrene.com/index.xml...
 Publishing item: 'RPG, For Me'.
 Skipping (failed date check): 'Just Say No To Powergamers'.
 Skipping (failed date check): 'Every Story Needs A Soundtrack'.
 Skipping (failed date check): 'The Demise of Local Game Shops'.
 Skipping (failed date check): 'Death Of A Gaming System'.
 Skipping (failed date check): 'What Do You Do With Six Million Elves?'.
----------------------------------------------------------------------

As you can see, the script checks the dates in the RSS feed to make sure they're new before the items are added to the Movable Type weblog. Dates are determined from the <dc:date> entry in the remote RSS URL; if the feed doesn't have them, the script won't function correctly.
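
To make the date check concrete, here is a minimal sketch of the comparison on its own. The helper name is_dated and the sample items are hypothetical and are not part of myrssmerger.pl; the sketch also builds today's date with POSIX's strftime rather than by hand, and treats a missing date as a non-match.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime);

# Return true if a <dc:date> value (for example,
# "2003-10-01T09:30:00-05:00") begins with the given
# YYYY-MM-DD string. A missing date never matches.
# (is_dated is a hypothetical helper, not part of the hack.)
sub is_dated {
    my ($dc_date, $day) = @_;
    return 0 unless defined $dc_date;
    return $dc_date =~ /^\Q$day\E/ ? 1 : 0;
}

# today's date, in the same YYYY-MM-DD form the script uses.
my $today = strftime("%Y-%m-%d", localtime);

# made-up items standing in for parsed feed entries.
my @items = (
    { title => 'Posted today',   date => "${today}T09:30:00-05:00" },
    { title => 'Posted earlier', date => '2003-05-01T12:00:00-05:00' },
    { title => 'No date at all' },
);

foreach my $item (@items) {
    my $verdict = is_dated($item->{date}, $today) ? 'Publishing' : 'Skipping';
    print " $verdict: '$item->{title}'.\n";
}
```

Because the comparison is a simple prefix match, it relies on the feed writing its dates with the year first, as in the example values above; a feed that formats dates differently would never match.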

What happens when you want to check many RSS feeds but you want to add them all to the same category? You can do that by running the script one time. Say you want to check three different RSS feeds, not necessarily all yours. Here's an example of Kevin checking three feeds (including Tara's) and adding new additions to the category:

% perl myrssmerger.pl --server [RETURN]
http://disobey.com/cgi-bin/mt/mt-xmlrpc.cgi [RETURN]
--username morbus --password HAAHAHAH --blogid 1 --catid 4 [RETURN]
http://gamegrene.com/index.xml http://researchbuzz.com/researchbuzz.rss [RETURN]
http://camworld.com/index.rdf

The shortened output looks like this:

----------------------------------------------------------------------
Downloading RSS feed at http://gamegrene.com/index.xml...
 Skipping (failed date check): 'RPG, For Me'.
 Skipping (failed date check): 'Just Say No To Powergamers'.
 Skipping (failed date check): 'Every Story Needs A Soundtrack'.
----------------------------------------------------------------------
Downloading RSS feed at http://camworld.com/index.rdf...
 Publishing item: 'Trinity's Hack from Matrix Reloaded'.
 Skipping (failed date check): 'Siberian Desktop'.
 Skipping (failed date check): 'The Sweet Hereafter'.
----------------------------------------------------------------------
Downloading RSS feed at http://researchbuzz.com/researchbuzz.rss...
 Skipping (no description/date): 'Northern Light Coming Back?'.
 Skipping (no description/date): 'This Week in LLRX'.
----------------------------------------------------------------------

Note that Tara's feed can't be used by this script; she generates her RSS by hand, and her feed doesn't have dates. Most program-generated feeds, like those of Movable Type, have dates and descriptions and will be just fine.

As you can see, we can choose a variety of feeds to use and we can post them to any of our Movable Type categories. Is there anything else this script can do? Well, actually, yes; it can filter incoming entries that match a specified keyword. To do that, use the --filter switch. As an example, this script posts only those entries whose descriptions include the string "perl":

% perl myrssmerger.pl --server [RETURN]
http://disobey.com/cgi-bin/mt/mt-xmlrpc.cgi [RETURN]
--username morbus --password HAAHAHAH --blogid 1 --catid 4 --filter "perl" [RETURN]
http://camworld.com/index.rdf
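
The script interpolates the --filter value into a case-insensitive regular expression, so the filter can be a pattern and not just a plain word. The following sketch isolates that test; passes_filter is a hypothetical helper name, and the descriptions are made-up sample data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return true if the description passes the filter. With no
# filter, everything passes, much as in the script itself.
# (passes_filter is a hypothetical helper, not part of the hack.)
sub passes_filter {
    my ($description, $filter) = @_;
    return 1 unless defined $filter && length $filter;
    return 0 unless defined $description;
    return $description =~ /$filter/i ? 1 : 0;
}

# made-up descriptions paired with filters to try.
my @tests = (
    [ 'Learning Perl, 3rd Edition', 'perl'           ],
    [ 'A new Python release',       'perl|python'    ],
    [ 'Notes on C++ templates',     quotemeta('c++') ],
    [ 'Gardening tips for spring',  'perl'           ],
);

foreach my $test (@tests) {
    my ($description, $filter) = @$test;
    my $verdict = passes_filter($description, $filter)
                ? 'Publishing' : 'Skipping (failed filter)';
    print " $verdict: '$description'.\n";
}
```

Since the filter is treated as a pattern, a value containing regular expression metacharacters (such as the plus signs in "c++") needs escaping; quotemeta is one way to do that inside Perl, and on the command line you would backslash the characters yourself.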

Hacking the Hack

Actually, this is both a "hacking the hack" and "some things to consider" section. Right now, the biggest downside is that this hack works only on Movable Type. You could dive into Net::Blogger a bit and make it usable by Blogger (http://www.blogger.com), Radio Userland (http://radio.userland.com/), or any one of the other weblogging platforms.

This script is designed to run once a day: it downloads each RSS feed in full every time, and it compares item dates only against today's date. As it stands, you should probably run it just once a day, for two reasons:

  • If you run the script more than once a day, you might have bandwidth issues running the script and downloading full RSS files too often.

  • The more often you run the script, the more often you're going to post repetitive items.
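
One possible way around the repetition problem, sketched here as an assumption rather than as part of the original script, is to remember the link of every item you post in a small text file and skip any link you've seen before. The helper names load_seen and remember are hypothetical, and the demo uses a temporary file where a real run would use a permanent one.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read previously posted links (one per line) into a hash.
# (load_seen and remember are hypothetical helpers.)
sub load_seen {
    my ($file) = @_;
    my %seen;
    return \%seen unless -e $file;
    open(my $fh, '<', $file) or die "[err] Can't read $file: $!\n";
    while (my $line = <$fh>) {
        chomp $line;
        $seen{$line} = 1 if length $line;
    }
    close $fh;
    return \%seen;
}

# Append a newly posted link to the file and to the hash.
sub remember {
    my ($file, $seen, $link) = @_;
    open(my $fh, '>>', $file) or die "[err] Can't write $file: $!\n";
    print {$fh} "$link\n";
    close $fh;
    $seen->{$link} = 1;
}

# demo: a temporary file stands in for a permanent "seen" file.
my ($tmp_fh, $file) = tempfile(UNLINK => 1);
close $tmp_fh;

my $seen = load_seen($file);
foreach my $link ('http://example.com/a', 'http://example.com/b',
                  'http://example.com/a') {
    if ($seen->{$link}) {
        print " Skipping (already posted): $link\n";
        next;
    }
    print " Publishing item: $link\n";
    remember($file, $seen, $link);
}
```

In the script itself, the natural place for such a check would be just before the newPost call, keyed on $item->{link}. This assumes each item's link is unique and stable, which isn't guaranteed for every feed.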

All right, let's talk about a couple of actual hacks. First is error checking; as is, the script doesn't check the URLs to make sure they start with http://. That's easily solved; just add a check on $rss_url at the top of the loop:

# loop through each RSS URL.
foreach my $rss_url (@ARGV) {

    # not an HTTP URL.
    next unless $rss_url =~ m!^http://!;

    # download whatever we've got coming.

Next, the preface and the anteface (i.e., the text that surrounds the posted entry) are hardcoded into the script, but we can change that via a switch on the command line. First make the preface and anteface command-line options:

GetOptions(\%opts, 'server|s=s',      # the XML-RPC URL to use.
                   'username|u=s',    # the weblog username to use.
                   'password|p=s',    # the weblog password to use.
                   'blogid|b=i',      # unique ID of your blog.
                   'catid|c=i',       # unique ID for posting category.
                   'showcategories',  # list categories for blog.
                   'filter|f=s',      # per item filter for posting?
                   'preface|r=s',     # the text included before a posted item.
                   'anteface|a=s',    # the text included after a posted item.
);

You'll then need to make a change to the preface line:

my $preface = $opts{preface} || "From <a href=\"$clink\">$ctitle</a>:\n\n<blockquote>";

and a similar change to the anteface line:

my $anteface = $opts{anteface} 
    || "</blockquote>\n\n"; # new items as quotes.
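
To see how the two new options behave, here is a small standalone sketch. It is a simplification under stated assumptions: wrap_item is a hypothetical helper that leaves out the creator and "Read more" link the real script adds, and the argument list is made up to stand in for a real command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Sandwich an item's text between the preface and anteface,
# falling back to the script's defaults when neither is given.
# (wrap_item is a hypothetical helper, not part of the hack.)
sub wrap_item {
    my ($opts, $clink, $ctitle, $body) = @_;
    my $preface  = $opts->{preface}
        || "From <a href=\"$clink\">$ctitle</a>:\n\n<blockquote>";
    my $anteface = $opts->{anteface} || "</blockquote>\n\n";
    return "$preface$body$anteface";
}

# a made-up argument list standing in for the command line.
my @args = ('--preface', '<p class="quoted">', '--anteface', "</p>\n");
my %opts;
GetOptionsFromArray(\@args, \%opts, 'preface|r=s', 'anteface|a=s')
    or die "[err] Could not parse options.\n";

# with the switches, then with the built-in defaults.
print wrap_item(\%opts, 'http://gamegrene.com/', 'Gamegrene', 'An item.');
print wrap_item({},     'http://gamegrene.com/', 'Gamegrene', 'An item.');
```

Note that the link and title of the source feed appear only in the default preface. If you pass your own --preface, the "From ..." attribution disappears unless you add something equivalent yourself.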

The Code

You'll need LWP::Simple, Net::Blogger, and XML::RSS to use this. Save the following code to a file named myrssmerger.pl:

#!/usr/bin/perl -w
#
# MyRSSMerger - read multiple RSS feeds, post new entries to Movable Type.
# http://disobey.com/d/code/ or contact morbus@disobey.com.
#
# This code is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
#

use strict; $|++;
my $VERSION = "1.0";
use Getopt::Long;
my %opts;

# make sure we have the modules we need, else die peacefully.
eval("use LWP::Simple;");  die "[err] LWP::Simple not installed.\n" if $@;
eval("use Net::Blogger;"); die "[err] Net::Blogger not installed.\n" if $@;
eval("use XML::RSS;");    die "[err] XML::RSS not installed.\n" if $@;

# define our command line flags (long and short versions).
GetOptions(\%opts, 'server|s=s',      # the XML-RPC URL to use.
                   'username|u=s',    # the weblog username to use.
                   'password|p=s',    # the weblog password to use.
                   'blogid|b=i',      # unique ID of your blog.
                   'catid|c=i',       # unique ID for posting category.
                   'showcategories',  # list categories for blog.
                   'filter|f=s',      # per item filter for posting?
);

# at the very least, we need our login information.
die "[err] XML-RPC URL missing, use --server or -s.\n" unless $opts{server};
die "[err] Username missing, use --username or -u.\n"  
    unless $opts{username};
die "[err] Password missing, use --password or -p.\n"  
    unless $opts{password};
die "[err] BlogID missing, use --blogid or -b.\n"      unless $opts{blogid};

# every request past this point requires
# a connection, so we'll go and do so.
print "-" x 76, "\n"; # visual separator.
my $mt = Net::Blogger->new(engine=>"movabletype");
$mt->Proxy($opts{server});       # the servername.
$mt->Username($opts{username});  # the username.
$mt->Password($opts{password});  # the... ok. self-
$mt->BlogId($opts{blogid});      # explanatory!

# show existing categories.
if ($opts{showcategories}) {

    # get the list of categories from the server.
    my $cats = $mt->mt()->getCategoryList(  )
      or die "[err] ", $mt->LastError(  ), "\n";

    # and print 'em.
    if (scalar(@$cats) > 0) {
        print "The following blog categories are available:\n\n";
        foreach (sort { $a->{categoryId} <=> $b->{categoryId} } @$cats) {
            print " $_->{categoryId}: $_->{categoryName}\n";
        }
    } else { print "There are no selectable categories available.\n"; }

    # done with this request, so exit.
    print "\nCategory ID's can be used for --catid or -c.\n";
    print "-" x 76, "\n"; exit; # call me again, again!

}

# now, check for passed URLs for new-item-examination.
die "[err] No RSS URLs were passed for processing.\n" unless @ARGV;

# and store today's date for comparison.
# who needs the stinkin' Date:: modules?!
my ($day, $month, $year) = ((localtime)[3, 4, 5]);
$year+=1900; $month = sprintf("%02.0d", ++$month);
$day = sprintf("%02.0d", $day);  # zero-padding.
my $today = "$year-$month-$day"; # final version.

# loop through each RSS URL.
foreach my $rss_url (@ARGV) {

    # download whatever we've got coming.
    print "Downloading RSS feed at ", substr($rss_url, 0, 40), "...\n";
    my $data = get($rss_url) or print " [err] Data not downloaded!\n";
    next unless $data; # move onto the next URL in our list, if any.

    # parse it and then
    # count the number of items.
    # move on if nothing parsed.
    my $rss = new XML::RSS; $rss->parse($data);
    my $item_count = scalar(@{$rss->{items}});
    unless ($item_count) { print " [err] No parsable items.\n"; next; }

    # sandwich our post between a preface/anteface.
    my $clink = $rss->{channel}->{"link"}; # shorter variable.
    my $ctitle = $rss->{channel}->{title}; # shorter variable.
    my $preface = "From <a href=\"$clink\">$ctitle</a>:\n\n<blockquote>";
    my $anteface = "</blockquote>\n\n"; # new items as quotes.

    # and look for items dated today.
    foreach my $item (@{$rss->{items}}) {

        # no description or date for our item? move on.
        unless ($item->{description} or $item->{dc}->{date}) {
          print " Skipping (no description/date): '$item->{title}'.\n";
          next;
        }

        # if we have a date, is it today's?
        if ($item->{dc}->{date} =~ /^$today/) {

            # shorter variable. we're lazy.
            my $creator = $item->{dc}->{creator};

            # if there's a filter, check for goodness.
            if ($opts{filter} && $item->{description} !~ /$opts{filter}/i) {
                print " Skipping (failed filter): '$item->{title}'.\n";
                next;
            }

            # we found an item to post, so make a
            # final description from various parts.
            my $description = "$preface$item->{description} ";
            $description   .= "($creator) " if $creator;
            $description   .= "<a href=\"$item->{link}\">Read " .
                              "more from this post.</a>$anteface";

            # now, post to the passed blog info.
            print " Publishing item: '$item->{title}'.\n";
            my $id = $mt->metaWeblog(  )->newPost(
                              title       => $item->{title},
                              description => $description,
                              publish     => 1)
                     or die "[err] ", $mt->LastError(  ), "\n";

            # set the category?
            if ($opts{catid}) {
                $mt->mt(  )->setPostCategories(
                              postid     => $id,
                              categories => [ {categoryId => $opts{catid}}])
                or die " [err] ", $mt->LastError(  ), "\n";

                # "edit" the post with no changes so
                # that our category change activates.
                $mt->metaWeblog(  )->editPost(
                              title       => $item->{title},
                              description => $description,
                              postid      => $id,
                              publish     => 1)
                     or die " [err] ", $mt->LastError(  ), "\n";
            }
        } else { 
           print " Skipping (failed date check): '$item->{title}'.\n";
        }
    }
    print "-" x 76, "\n"; # visual separator.
}

exit;
