O'Reilly Hacks

Search Yesterday's Index
Monitor a set of queries for new finds added to the Google index yesterday

The Code

Save the following code as goonow.pl. Be sure to replace "insert key here" with your Google API key along the way. The script expects the GoogleSearch.wsdl file to be in the current directory.

#!/usr/local/bin/perl -w
# goonow.pl
# Feeds queries specified in a text file to Google, querying
# for recent additions to the Google index.  The script appends
# to CSV files, one per query, creating them if they don't exist.
# usage: perl goonow.pl [query_filename]
# My Google API developer's key.
my $google_key='insert key here';
# Location of the GoogleSearch WSDL file.
my $google_wdsl = "./GoogleSearch.wsdl";
use strict;
use SOAP::Lite;
use Time::JulianDay;
$ARGV[0] or die "usage: perl goonow.pl [query_filename]\n";
my $julian_date = int local_julian_day(time) - 2;
my $google_search  = SOAP::Lite->service("file:$google_wdsl");
open QUERIES, $ARGV[0] or die "Couldn't read $ARGV[0]: $!";
while (my $query = <QUERIES>) {
  chomp $query;
  warn "Searching Google for $query\n";
  $query .= " daterange:$julian_date-$julian_date";
  (my $outfile = $query) =~ s/\W/_/g;
  open (OUT, ">> $outfile.csv")
    or die "Couldn't open $outfile.csv: $!\n";
  my $results = $google_search ->
    doGoogleSearch(
      $google_key, $query, 0, 10, "false", "",  "false",
      "", "latin1", "latin1"
    );
  foreach (@{$results->{'resultElements'}}) {
    print OUT '"' . join('","', (
      map {
        s!\n!!g; # drop spurious newlines
        s!<.+?>!!g; # drop all HTML tags
        s!"!""!g; # double escape " marks
        $_; # return the cleaned field, not the substitution count
      } @$_{'title','URL','snippet'}
    ) ) . "\"\n";
  }
  close OUT;
}
close QUERIES;
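
The CSV cleanup step can be tried on its own, without a Google key. The following is a minimal sketch, not part of goonow.pl: csv_row is a hypothetical helper name, and the sample result is made up for illustration. It applies the same three substitutions as the script to each field and joins the fields into one quoted CSV line.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of goonow.pl): turn a list of fields
# into one CSV line using the same three substitutions as the script.
sub csv_row {
    my @fields = @_;                # copy, so the caller's data is untouched
    for (@fields) {
        $_ = '' unless defined $_;  # treat a missing field as empty
        s!\n!!g;                    # drop spurious newlines
        s!<.+?>!!g;                 # drop all HTML tags
        s!"!""!g;                   # double each " mark, CSV style
    }
    return '"' . join('","', @fields) . "\"\n";
}

# Made-up sample result, for illustration only.
print csv_row('<b>Perl</b> news', 'http://www.example.com/', "say \"hi\"\n");
# prints: "Perl news","http://www.example.com/","say ""hi"""
```

The sketch uses a for loop rather than map because s/// returns the number of substitutions made, not the modified string; looping over a copy of the fields sidesteps that.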

You'll notice that GooNow checks the day before yesterday's additions rather than yesterday's (my $julian_date = int local_julian_day(time) - 2;). Google indexes some pages very frequently; these show up in yesterday's additions and really bulk up your search results. So if you search for yesterday's results, then along with updated pages you'll get a lot of noise: pages that Google indexes every day, rather than the fresh content that you're after. Skipping back one more day is a nice hack to get around the noise.
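
To see what the date arithmetic produces, here is a rough sketch that builds the same kind of daterange: query using only core Perl modules. It is an approximation under stated assumptions, not the Time::JulianDay implementation: julian_day_offset is a hypothetical helper name, "perl hacks" is a made-up query, and the constant 2440588 is the Julian day number of 1 January 1970.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Hypothetical helper: Julian day number of the local calendar date that
# contains $epoch, shifted by $offset days (-2 = the day before yesterday).
sub julian_day_offset {
    my ($epoch, $offset) = @_;
    my ($mday, $mon, $year) = (localtime $epoch)[3, 4, 5];
    # Midnight UTC of that calendar date, as whole days since 1970-01-01.
    my $days = int(timegm(0, 0, 0, $mday, $mon, $year + 1900) / 86400);
    # 2440588 is the Julian day number of 1970-01-01.
    return $days + 2440588 + $offset;
}

# Same start and end date, so the range covers exactly one day.
my $jd = julian_day_offset(time, -2);
print "perl hacks daterange:$jd-$jd\n";
```

Changing the offset from -2 to -1 would query yesterday's additions instead, which is the noisier search described above.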

© 2007 O'Reilly Media, Inc.