Using the Google API to Construct a Document Corpus

by Schuyler Erle

Recently, Peter Sergeant appealed to the Perl community for help finding bugs in his Rich Text Format parser, better known as RTF::Tokenizer, which he hopes will be the centerpiece of a whole suite of Perl modules for manipulating RTF documents. As an incentive, Pete offered to donate $10 to the Perl Foundation for every major bug reported. He also announced, separately, that he was looking for sample documents containing particular RTF features, to exercise specific aspects of the tokenizer.


After learning about his project, I suggested to Peter that he might consider using a web spider to automatically locate and assemble a corpus of sample RTF documents that he could use to stress-test his toolkit, as bits of it get written. His unhesitating response was, to paraphrase, "Great idea! Can you help me out by writing it?"


So, of course, I agreed, and my first thought was -- why write a spider to crawl the Web when Google has already indexed the thing for us? Google supports a very nice filetype: parameter that lets you specify the document format you want in your search results; to that I just added +the, to request any RTF document containing the word "the" -- i.e., most of them -- and voilà!
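The resulting query, exactly as the script below constructs it, is simply:

filetype:rtf +the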


Naturally, doing this programmatically, in a way that adheres to Google's Terms of Use, means using their amazing SOAP API. You first need to register with Google for a developer key in order to access the service, but all they want is your name and a verifiable e-mail address, so registration took only a couple of minutes.
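To give a sense of how little code is involved, here is a minimal sketch of a single query against the Google SOAP service -- the full script below does the same thing, with paging and error handling. The only thing you supply is the developer key you got when you registered:

#!/usr/bin/perl -w
# Minimal sketch: one Google SOAP query for RTF documents containing "the".
use strict;
use SOAP::Lite;

my $key    = shift or die "usage: $0 <google_key>\n";
my $google = SOAP::Lite->service("http://api.google.com/GoogleSearch.wsdl");

# key, q, start, maxResults, filter, restrict, safeSearch, lr, ie, oe
my $result = $google->doGoogleSearch(
    $key, "filetype:rtf +the", 0, 10, 0, '', 0, '', '', ''
);

# Print the URL of each of the ten results returned.
print $_->{URL}, "\n" for @{ $result->{resultElements} };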


Next, following the directions and borrowing some sample code from Rael Dornfest's weblog entry on the subject, I was quickly able to hack together a script that asks Google for RTF documents containing common English words, fetches those documents off the Web, and stores them in the local filesystem under a filename derived from the MD5 hash of each document's contents. This has the advantage of ensuring that each document gets a (more or less) unique filename in the directory, and gives us a way of telling whether we already have a given document in the corpus, should it turn out to be accessible via multiple URLs. Finally, the script builds a YAML index recording where each file came from, for future reference. (The index could have been encoded in just about anything, including plain, whitespace-separated text.)
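For the curious, the resulting catalog.yaml index simply maps each (truncated) MD5 hash to the URL the document was fetched from. It ends up looking something like this -- the hashes and URLs here are made up for illustration:

---
0123456789abcdef: http://www.example.com/reports/annual.rtf
fedcba9876543210: http://www.example.edu/syllabus/spring.rtf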


Anyway, the following script performed extraordinarily well, downloading 3,295 unique RTF documents from the Web in a little over three hours, for a total of over 425 MB of data. Finding this body of RTF documents required 400 Google search queries -- well within the daily limit of 1,000 per user. According to Peter, between 1 and 2% of the documents in the corpus turned out not to be even remotely valid RTF, for whatever that's worth. The script could probably also be optimized to perform the task an order of magnitude faster, but time was not of the essence for me, so I leave that as an exercise for the reader.


The thing that I was amazed by, and the thing I want you to take away from this little tale, is the sheer elegance of the Google web services API. It was trivial to program against, and it just worked, the very first time.


The code from my script follows, or can be downloaded here, in case you're interested. Happy hacking!




Listing: rtf-catalog.pl



#!/usr/bin/perl -w

use LWP::Simple;
use SOAP::Lite;
use Digest::MD5 qw( md5_hex );
use YAML qw( LoadFile DumpFile );
use IO::File;
use strict;

my $key = shift(@ARGV) or die "usage: $0 <google_key> [<catalog.yaml>]\n";
my $type = "rtf";
my @words = qw( the is of and );
my $cat_file = shift(@ARGV) || "catalog.yaml";
my ($catalog, $seen);

if (-r $cat_file) {
    warn "* Loading catalog from $cat_file ...\n";
    $catalog = LoadFile( $cat_file );
    $seen->{$catalog->{$_}} = $_ for keys %$catalog;
} else {
    warn "Can't load catalog from $cat_file. Creating a new one...\n";
    $catalog = {};
}

my $start = 0; # $catalog->{START} || 0;
my $done = 0;
my $word;

$SIG{INT} = $SIG{HUP} = $SIG{TERM} = sub { $done++ };

warn "* Initiating Google search service...\n";
my $google = SOAP::Lite->service("http://api.google.com/GoogleSearch.wsdl");

until ($done) {
    if (not $word or $start >= 1000) { # Google doesn't return results > 1000
        $word = shift @words; $start = 0;
        if ($word) {
            warn "* Now using search term '$word'\n";
        } else {
            warn "* Run out of search terms! Done.\n";
            exit;
        }
    }

    warn "* Querying Google for results $start + ...\n";

    # key, q, start, maxResults, filter, restrict, safeSearch,
    # lr, ie, oe
    my @params = ($key, "filetype:$type +$word", $start,
                  10, 0, '', 0, '', '', '');
    my $result = $google->doGoogleSearch(@params);

    for my $item (@{$result->{resultElements}}) {
        last if $done; # someone hit the stop button

        my $url = $item->{URL}; # make sure it's RTF
        next unless $url =~ /\.$type$/o;

        if ($seen->{$url}) { # already have it.
            warn "= $url\n";
            next;
        }

        warn "+ $url\n";
        my $data = get( $url );
        unless ($data) {
            warn "Can't load $url?\n";
            next;
        }

        my $md5 = md5_hex( $data );
        $md5 = substr($md5, 0, 16); # leave somewhat manageable filenames
        my $file = "$md5.$type";
        if (-r "$file") { # Already have it.
            warn "| $url = $file\n";
            next;
        }

        my $fh = IO::File->new(">$file");
        unless ($fh) {
            warn "Can't write to $file??\n";
            next;
        }

        warn " -> $file\n";
        $fh->print($data);
        $fh->close;
        $catalog->{$md5} = $url;
        $seen->{$url} = $md5;
    }

    warn "* Writing catalog...\n";
    DumpFile( $cat_file, $catalog );

    $start += 10 unless $done;
}



What kinds of novel uses have you found for Google's web services API? Can you recommend any improvements to this approach to document corpus building?


1 Comment

gojomo
2003-04-17 22:06:34
You could also use "inurl:"...
...instead of searching for "the", since most RTF docs will end in ".rtf". That is...


filetype:rtf inurl:rtf


Although either gets you more hits than you can probably use, the "inurl:" trick actually yields a larger Google hit total estimate.