Spidering Hacks
By Kevin Hemenway, Tara Calishain
October 2003

Hack #89: Filtering for the Naughties
Use search engines to construct your own parental control ratings for sites

As we've attempted to show several times in this book, your scripts don't have to start and end with simple Perl spidering. You can also incorporate various web APIs (such as Technorati's). In this hack, we're going to add some Google API magic to see whether a list of domains pulled off a page contains prurient (i.e., naughty) content, as determined by Google's SafeSearch filtering mechanism.

As the hack is implemented, a list of domains is pulled off Fark (http://www.fark.com), a site known for its odd selection of daily links. For each domain, up to 50 of its URLs (gathered via a Google search) are put into an array, and each URL is then checked to see whether it still appears in a Google search with SafeSearch enabled. If it does, it's considered a good URL; if it doesn't, it falls under suspicion of being a not-so-good URL. The idea is to get a sense of how much of an entire domain is being filtered, instead of judging by just one URL.
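The heart of the check is a single SafeSearch comparison, shown here in miniature. Treat this as a sketch rather than the finished hack: it assumes you have your own Google API key and a local copy of GoogleSearch.wsdl, and the URL being tested is a made-up placeholder:

#!/usr/bin/perl -w
use strict;
use SOAP::Lite;

# assumed setup: your own Google API key and a local GoogleSearch.wsdl.
my $google_key = "your Google API key here";
my $gsrch      = SOAP::Lite->service("file:GoogleSearch.wsdl");

# a placeholder URL, standing in for one pulled from a real domain.
my $url = "example.com/somepage.html";

# search for the URL with filtering and SafeSearch both enabled.
my $result = $gsrch->doGoogleSearch($google_key, $url,
                       0, 1, "true", "", "true", "", "", "");

# no matches under SafeSearch means Google filtered the URL out.
my $hits = $result->{estimatedTotalResultsCount} || 0;
print $hits ? "passed SafeSearch.\n" : "filtered; under suspicion.\n";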

TIP

Filtering mechanisms are not perfect: sometimes they filter pages that aren't bad at all, and sometimes they miss genuinely objectionable content. This script will give you a good general idea of where a domain sits on the naughtiness scale, but it won't be definitive.

Hacking the Hack

You might find something else you want to scrape, such as the links on your own site's front page. Are you linking to something naughty by mistake? How about performing due diligence on a site you're thinking about linking to; will you inadvertently be leading readers to sites of a questionable nature via a seemingly innocent intermediary? Perhaps you'd like to check entries from a specific portion of the Yahoo! or DMOZ directories? Anything that generates a list of links is fair game for this script.
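For example, to feed the script the links from your own front page instead of Fark's, you could swap out the Fark-specific scraping for something like the following sketch. It uses the HTML::LinkExtor module, which the original code doesn't, and http://www.example.com/ is a placeholder for your own site:

use LWP::Simple;
use HTML::LinkExtor;

my $page = get("http://www.example.com/") or die "Couldn't fetch the page.\n";

# collect the href of every <a> tag on the page.
my @links;
my $parser = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    push @links, $attr{href} if $tag eq 'a' and $attr{href};
});
$parser->parse($page);

# each collected link can then stand in for $farkurl in the main loop.
print "$_\n" for @links;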

As it stands, the script checks a maximum of 50 URLs per domain. While this makes for a pretty thorough check, it also makes for a long wait, especially if you have a fair number of domains to check. You may decide that checking 10 URLs per domain is a far better thing to do. In that case, just change this line:

if ($firstresult > 50) { $firstresult = 50; }

to:

if ($firstresult > 10) { $firstresult = 10; }

When Tara originally wrote the code, she was a little concerned that it might be used to parse naughty sites and generate lists of naughty URLs for porn peddling. So, she chose not to display the list of naughty URLs unless they were a sufficiently small proportion of the final results (currently, the threshold is no more than 10 of the 50 URLs). You might want to change that, especially if you're using this script to check links from your own site and want an idea of the kind of content you might be linking to. In that case, you'll need to change just one line:

unless ( $badcount >= 10 || $badcount == 0) {

to:

unless ( $badcount >= 50 || $badcount == 0) {

By increasing the count to 50, you'll be informed of all the bad sites associated with the current domain. Just be forewarned: certain domains may return nothing but the naughties, and even the individual words that make up the returned URLs can be downright disturbing.

The Code

Save the following code as purity.pl:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use SOAP::Lite;

# fill in your Google API information here.
my $google_key  = "your Google API key here";
my $google_wdsl = "GoogleSearch.wsdl";
my $gsrch       = SOAP::Lite->service("file:$google_wdsl");

# get our data from Fark's "Friends" list.
my $fark = get("http://www.fark.com/") or die "Couldn't fetch Fark's front page.\n";
$fark =~ m!Friends:</td></tr>(.*?)<tr><td class="lmhead">Fun Games:!s
  or die "Couldn't find the Friends list on the page.\n";
my $farklinks = $1; # all our relevant links are in here.

# and now loop through each entry.
while ($farklinks =~ m!href="(.*?)"!gism) {
   my $farkurl = $1; next unless $farkurl;
   my @checklist; # urls to check for safety.
   print "\n\nChecking $farkurl.\n";

   # getting the full result count for this URL.
   my $count = $gsrch->doGoogleSearch($google_key, $farkurl,
                        0, 1, "false", "",  "false", "", "", "");
   my $firstresult = $count->{estimatedTotalResultsCount};
   print "$firstresult matching results were found.\n";
   if ($firstresult > 50) { $firstresult = 50; }

   # now, get a maximum of 50 results, with SafeSearch off.
   my $counter = 0;
   while ($counter < $firstresult) {

       my $urls = $gsrch->doGoogleSearch($google_key, $farkurl,
                           $counter, 10, "false", "",  "false", "", "", "");

       foreach my $hit (@{$urls->{resultElements}}) {
           push (@checklist, $hit->{URL}); 
       }
       $counter += 10; # move on to the next page of ten results.
   }

   # and now check each of the matching URLs.
   my (@goodurls, @badurls); # storage.
   foreach my $urltocheck (@checklist) {
       $urltocheck =~ s!http://!!; # strip the scheme before using the URL as a query.

       my $firstcheck = $gsrch->doGoogleSearch($google_key, $urltocheck,
                                 0, 1, "true", "",  "true", "", "", "");

       # check our results. if no matches, it's naughty.
       my $firstnumber = $firstcheck->{estimatedTotalResultsCount} || 0;
       if ($firstnumber == 0) { push @badurls, $urltocheck; }
       else { push @goodurls, $urltocheck; }
   }

   # and spit out some results.
   my ($goodcount, $badcount) = (scalar(@goodurls), scalar(@badurls));
   print "There are $goodcount good URLs and $badcount ".
         "possibly impure URLs.\n"; # wheeEeeeEE!

   # display the bad URLs if there are only a few.
   unless ( $badcount >= 10 || $badcount == 0) {
       print "The bad URLs are\n";
       foreach (@badurls) {
          print " http://$_\n"; 
       }
    }

   # happy percentage display. multiplying by 2 works because the
   # percentage is only shown when exactly 50 URLs were checked.
   my $percent = $goodcount * 2;
   my $total   = $goodcount + $badcount;
   if ($total == 50) { print "This URL is $percent% pure!\n"; }

}
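
Running the Hack

Run the script from the command line, with no arguments:

% perl purity.pl

The output follows the script's print statements. The domain and counts below are invented, purely to show the shape of a run:

Checking http://www.example.com.
2340 matching results were found.
There are 48 good URLs and 2 possibly impure URLs.
The bad URLs are
 http://www.example.com/naughty.html
 http://www.example.com/other.html
This URL is 96% pure!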

