Serve Paid Content to Spiders and the Public as Babel

by Sid Steward

Related link: http://lookleap.com/site/mixup



Here's a working example and PHP script for randomizing web page text in-place. The result scans well and indexes well, but it doesn't give the story away.
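The PHP script itself is not reproduced in this post. Purely as an illustration of the idea, here is a minimal Python sketch. It assumes that "randomizing in place" means shuffling the words within each run of text while leaving the HTML tags where they are. The name `mix_up`, the regular expression, and that assumption are illustrative and not taken from the script.

```python
import random
import re

# Assumption: anything between "<" and ">" is a tag and must stay put.
TAG = re.compile(r"(<[^>]*>)")


def mix_up(html, rng):
    """Shuffle the words inside each text run, keeping tags in place."""
    # Splitting on a capturing group keeps the tags in the result:
    # even indexes are text runs, odd indexes are tags.
    pieces = TAG.split(html)
    out = []
    for i, piece in enumerate(pieces):
        words = piece.split()
        if i % 2 == 1 or not words:
            out.append(piece)  # a tag, or a run with no words in it
            continue
        rng.shuffle(words)
        # Keep the surrounding whitespace so word boundaries next to
        # tags survive the shuffle.
        lead = piece[: len(piece) - len(piece.lstrip())]
        trail = piece[len(piece.rstrip()):]
        out.append(lead + " ".join(words) + trail)
    return "".join(out)


if __name__ == "__main__":
    page = "<p>The quick brown fox</p> <p>jumps over the lazy dog</p>"
    print(mix_up(page, random.Random()))
```

Because words never cross a tag boundary, each paragraph keeps its own vocabulary, which is roughly what lets a reader skim the topic without getting the story. A real page would need more care than this sketch takes, for instance skipping the contents of script and style elements.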


I think this would be a good way to expose paid content to the free web. It would be friendlier to users than an access-denied page, and skimming the randomized page gives the reader an idea of what the page is about, tempting her into buying the content.


You can try it out on my mixup page (see the related link above).


Download the PHP Code Here



I worry, however, that a search engine might detect the random pattern (so to speak) and consider the page spam. I would appreciate your insight here.



Would search engines flag randomized web pages as spam?


5 Comments

aristotle
2005-10-05 07:18:55
Re:
Problem: as a countermeasure to search engine scamming, Google (and by now probably every other major engine as well) occasionally crawls your pages with an agent that does not reveal itself as a crawler, and penalises you if it finds that the content differs consistently.
sid_steward
2005-10-05 08:24:00
Re: spoofing
Thanks for the note. I imagined that a publisher could serve mixed-up content to both the search engines and the non-paying public. Only logged-in subscribers would see the straight text. That way the publisher would pass your test.
sid_steward
2005-10-05 11:02:16
Re:
Wait... I think I see what you mean. Randomly changing the page from view to view would look bad to the spider. Gotcha. Caching the randomized page should solve that problem. Same babel each time.
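A rough Python sketch of the caching idea, under stated assumptions: the page is shuffled once, the result is stored under its URL, and every later request gets the stored copy. The class name `MixupCache` and the in-memory dictionary are hypothetical choices for illustration only.

```python
import random


class MixupCache:
    """Remember the first shuffled version of each page and reuse it."""

    def __init__(self, rng=None):
        self._rng = rng or random.Random()
        self._pages = {}  # url -> shuffled text

    def get(self, url, text):
        # Shuffle only on the first request for this URL.
        if url not in self._pages:
            words = text.split()
            self._rng.shuffle(words)
            self._pages[url] = " ".join(words)
        return self._pages[url]
```

An in-memory dictionary is lost on restart and goes stale if the article is edited, so a real site would presumably store the shuffled page on disk and refresh it whenever the content changes.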
aristotle
2005-10-05 13:06:01
Re:
Ah – yes, then it would work.
aristotle
2005-10-05 13:11:36
Re:
No, I did mean that the public and the spider seeing different things would get you penalised by search engines.


I considered it a given that the page should not change from view to view, but that’s easy to achieve. For example, you could calculate a hash of the page content and use it as the seed for your random number generator. As long as the page content doesn’t change, neither will its hash, so the output would always be jumbled the same way. Anyway, those are implementation details.
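The hash-as-seed idea can be sketched in a few lines of Python. This is only one possible reading of it: the choice of SHA-256, the use of the first eight bytes of the digest as the seed, and the function name `seeded_mixup` are all assumptions made for the example.

```python
import hashlib
import random


def seeded_mixup(text):
    """Shuffle the words of text, seeded by a hash of the text itself."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Same content -> same digest -> same seed -> same shuffle.
    seed = int.from_bytes(digest[:8], "big")
    rng = random.Random(seed)
    words = text.split()
    rng.shuffle(words)
    return " ".join(words)
```

Calling `seeded_mixup` twice on the same text returns the same jumble without storing anything, while any edit to the text changes the hash and so produces a fresh shuffle.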