Publish the PageRank Algorithm Now!

by Harold Davis

Related link: http://www.braintique.com/research/mt-archives/000130.shtml



Google Enterprise general manager David Girouard is quoted in a recent Information Week article as saying that Google's PageRank algorithm uses more than 100 variables in its calculations.

Google's PageRank algorithm is used for the all-important determination of how search results are ordered. In other words, the higher a page's PageRank, the more likely you are to find that page using Google. Most people display Google search results ten per page. Studies have shown that there is a huge difference in the number of click-throughs you get if your result is one of the top three, and also that there is close to 100% fall-off in click-throughs after three pages (or thirty results). This helps to spell out the importance of PageRank and its gate-keeping function towards the information available on the Web.

If it is true that more than 100 variables are used to calculate a given Web page's PageRank, then PageRank has come a long way from the rather simple mechanism published by Brin and Page in their graduate student papers, and used by Google in the early days.

In the proto-PageRank system published by Brin and Page, a page's PageRank is a fraction calculated recursively by summing the PageRanks of the pages that link to it (each divided by the number of links on the linking page), and applying a simple damping factor representing how likely it is that a surfer keeps following links rather than jumping to some other page. In this theoretical Web universe, the sum of all PageRanks is always 1. Here's some material from Building Research Tools with Google for Dummies about how Google works.
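
To make the recursive calculation concrete, here is a minimal Python sketch of that early, published style of PageRank. It is an illustration under stated assumptions, not Google's code: the function name pagerank, the four-page toy Web, and the iteration count are all made up for this example. It uses the normalized form of the formula so that the ranks sum to 1, a damping factor of 0.85 (the value usually cited from the Brin and Page paper), and it spreads the rank of any page with no outbound links evenly over all pages, which is one common convention among several.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by repeated passes over a small link graph.

    links maps each page to the list of pages it links to (hypothetical
    toy input, not a real crawl).
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    # Start with rank spread evenly, so the ranks sum to 1.
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Assumption: rank held by pages with no outbound links is
        # redistributed evenly, which keeps the total at 1.
        dangling = sum(ranks[p] for p in pages if not links.get(p))
        new_ranks = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}

        # Each page passes an equal share of its damped rank to every
        # page it links to.
        for page, targets in links.items():
            if not targets:
                continue
            share = damping * ranks[page] / len(targets)
            for target in targets:
                new_ranks[target] += share

        ranks = new_ranks
    return ranks


# A made-up four-page Web: a links to b and c, b to c, c to a, d to c.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy Web, page c collects links from three of the four pages and ends up ranked first, while page d, which nothing links to, is left with only the small baseline share that every page receives.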

It's amusing to note that the term "PageRank" was probably coined to reflect Larry Page's role as the creator of the concept rather than because it is about ranking pages.

There is something deeply troubling about the complex and opaque nature of the unpublished, 100+ variable PageRank algorithm as it stands today. In effect, this means that nobody (except Google insiders) understands how information in this most important of information portals passes the gatekeepers.

It's probably unreasonable to expect Google to publish how PageRank really works in light of competition from other search engines, and the efforts of SEO Webmasters to game the system. But not publishing the details of the PageRank algorithm goes against the tenets of open source espoused by many who work at Google, violates the idea that information should be freely available (after all, this is a most important piece of meta information!), and deprives Google of the open-source-like benefits of community scrutiny.

So I say, free the PageRank algorithm now!

(This post is adapted from a March 31, 2005 Googleplex Blog entry.)

2 Comments

aristotle
2005-04-19 20:23:20
What for?
Sure, it would be nice to know why my site is ranked as it is.


But what’s in it for Google? The people who can and would properly scrutinize and improve the algorithm are limited enough that Google can probably hire them all. On the other hand, there’s an infinite amount of vultures who would use this information to the detriment of the service’s quality.


A perfectly understood, useless system full of spam may be perfectly understood, but it’s still useless.

mattfein
2005-04-20 04:53:04
What for?
I agree. Publishing the algorithm is simply a bad idea. In addition, it's reasonable to assume that the algorithm is adjusted regularly to prevent gaming the system by automated 'black box' analysis, so any published algorithm would not be particularly edifying, except to Google's competitors.