The Greatest Test of Open Source: Beating Google

by Steve Mallett

In the last couple of years one of the greatest software engineering projects has surfaced and become a household name. Google. One thing powers its greatness. Software.

The open source world prides itself rightly on its incredible successes. Apache Server, Linux, all email software worth mentioning, and recently Firefox. These are, have been, and will forever be marvelous feats.

What real technological competition have they been up against? Firefox vs the long-abandoned IE. Apache against IIS. Linux vs Windows Server (Unix is technologically great, but cut its own throat to succeed en masse). Frankly, these successes have balanced more on putting out the word that they exist, disarming FUD, and the willingness of people to try something new.

Google. Its technological greatness is revered by all. Others like Yahoo are chasing it, but at best they'll do nothing more than chase it. They have no real advantage over Google.

A search site takes a lot more than just bitchin' software. There are a lot of costs. Bandwidth for crawling is the biggie, then there's serving results, hardware, and people.

Enter Nutch. Nutch is an open source search engine: crawler, indexer, etc. The project appears to have been a bit dormant since its first media splash a few years ago, but has just recently entered incubation at the Apache Software Foundation.

As I write this I have Nutch crawling a few sites just to test it out on my own. It's the fifth of my tests. I'm increasing the search depth, and playing with a few of its knobs & buttons. The first few tests worked, but weren't terribly compelling. Not that the Nutch site doesn't give you the straight goods upfront. Their site says, "Nutch has not yet been tuned for quality. There are ten or twenty knobs that we can twiddle to adjust the ranking formula. We are developing software to do this tuning automatically, but the current code just contains guesses. With a little tuning we should be able to get results that are competitive with those of major search engines."
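
To make the "knobs" idea concrete, here is a minimal sketch of a ranking formula with a handful of tunable weights. It is not Nutch's actual formula or API: the Knobs fields, the document fields (title, anchor, body, inlinks), and the default values are all assumptions made up for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Knobs:
    """Hypothetical tunable weights; names and defaults are illustrative guesses."""
    title_boost: float = 2.0
    anchor_boost: float = 1.5
    body_boost: float = 1.0
    link_weight: float = 0.5


def count_hits(text, query_terms):
    """Count how many times any query term appears in the text."""
    words = text.lower().split()
    return sum(words.count(term.lower()) for term in query_terms)


def score(doc, query_terms, knobs):
    """Weighted sum of term hits per field, plus a damped inbound-link count."""
    text_score = (
        knobs.title_boost * count_hits(doc["title"], query_terms)
        + knobs.anchor_boost * count_hits(doc["anchor"], query_terms)
        + knobs.body_boost * count_hits(doc["body"], query_terms)
    )
    return text_score + knobs.link_weight * math.log1p(doc["inlinks"])


def rank(docs, query_terms, knobs):
    """Return the documents ordered from highest to lowest score."""
    return sorted(docs, key=lambda d: score(d, query_terms, knobs), reverse=True)
```

Twiddling a single knob can reorder the results: raising title_boost favors a page with the query term in its title over a page that merely repeats the term in its body. Tuning "for quality" amounts to searching for the knob values that put the pages people actually want at the top.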

Attract some more developers and I bet this happens sooner rather than later.

I think a commercial search engine based on Nutch could be a huge deal. Such an operation requires a ton of money for equipment and bandwidth, so it would have to pay its own bills. However, the open source software component would give such an operation a scrappy little advantage. If open source can take on a truly great competitor, the operation would have the distinct advantage of better results without the personnel overhead that Google, Yahoo!, and their ilk carry. The new search site would want to hire key people so they don't have to worry about paying the rent and feeding the kids, but that's a lot more talent available for less.

I think, and I'm really only guessing, that Nutch hasn't prospered to where I would like it to be because of the costs of running the operation. To truly test the system you need a big index. You need to spend a lot of money crawling. To test it against Google anyway. How big? Well, the Internet Archive hosts "some work" of Nutch's. They seem to have more bandwidth than the average bear.

Back to the main point... given the resources, could an open source software-based search engine beat a great proprietary competitor? There's only one way to find out, and what counts most of all is real results.

6 Comments

tima
2005-02-12 18:59:26
More than one thing powers Google's greatness.

Google. One thing powers its greatness. Software.


I would disagree. The infrastructure -- particularly its scale -- is as much a part of their greatness (perhaps even more so) as the software that powers it. This is implied, though not acknowledged:


...Nutch hasn't prospered to where I would like it to be because of the costs of running the operation. To truly test the system you need a big index. You need to spend a lot of money crawling. To test it against Google anyway.


So, I really don't think this is a challenge open source can entirely address since there is no such thing as "open source hardware" or "open source bandwidth" that is necessary to match Google's feat.

dumbfounder
2005-02-13 09:55:14
search engine costs
I think time is by far the biggest cost when developing/deploying a search engine, even if you have great software to work with. For open source developers to be able to tune their software to crawl the internet effectively (which means avoiding the endless sea of web spam) and to produce relevant results for different indexes may be too great a task. I speak from experience: I have been developing a search engine full time over the past 15 months. I have used cheap bandwidth (7 residential DSL lines costing about $400/month total) to download about half a billion pages, which is plenty of data to test my algorithms. Check it out at dumbfind.com. (Not all half billion pages are online.)

spaceman
2005-02-14 07:15:48
More than one thing powers Google's greatness.
"I would disagree. The infrastructure -- particularly its scale -- is as much a part of their greatness (perhaps even more so) as the software that powers it. This is implied, though not acknowledged:"


Jeez, I dunno about that. Yahoo certainly has a ton of infrastructure & they don't touch Google.


"So, I really don't think this is a challenge open source can entirely address since there is no such thing as "open source hardware" or "open source bandwidth" that is necessary to match Google's feat."


That's true. I was alluding (or thought I was) to this with the link to a commercial entity to ramp up that part of the equation.

babelex
2005-02-14 16:05:33
Open source needs to do it different
The resources of hardware and bandwidth are quite prohibitive.


Thus open source attempts should tackle it from a different angle. Rather than attempting the Google/Yahoo/MSN route, an alternative route like SETI should be used.


That is, leverage the unused bandwidth and cycles of the community and beyond to do the trawling and analysis. Just centralise the summaries maybe.


I'm no search guru beyond standard inverted trees and ratings, but I bet it is possible to do it in a distributed fashion like this and it would remove the resource issues.


worth thinking about anyhow...
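
A rough sketch of how that split might work, assuming each volunteer crawls a fixed share of hosts, builds a small local term-to-URL summary, and sends only that summary back to be merged centrally. All function names here are hypothetical, and real crawling, politeness rules, and trust between volunteers are left out entirely.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def assign_volunteer(url, num_volunteers):
    """Pick a volunteer by hashing the hostname, so one host maps to one volunteer."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_volunteers


def partition(urls, num_volunteers):
    """Group URLs into per-volunteer work lists."""
    buckets = defaultdict(list)
    for url in urls:
        buckets[assign_volunteer(url, num_volunteers)].append(url)
    return buckets


def summarise(pages):
    """Build a local summary (term -> set of URLs) from a dict of url -> page text."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in set(text.lower().split()):
            index[term].add(url)
    return index


def merge(summaries):
    """Centralise the volunteers' summaries into one combined index."""
    merged = defaultdict(set)
    for summary in summaries:
        for term, urls in summary.items():
            merged[term] |= urls
    return merged
```

Hashing on the hostname rather than the full URL keeps every page of a site with the same volunteer, which makes it easier to avoid hammering one server from many machines at once.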

Cell-Phone-Search-Engine
2005-09-16 22:24:36
search engine operational cost: can outsource!
I think you can outsource your operational people to India or China, because they can remotely manage the platform. - Roboo Meshfire

sprinko
2006-09-08 09:28:18
Great Search Engine, check out our recent web portal. Sprinko.com is a Fun way to search the web for news, images, articles, encyclopedia, dictionary and videos.