The hardware and bandwidth needed to run a web-scale search engine are quite prohibitive.
So open source attempts should tackle it from a different angle. Rather than attempting the Google/Yahoo/MSN route, an alternative route like SETI@home should be used.
That is, leverage the unused bandwidth and cycles of the community and beyond to do the trawling and analysis. Maybe just centralise the summaries.
I'm no search guru beyond standard inverted trees and ratings, but I bet it is possible to do it in a distributed fashion like this, and it would remove the resource issues.
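To make that a bit more concrete, here is a rough Python sketch of the idea, not a real design. Each volunteer node indexes whatever pages it was given and only sends a small summary (term -> number of matching pages) to the centre, which uses the summaries to decide which nodes to ask. All the names here (build_local_index, summarise, Coordinator, the node ids and the sample pages) are made up for illustration, and everything runs in one process, so there is no real crawling or networking.

```python
from collections import defaultdict


def build_local_index(docs):
    """Volunteer node: build an inverted index mapping term -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def summarise(index):
    """Summary sent to the centre: term -> number of matching docs on this node."""
    return {term: len(doc_ids) for term, doc_ids in index.items()}


class Coordinator:
    """Central piece (hypothetical): stores only summaries, never full pages."""

    def __init__(self):
        # term -> {node id -> number of matching docs on that node}
        self.term_to_nodes = defaultdict(dict)

    def receive(self, node_id, summary):
        for term, count in summary.items():
            self.term_to_nodes[term][node_id] = count

    def route(self, term):
        """Return the nodes that hold the term, most matches first."""
        nodes = self.term_to_nodes.get(term.lower(), {})
        return sorted(nodes, key=lambda n: (-nodes[n], n))


# Made-up sample data: two volunteer nodes, each holding a few "pages".
node_docs = {
    "node-a": {"a1": "open source search engine", "a2": "search the web"},
    "node-b": {"b1": "distributed search with spare cycles"},
}

coordinator = Coordinator()
local_indexes = {}
for node_id, docs in node_docs.items():
    local_indexes[node_id] = build_local_index(docs)
    coordinator.receive(node_id, summarise(local_indexes[node_id]))


def search(term):
    """Ask only the nodes the coordinator points at, then merge their answers."""
    results = set()
    for node_id in coordinator.route(term):
        results |= local_indexes[node_id].get(term.lower(), set())
    return sorted(results)
```

With the sample data above, search("search") gathers hits from both nodes, while search("cycles") only has to bother node-b. The hard parts (untrusted volunteers, nodes dropping offline, ranking across nodes) are not touched here at all.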
Worth thinking about anyhow...