The search for search's next generation
by Andy Oram
Current search engines--even the constantly surprising Google--seem
unable to leap the next big barrier in search: the trillions of bytes
of dynamically generated data created by individual Web sites around
the world, or what some researchers call the "deep web." You can't
look up the status of a Federal Express package without going to the
Federal Express site, or the details on an eBay item without checking
the eBay site. Dynamically generated data can't be spidered.
But the article cited above shows how this barrier is slowly
cracking. Now I can enter "fedex 791725670102" into Google (not
Federal Express) and discover that the jigsaw puzzle I mailed to an
author in Australia was signed for by him.
Of course, Google has to send me to the Federal Express site (which
takes an extra click) to complete the search, but the principle is
established: a search at Google can kick off a deep search on another
site.
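To make the hand-off concrete, here is a minimal sketch of how an aggregating search site might recognize a query like "fedex 791725670102" and pass it along to the site that owns the data. Everything here is an assumption made for illustration: the routing table, the `route_query` function, the parameter name, and the `tracking.example.com` address are hypothetical, and nothing in it reflects how Google or Federal Express actually implement the feature.

```python
import re
from urllib.parse import urlencode

# Hypothetical routing table. Each entry pairs a query pattern with a
# made-up endpoint and parameter name; none of these come from a real service.
ROUTES = [
    (re.compile(r"^fedex\s+(\d+)$", re.IGNORECASE),
     "https://tracking.example.com/lookup", "tracking_number"),
]

def route_query(query):
    """Return a hand-off URL if the query matches a known pattern, else None."""
    text = query.strip()
    for pattern, endpoint, param in ROUTES:
        match = pattern.match(text)
        if match:
            # The aggregator does not answer the query itself; it only
            # builds a link to the site that can run the deep search.
            return endpoint + "?" + urlencode({param: match.group(1)})
    return None

if __name__ == "__main__":
    print(route_query("fedex 791725670102"))
    print(route_query("jigsaw puzzle"))
```

A query that matches no pattern returns None, which in this sketch stands for falling back to an ordinary spidered search.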
The burn-out of the dot-com era left a smoldering envy of those few
dot-commers that managed to stay alive. Google is foremost among
these. If it can continue pulling in dynamic data from more and more
sites, its dominance may well continue--for access to dynamic data
is indeed the key to the next big improvement in search.
A generalization of the Google/FedEx collaboration would lead to what
is commonly called metasearch,
a peer-to-peer solution to the search problem that involves a
radically different architecture from any of the current popular
engines. I said different, not new. The idea of peer-to-peer search
was aired at least as far back as early 2000. I described it in my
writing on peer-to-peer systems in May of that year:
Gnutella is a fairly simple protocol. It defines only how a string
is passed from one site to another, not how each site interprets
the string. One site might handle the string by simply running
fgrep on a bunch of files, while another might insert it
into an SQL query, and yet another might assume that it's a set of
Japanese words and return rough English equivalents, which the
original requester may then use for further searching. This
flexibility allows each site to contribute to a distributed search
in the most sophisticated way it can. Would it be pompous to
suggest that Gnutella could become the medium through which search
engines operate in the 21st century?
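The flexibility described in that passage can be sketched in a few lines: one query string goes to every participating site, and each site decides for itself what the string means. This is only an illustration under loud assumptions. It runs in a single process rather than over a network, it is not the Gnutella protocol, and the three sites, their sample data, and all function names are hypothetical.

```python
import sqlite3

def fgrep_site(query):
    """Treat the string as a literal substring to find in local files."""
    files = {  # stand-in for files on disk
        "notes.txt": "jigsaw puzzle mailed to Australia",
        "todo.txt": "buy stamps",
    }
    return sorted(name for name, text in files.items() if query in text)

def sql_site(query):
    """Insert the string into an SQL query (parameterized, not pasted in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (title TEXT)")
    conn.executemany("INSERT INTO items VALUES (?)",
                     [("wooden puzzle",), ("road atlas",)])
    rows = conn.execute(
        "SELECT title FROM items WHERE title LIKE ? ORDER BY title",
        ("%" + query + "%",)).fetchall()
    conn.close()
    return [title for (title,) in rows]

def glossary_site(query):
    """Assume the string is Japanese words and return rough English equivalents."""
    glossary = {"パズル": "puzzle", "地図": "map"}  # toy glossary
    return [glossary[word] for word in query.split() if word in glossary]

def distributed_search(query, sites):
    """Pass the same string to every site and collect each site's own answer."""
    return {name: handler(query) for name, handler in sites.items()}

SITES = {"fgrep": fgrep_site, "sql": sql_site, "glossary": glossary_site}

if __name__ == "__main__":
    print(distributed_search("puzzle", SITES))
    print(distributed_search("パズル", SITES))
```

The caller never needs to know how a site interprets the string. As in the passage, a requester could feed the glossary site's English output back in as a second query.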
What's holding back metasearch is the lack of standards for
categorizing data and knowing what to search for. It's easy to guess
that "fedex 791725670102" should be interpreted as a search for a
Federal Express package, but anything less strictly defined is a big
problem.
A lot of people have dumped on the ideal of metadata, notably Cory
Doctorow in the article
So the waters of the deep web will be slow to stir, but as the
benefits become clear, more and more sites may emerge.
What business model would drive metasearch? That question is classic
in peer-to-peer systems, because distributed systems typically have
problems generating and distributing income. Sites could be motivated
to solve the metadata problem because they'd draw more traffic by
joining the system, and expose more of their data to people's
searches.
As for the aggregating site--Google or a competitor--it would
potentially have an easier road to profitability than Google has
now. The aggregating site could continue to derive revenue from ads
and from the sale of search software. Since the computing resources it
needed would be far fewer than Google needs today, it would need
less revenue from ads and sales. And since the use of its software
would be a prerequisite to joining (although one hopes it would
tolerate the use of compatible, competing software) it should be able
to land more sales.
Can metasearch become widespread?
The business of search
If the recent woes of the music industry have taught us anything, it's that the easier it becomes to perform a service, the less people are willing to pay for it. If the computational resources for "deep" searching are drastically reduced, more and more people will jump into the game. I'm not saying that's a bad thing; users will benefit. But such deep search would be an extremely disruptive technology. If Google is looking to go that route, it's going to have to leverage it into something even better if it wants to make money from it.
the role of metadata in searching
As part of his weblog, Tim Bray has written an excellent introduction to search engine technology which includes an entry about metadata at http://tbray.org/ongoing/When/200x/2003/07/29/SearchMeta. See http://tbray.org/ongoing/When/200x/2003/07/30/OnSearchTOC for the TOC to the complete series on searching.