Distributed Computing Economics and the Semantic Web

by William Grosso

Related link: http://research.microsoft.com/~Gray/



I went to see Jim Gray speak the other night. He was the first speaker in this fall's Distinguished Speaker series at SDForum.
I liked the talk a lot. In particular, I very much enjoyed the part dealing with Distributed Computing Economics.


The argument itself is basic economic analysis, and can be boiled down to the notion that since everything costs money, you should consider the costs of everything when building applications. In particular, Gray focuses on the cost of CPU time (small, and dropping all the time) and the cost of network bandwidth (not so small, and decreasing at a slower rate). By putting actual dollar values on things, Gray is able to draw some startling conclusions about when it makes sense to use grid-computing techniques, and when it makes sense to use either a LAN-based system or a single machine (as opposed to distributing the computation over a WAN, or using "on-demand" computing).


In particular, he says the following: the break-even point is 10,000 instructions per byte of network traffic, or about a minute of computation per MB of network traffic. That is, unless the CPU time at the other end of the pipe is free, and you get at least a minute of computation for every MB of data you send to it, you're better off doing the computation locally.
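To make the arithmetic concrete, here's a minimal back-of-the-envelope sketch in Python. The dollar figures and CPU speed are my own illustrative assumptions, not Gray's published numbers, but with these assumptions the break-even lands near the 10,000 instructions per byte he quotes.

    # Back-of-the-envelope version of Gray's break-even argument.
    # All cost figures here are illustrative assumptions; plug in your own.

    WAN_DOLLARS_PER_GB = 1.0       # assumed: $1 to ship 1 GB across the WAN
    CPU_DOLLARS_PER_HOUR = 1.0     # assumed: $1 per hour of (amortized) CPU time
    INSTRUCTIONS_PER_SECOND = 3e9  # assumed: a ~3 GIPS machine

    instructions_per_dollar = (3600 / CPU_DOLLARS_PER_HOUR) * INSTRUCTIONS_PER_SECOND
    bytes_per_dollar = 1e9 / WAN_DOLLARS_PER_GB

    # Break-even: how much computation a shipped byte must save before
    # sending it beats computing locally.
    break_even = instructions_per_dollar / bytes_per_dollar
    print(f"break-even: {break_even:,.0f} instructions per byte")  # ~10,800 here

    def cheaper_to_ship(bytes_to_send: float, instructions_saved: float) -> bool:
        """True if shipping the data costs less than the CPU time it saves."""
        return bytes_to_send / bytes_per_dollar < instructions_saved / instructions_per_dollar

The decision rule at the end is the whole argument in one line: unless a megabyte shipped over the WAN buys you a sizable chunk of someone else's CPU time, keep the computation at home.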


There's an interesting flip side to this. If you're running a database that's accessible over the network, you'd much rather someone ask you to do a computation than ask you to send a large amount of data in response to a query (the economics apply when you're sending an answer as well).


Two things struck me while Gray was speaking. The first is that the analysis isn't very different from that in Gray's classic papers on the five-minute rule. But despite the fact that a Turing Award winner repeatedly uses this style of argument, I don't see it being applied very often in other areas.


The second is that I think it very much applies to the semantic web. If you'll recall, the idea of the semantic web is to create a giant distributed knowledge base, with lots of information encoded in RDF triples so that machines, as well as humans, can process the data.


Now along comes Gray, making an argument that, when you think about it, implies that the semantic web, as currently conceived, might just be all wrong. His basic point is that it's far cheaper to vend high-level APIs than to give direct access to the data (because the cost of shipping large amounts of data around is prohibitive). Since the semantic web is basically a data web, one wonders: why doesn't Gray's argument apply?
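Before getting to counterarguments, here's a minimal sketch of the two interaction styles Gray is contrasting. The URLs and the question being asked are hypothetical; the only point is what crosses the wire in each case.

    # Two ways a client could answer a question against a remote data source.
    # The URLs below are made up for illustration; only the traffic pattern matters.
    import urllib.request

    def count_matches_locally(dump_url: str, term: str) -> int:
        """The 'data web' style: pull the whole dataset over the WAN,
        then compute the answer on your own CPU."""
        with urllib.request.urlopen(dump_url) as response:
            data = response.read()            # the entire dump crosses the wire
        return data.count(term.encode())

    def count_matches_remotely(api_url: str, term: str) -> int:
        """The 'high-level API' style: ship the question, not the data;
        the provider's CPU does the work and only the answer comes back."""
        with urllib.request.urlopen(f"{api_url}?q={term}") as response:
            return int(response.read())       # a handful of bytes cross the wire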


Here are three possible counterarguments:


  1. The idea of the semantic web is that there are literally hundreds of thousands of data sources. In such a universe, the only feasible programming model is to gather data into a central location and then perform the computation (coordinating a distributed application on such a scale is simply not feasible).

  2. The point of the semantic web is that it concerns data which is inherently impossible to gather in one location. Gray's economic argument doesn't apply because it assumes that it is possible to put the application on a LAN (or to use high-level APIs) instead of fetching data over a WAN. Clearly that's not currently the case for web applications like Google, and the proposed semantic web applications (what are they, anyway?) are more Google-like than not.

  3. Gray's argument assumes infinite divisibility of computing resources. While it may be true that, once you've bought a computer, the cost of computation is cheap, you can't buy a single unit of computation-- you have to buy the entire computer and then amortize the cost. So, depending on cash flow considerations, and the amount of computing power you really need, some applications might still make sense in an on-demand model.


My point? In everything I've read about the semantic web, nobody's addressed Gray's implicit question. Have I missed a large body of papers? Is it obvious that one of the above three arguments is the "killer rejoinder" to "vend high-level APIs, not data"? Is the semantic web really about APIs (and I just missed it)? Or is there a crucial hole in the roadmap to the semantic web?




So what's the deal? What applications will the semantic web make both possible and economically feasible?


11 Comments

anonymous2
2003-09-22 21:25:35
Orthogonal Issues
The main points of the semantic web are to have a common data model (aka knowledge representation language, i.e., RDF) and an easy way for people to publish their data/knowledge. To what extent the information gets left on the server where it is published, or is cached closer to end-user clients or massively-aggregating clients... is up to economics. The existing web has a lot of (working but imperfect) technology to address data locality issues.


-- sandro@w3.org


wegrosso
2003-09-22 22:04:56
I'm not sure I understood your point
I think we agree that RDF is a data model, and the semantic web is a vision wherein everyone can publish their data (as RDF triples, or elements of an OWL knowledge base, or whatever) and the entire internet becomes an enormous distributed knowledge base.


Now, my point is that Gray's paper implies this is exactly the wrong thing to build. Or, at least, a very inefficient and costly thing to build. Because it pushes you into architectures where you collect vast amounts of data, aggregate them in centralized servers (or groups of servers), and then compute.


And your answer concedes the point (I think)-- you're talking about pushing data around till it arrives at the right location.


Gray's point is that efficiency (economics) says: don't gather data. Instead, distribute computation using the highest level APIs you can find or build.


While, technically speaking, RDF and the Semantic Web don't prevent high-level APIs, the current efforts also don't foster them. As far as I know (and I'm willing to be corrected; please give me references), the vast majority of efforts are focused on publishing data.


And that was my point. Why focus early efforts in a place where the economics are so unwieldy? I came up with three candidate arguments, none of which felt compelling.


anonymous2
2003-09-23 01:13:04
central vs. web, apis vs. data
Good article. Here are some counterarguments:


The data starts out decentralised and disconnected (I am in Australia, you are in the USA, working independently). A third party discovers that she can combine our data and repurpose it. Gray's analysis seems to assume more a priori design than this, i.e., that we can design the system for a specific purpose.


But suppose we create a central site that combines your data with mine and offers a high-level API. The economics still depend on how often we change the data and how much upload traffic that creates, versus how often people happen to want to use the combined resource under the particular assumptions of the high-level API.


Finally, the SW ontology and inference technology is ideal for connecting your independently conceived data with mine, whether we do the processing centrally or at each client's site. And REST is a good basis for a high-level API....


- Arnold deVos

anonymous2
2003-09-23 01:35:31
What are the semantic web applications anyway?
I think you hit the nail on the head with your aside asking what these semantic web applications actually are. It's very difficult to build high-level APIs for an unknown application. I guess the point of the semantic web is that it provides the basis for many thousands of applications.


One other point. The decision whether to distribute the computation or the data assumes that both are equally accurate. But in an open system like the Internet, how do you know the answer to your question is correct if you don't look at the data?

anonymous2
2003-09-23 02:55:44
Agoric Computation
There is an area of research dealing with applying market economics to computation.


From one of the papers:
"Similar considerations hold among computational objects. For small enough objects and transactions, the cost of accounting and negotiations will overwhelm any advantages that may result from making flexible, price-sensitive tradeoffs. For large enough objects and trans- actions, however, these overhead costs will be small in percentage terms; the benefits of market mechanisms may then be worth the cost. At an intermediate scale, negotiation will be too expensive, but accounting will help guide planning. These scale effects will encourage the aggregation of small, simple objects into "firms" with low-overhead rules for division of income among their participants.


Size thresholds for accounting and negotiations will vary with situations and implementation techniques [16]. Market competition will tune these thresholds, providing incentives to choose the most efficient scale on which to apply central-planning methods to computation."


http://www.agorics.com/Library/agoricpapers/aos/aos.3.html


-Andrew

anonymous2
2003-09-23 04:06:13
Querying
I think there are probably a couple of pieces "missing" in the semantic web.


The first is querying. By querying a database, RDF store, etc., you transmit only the triples that meet your criteria.
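As a sketch of the idea, using Python's rdflib and SPARQL as a modern stand-in for the RDF query languages of the day (the file name and vocabulary are made-up examples):

    # Query-style access: only the bindings that match come back,
    # not the whole model. File name and vocabulary are illustrative.
    from rdflib import Graph

    g = Graph()
    g.parse("people.rdf")  # stand-in for any RDF store or fetched document

    results = g.query("""
        PREFIX foaf: <http://xmlns.com/foaf/0.1/>
        SELECT ?person ?name
        WHERE { ?person foaf:name ?name }
    """)
    for person, name in results:
        print(person, name)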


The other is being able to access RDF models externally. You could query across different data sources such as RDF databases, SQL databases, RDF files on your local hard drive, RSS feeds off the web, etc.


Some of these data sources can reduce the amount of data returned via a query; the others need to be parsed, stored, and then queried.


Hopefully it's the small, quickly changing RDF sources that don't support querying, and not the large, quickly changing ones.

wegrosso
2003-09-23 08:29:06
Checking results
Re: the second paragraph. You have the same question, in spades, for the data. Why trust the data? Whoever put it on the web undoubtedly had a reason for wanting to share the data; assuming that they're pure of heart and simply want to expand the world's knowledge pool is problematic.


As far as computation goes... two thoughts spring to mind. The first is that SETI@Home and Popular Power had the same question-- how do you prevent cheating? They came up with two solutions: (1) occasionally send the task to multiple users and compare answers, and (2) have time and length estimates that you can use to double-check that the time and length claims made by your CPU source are accurate.
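As a sketch of solution (1), with a hypothetical worker interface (this is not SETI@Home's actual protocol):

    # Redundancy check: send the same work unit to several untrusted workers
    # and accept the answer only if a majority agree. The worker interface
    # is hypothetical.
    import random
    from collections import Counter

    def verify_by_redundancy(task, workers, replicas=3):
        chosen = random.sample(workers, replicas)        # pick workers at random
        answers = [worker(task) for worker in chosen]    # each computes independently
        answer, votes = Counter(answers).most_common(1)[0]
        if votes > replicas // 2:
            return answer                                # majority agrees
        raise RuntimeError("workers disagree; re-dispatch the task")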


The other thought is that it's often a lot easier to verify answers than to come up with them. Mathematical proofs are an obvious case of this, but so are things like travel plans (and J2ME's new class verifier works on this principle).

anonymous2
2003-09-23 13:22:26
the SemWeb doesn't really bear on this issue



Obviously there's a certain amount of distribution that's natural in the universe. Knowledge is created in some places and consumed in others, etc. You're thinking that the Semantic Web presumes that processing will only be done at the consumer end? Of course the producer may publish processed views.


You want the consumer to be able to ask the producer to process it in custom ways? That involves downloading code, right? Arguably, the best SemWeb approach to code is Horn rules, which fit naturally into query languages -- so a consumer could ask a producer for the answer to some query, where the query involves arbitrary processing. Does that give you what you want? I'm cool with that, although it seems like no one really wants to run anyone else's code.


Of course RDF Query work is still in early stages; right now almost everyone just fetches an RDF/XML file, however big it is. That is a rather bandwidth-intensive approach, of course. Some query designs do allow rules, however, so if the policies allow it, the client can ask the server to pre-process the data to make it smaller.


sandro@w3.org


anonymous2
2003-09-23 14:22:02
Checking results
I was thinking of the Google situation. It would be much more efficient for website owners to add their own metadata than for Google to slurp up the entire Web and try to work it out. I'm sure website owners would be very keen to do this -- they would, and they do, lie through their teeth.


I think your Seti & PP are particular kinds of examples where they already have the data. So it is interesting from the point of view whether it is cheaper to ship the data somewhere else perform your own computations, but doesn't really have much to do with the semantic web, which as I see it, is much more about some form of machine understanding of data so that some undetermined alaysis can be performed on it. Specifically, analysis by people who don't create or control the data.

mondo
2003-09-23 15:27:47
Mobile agents to the rescue?
One possible solution that honors the economics of computation vs. data transfer is mobile software agents. This way you could move your application logic (closer) to the source of data and compute there.


How does this relate to semantic web? Well, look at it as a big knowledge base. I think your point that it's impossible (or impractical) to concentrate it in one place is valid.

anonymous2
2003-09-23 20:02:20
Interoperation of Web Services
Why should the Semantic Web require that large data sets be transferred? In my mind the most likely model is that people (agents, whoever) send focused queries to Web Services, which return focused and small answers. Query processing takes place on the server.


Under this assumption the main question is how to establish interoperation between all these Web Services. And that may be the core question, to which RDF, OWL, etc. are (part of) the answer.


Best,
Stefan Decker