ABSTRACT

Running MapReduce programs in the cloud introduces an important problem: how to optimize resource provisioning to minimize the financial charge or the job finish time for a specific job. An important step toward this ultimate goal is modeling the cost of a MapReduce program. In this chapter, we study the whole process of MapReduce processing and build up a cost function that explicitly models the relationship among the amount of input data, the available system resources (map and reduce slots), and the complexity of the reduce program for the target MapReduce job. The model parameters can be learned from test runs. Based on this cost model, we can solve a number of decision problems, such as finding the amount of resources that minimizes the financial cost under a job finish deadline, minimizes the finish time under a given financial budget, or gives the optimal tradeoff between time and financial cost. With appropriate modeling of the energy consumption of the resources, the optimization problems can be extended to address energy-efficient MapReduce computing. Experimental results show that the proposed modeling approach performs well on a number of tested MapReduce programs in both an in-house cluster and Amazon EC2.