Parameterizable benchmarking framework for designing a MapReduce performance model
Concurrency and Computation: Practice & Experience (2014)
Abstract
In MapReduce environments, many applications must meet different performance goals to produce time-relevant results. A typical user question is how to estimate the completion time of a MapReduce program as a function of varying input dataset sizes and given cluster resources. In this work, we offer a novel performance evaluation framework for answering this question. We analyze the MapReduce processing pipeline and utilize the fact that the execution of map and reduce tasks consists of specific, well-defined data processing phases. Only the map and reduce functions are custom, and their executions are user-defined for different MapReduce jobs. The executions of the remaining phases are generic, i.e., defined by the MapReduce framework code, and depend on the amount of data processed by the phase and the performance of the underlying Hadoop cluster. First, we design a set of parameterizable microbenchmarks to profile the execution of the generic phases and to derive a platform performance model of a given Hadoop cluster. Then, using the job's past executions, we summarize the job's properties and the performance of its custom map/reduce functions in a compact job profile. Finally, by combining the knowledge of the job profile and the derived platform performance model, we introduce a MapReduce performance model that estimates the program completion time for processing a new dataset. The proposed benchmarking approach derives an accurate performance model of Hadoop's generic execution phases once, and this model is then reused for predicting the performance of different applications. The evaluation study justifies our approach and the proposed framework: we use a diverse suite of 12 MapReduce applications to validate the proposed model. The predicted completion times for most experiments are within 10% of the measured ones, with a worst-case error of 17% on our 66-node Hadoop cluster. Copyright © 2014 John Wiley & Sons, Ltd.
Keywords
MapReduce processing pipeline,Hadoop cluster,benchmarking,job profiling,performance modeling
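The three steps described in the abstract (a platform model derived from microbenchmarks, a compact job profile, and their combination into a completion-time estimate) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the phase names (`read`, `spill`, `shuffle`, `write`), the linear form of each phase model, the `JobProfile` fields, and the simple wave-based completion-time formula.

```python
import math
from dataclasses import dataclass


def fit_line(points):
    """Least-squares fit of secs = a + b * mb from (mb, secs) microbenchmark samples."""
    xs, ys = zip(*points)
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def build_platform_model(measurements):
    """Fit one (intercept, slope) pair per generic phase; phase names are hypothetical."""
    return {phase: fit_line(points) for phase, points in measurements.items()}


@dataclass
class JobProfile:
    """Hypothetical job profile summarizing the custom map/reduce functions."""
    map_secs_per_mb: float      # cost of the user-defined map function
    reduce_secs_per_mb: float   # cost of the user-defined reduce function
    map_selectivity: float      # map output size / map input size
    reduce_selectivity: float   # reduce output size / reduce input size


def phase_time(model, phase, mb):
    """Predicted duration of a generic phase processing `mb` megabytes."""
    intercept, slope = model[phase]
    return intercept + slope * mb


def estimate_completion(model, profile, input_mb, split_mb,
                        n_reduces, map_slots, reduce_slots):
    """Assumed wave-based estimate: tasks run in waves limited by available slots."""
    n_maps = math.ceil(input_mb / split_mb)
    map_out_mb = split_mb * profile.map_selectivity
    map_task = (phase_time(model, "read", split_mb)
                + profile.map_secs_per_mb * split_mb
                + phase_time(model, "spill", map_out_mb))

    reduce_in_mb = input_mb * profile.map_selectivity / n_reduces
    reduce_out_mb = reduce_in_mb * profile.reduce_selectivity
    reduce_task = (phase_time(model, "shuffle", reduce_in_mb)
                   + profile.reduce_secs_per_mb * reduce_in_mb
                   + phase_time(model, "write", reduce_out_mb))

    map_waves = math.ceil(n_maps / map_slots)
    reduce_waves = math.ceil(n_reduces / reduce_slots)
    return map_waves * map_task + reduce_waves * reduce_task


if __name__ == "__main__":
    # Synthetic microbenchmark samples (mb, secs); made-up values, not measured data.
    samples = {
        "read": [(64, 0.7), (128, 1.3), (256, 2.6)],
        "spill": [(32, 0.6), (64, 1.3), (128, 2.5)],
        "shuffle": [(64, 2.0), (128, 3.9), (256, 7.8)],
        "write": [(64, 2.5), (128, 5.1), (256, 10.0)],
    }
    platform = build_platform_model(samples)
    profile = JobProfile(0.05, 0.06, 0.5, 1.0)
    print(estimate_completion(platform, profile, 10_000, 128, 20, 64, 32))
```

The point of the split is that `build_platform_model` is run once per cluster, while only the small `JobProfile` changes from one application to the next.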