Cluster fault-tolerance: An experimental evaluation of checkpointing and MapReduce through simulation
New Orleans, LA (2009)
Abstract
Traditionally, cluster computing has relied on checkpointing for fault tolerance. Recently, new models for parallel applications, namely MapReduce and Dryad, have grown in popularity; their runtime systems provide their own re-execution-based fault tolerance mechanisms, but the failure characteristics of these mechanisms have not been analyzed. Another development is the availability of failure data spanning years for systems of significant size at Los Alamos National Labs (LANL). The time between failure (TBF) for these systems is a poor fit to the exponential distribution assumed by optimization work in checkpointing, which brings those results into question. This paper describes a discrete event simulation driven by the LANL data and by models of parallel checkpointing and MapReduce tasks. The simulation allows us to evaluate the fault tolerance characteristics of these tasks, with the goal of minimizing the expected running time of a parallel program on a cluster in the presence of faults under both fault tolerance models.
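To make the idea of a trace-driven checkpointing simulation concrete, the following is a minimal sketch in Python. It is not the simulator described in the paper: the function name `simulate_checkpointed_run`, its parameters, and the simplifications (a single job, a checkpoint after every segment except the last, failures that land during a restart are ignored) are all assumptions made for illustration. The failure times stand in for a recorded failure trace, so no exponential distribution is assumed.

```python
from bisect import bisect_right


def simulate_checkpointed_run(work, interval, checkpoint_cost,
                              restart_cost, failure_times):
    """Return the wall-clock time to finish `work` units of computation.

    Hypothetical model (illustrative assumptions, not the paper's):
    - the job checkpoints after every `interval` units of work,
      except after the final segment;
    - a failure discards the segment in progress and costs `restart_cost`;
    - failures that occur during a restart are ignored.
    `failure_times` are absolute times, e.g. taken from a failure trace.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    failures = sorted(failure_times)
    now = 0.0    # current simulated time
    done = 0.0   # work safely saved by the last checkpoint
    while done < work:
        remaining = work - done
        segment = min(interval, remaining)
        is_last = segment >= remaining
        needed = segment + (0.0 if is_last else checkpoint_cost)
        # Index of the first failure strictly after the current time.
        i = bisect_right(failures, now)
        if i < len(failures) and failures[i] < now + needed:
            # Failure interrupts the segment: roll back to the last checkpoint.
            now = failures[i] + restart_cost
        else:
            now += needed
            done = work if is_last else done + segment
    return now


def best_interval(candidates, work, checkpoint_cost, restart_cost,
                  failure_times):
    """Pick the candidate interval with the smallest simulated running time."""
    return min(
        candidates,
        key=lambda iv: simulate_checkpointed_run(
            work, iv, checkpoint_cost, restart_cost, failure_times),
    )
```

Replaying the same trace with different values of `interval`, as `best_interval` does, is one simple way to search for an interval that minimizes running time for that trace. A fuller study would average over many trace offsets and would model re-execution of MapReduce tasks separately.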
Keywords
checkpointing, discrete event simulation, exponential distribution, failure analysis, fault tolerant computing, optimisation, Dryad application, LANL data, Los Alamos National Labs, MapReduce application, cluster computing, cluster fault tolerance, expected running time minimisation, experimental evaluation, failure data availability, parallel checkpointing, re-execute based fault tolerance mechanism, runtime system, time between failure