Replication-Based Fault-Tolerance For Large-Scale Graph Processing

DSN (2018)

Abstract
Growing algorithmic complexity and dataset sizes necessitate the use of networked machines for many graph-parallel algorithms, and the growing number of machines in turn makes fault tolerance a must. Unfortunately, existing large-scale graph-parallel systems usually adopt a distributed checkpoint mechanism for fault tolerance, which incurs not only notable performance overhead but also lengthy recovery time. This paper observes that the vertex replicas created for distributed graph computation can be naturally extended for fast in-memory recovery of graph states. It describes Imitator, a new fault-tolerance mechanism that maintains vertex states cheaply by copying them to their replicas during normal message exchanges, and that provides fast in-memory reconstruction of failed vertices from replicas on other machines. Imitator has been implemented on Cyclops with edge-cut and PowerLyra with vertex-cut. Evaluation on a 50-node EC2-like cluster shows that Imitator incurs an average performance overhead of 1.37 percent for Cyclops and 2.32 percent for PowerLyra (ranging from -0.6 to 3.7 percent), and can recover from failures of more than one million vertices in less than 3.4 seconds.
Keywords
Graph-parallel system, fault-tolerance, replication
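
The replication idea summarized in the abstract can be illustrated with a toy sketch. This is not Imitator's implementation: the `Machine` class, the `master_of` and `replica_sites` maps, and the `sync_replicas` and `recover` functions are all hypothetical names invented for illustration, and the sketch assumes every vertex has at least one replica on a surviving machine.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    """Hypothetical worker holding master vertices and mirrored replicas."""
    masters: dict = field(default_factory=dict)   # vertex id -> state
    replicas: dict = field(default_factory=dict)  # vertex id -> mirrored state


def sync_replicas(machines, replica_sites):
    """Copy each master's state to its replicas, as a stand-in for
    piggybacking the state on the normal master-to-replica messages."""
    for machine in machines.values():
        for vid, state in machine.masters.items():
            for site in replica_sites[vid]:
                machines[site].replicas[vid] = state


def recover(machines, failed, master_of, replica_sites, standby):
    """Rebuild the failed machine's master vertices on a standby machine
    from replicas that survive elsewhere. Replicas that lived on the
    failed machine are not rebuilt in this simplified sketch."""
    del machines[failed]
    machines[standby] = Machine()
    recovered = 0
    for vid, owner in list(master_of.items()):
        if owner != failed:
            continue
        survivors = [s for s in replica_sites[vid]
                     if s in machines and s != standby]
        if not survivors:
            raise RuntimeError(f"no surviving replica for vertex {vid!r}")
        # Read the mirrored state straight from memory, no checkpoint needed.
        machines[standby].masters[vid] = machines[survivors[0]].replicas[vid]
        master_of[vid] = standby
        recovered += 1
    return recovered


if __name__ == "__main__":
    machines = {0: Machine(masters={"a": 1.0}), 1: Machine(masters={"b": 2.0})}
    master_of = {"a": 0, "b": 1}
    replica_sites = {"a": [1], "b": [0]}
    sync_replicas(machines, replica_sites)
    print(recover(machines, 0, master_of, replica_sites, standby=2))
```

In this sketch recovery reads only in-memory replica state, which is the contrast the abstract draws with checkpoint-based recovery; a real system would also have to handle vertices with no replica and restore the replicas lost with the failed machine.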