Heartbeating Speculative Execution Re-execution DETECTION HANDLING RECOVERY Fault Tolerance in MapReduce

semanticscholar(2016)

引用 0|浏览1
暂无评分
摘要
Omission failures represent an important source of problems in data-intensive computing systems. In these frameworks, omission failures are caused by slow tasks, known as stragglers, which can strongly jeopardize the workload performance. In the case of MapReduce-based systems, many state-of-the-art approaches have preferred to explore and extend speculative execution mechanisms. Other alternatives have based their contributions in doubling the computing resources for their tasks. Nevertheless, none of these approaches has addressed a fundamental aspect related to the detection and further solving of the omission failures, that is, the timeout service adjustment. In this paper, we have studied the omission failures in MapReduce systems, formalizing their failure detector abstraction by means of three different algorithms for defining the timeout. The first abstraction, called High Relax Failure Detector (HR-FD), acts as a static alternative to the default timeout, which is able to estimate the completion time for the user workload. The second abstraction, called Medium Relax Failure Detector (MR-FD), dynamically modifies the timeout, according to the progress score of each workload. Finally, taking into account that some of the user requests are strictly deadline-bounded, we have introduced the third abstraction, called Low Relax Failure Detector (LR-FD), which is able to merge the MapReduce dynamic timeout with an external monitoring system, in order to enforce more accurate failure detections. Whereas HR-FD shows performance improvements for most of the user request (in particular, small workloads), MR-FD and LR-FD enhance significantly the current timeout selection, for any kind of scenario, regardless of the workload type and failure injection time.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要