Improving Scalability of Silent-Error Resilience for Message-Passing Solvers via Local Recovery and Asynchrony

2020 IEEE/ACM 10th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) (2020)

Abstract
Benefits of local recovery (restarting only a failed process or task) have been previously demonstrated in parallel solvers. Local recovery has a reduced impact on application performance due to masking of failure delays (for message-passing codes) or dynamic load balancing (for asynchronous many-task codes). In this paper, we implement MPI-process-local checkpointing and recovery of data (as an extension of the Fenix library) in combination with an existing method for local detection of silent errors in partial-differential-equation solvers, to show a path for incorporating lightweight silent-error resilience. In addition, we demonstrate how asynchrony introduced by maximizing computation-communication overlap can halt the propagation of delays. For a prototype stencil solver (including an iterative-solver-like variant) with injected memory bit flips, results show greatly reduced overhead under weak scaling compared to global recovery, and high failure-masking efficiency. The approach is expected to be generalizable to other MPI-based solvers.
Keywords
message-passing solvers, local recovery, parallel solvers, message-passing codes, many-task codes, local detection, silent errors, partial-differential-equation solvers, lightweight silent-error resilience, prototype stencil solver, iterative-solver-like variant, global recovery, MPI-based solvers