Lessons Learned Implementing User-Level Failure Mitigation in MPICH

2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (2015)

Abstract
User-level failure mitigation (ULFM) is becoming the front-running solution for process fault tolerance in MPI. Although not yet adopted into the MPI standard, it is already used by applications and libraries, and the MPI Forum is considering it for future inclusion in MPI itself. In this paper, we introduce an implementation of ULFM in MPICH, a high-performance and widely portable implementation of the MPI standard. We demonstrate that, although this is still a reference implementation, the runtime cost of the newly introduced API calls is relatively low.
Keywords
ULFM, fault tolerance, MPI, MPICH