A Directive-Based Approach to Perform Persistent Checkpoint/Restart

2017 International Conference on High Performance Computing & Simulation (HPCS), 2017

Abstract
Exascale platforms require resilience support because of their growing component counts and the associated error rates. In this paper, we present a new directive-based approach to perform application-level checkpoint/restart in a simplified and portable way. We propose a solution based on compiler directives, similar to OpenMP, that allows users to easily specify the application state that has to be saved and restored. This leaves the tedious and error-prone serialization and deserialization to our library, which relies on SCR/FTI to perform scalable and efficient I/O operations. Our results, based on several benchmarks and two large applications, show no additional overhead compared to the direct use of the FTI and SCR checkpoint/restart libraries. In addition, our portable approach significantly improves programmability, reducing the number of code lines required to perform checkpoint/restart by an average of ≈82% and ≈94% for FTI and SCR, respectively.
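To make concrete what "application-level checkpoint/restart" involves, the following is a minimal hand-written sketch in Python of the save and restore logic that such directives are meant to hide. It is not the paper's library or directive syntax (neither is given in this abstract), and it does not use FTI or SCR. The function names, the `fail_at` parameter, and the use of `pickle` with a local file are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Serialize `state` to `path` atomically: write a temp file, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace swaps the file in one step, so a crash never leaves
    # a half-written checkpoint under the final name.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Sum the step indices 0..total_steps-1, checkpointing periodically.

    `fail_at` is a hypothetical knob that simulates a crash at a given step.
    """
    # Restart: resume from the last checkpoint if one is present.
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

Even in this toy case, the user has to decide which variables form the state, serialize them, and wire the restore path into the start of the loop. This is the kind of bookkeeping the abstract describes as tedious and error-prone.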
Keywords
checkpoint/restart, resiliency, fault tolerance, exascale, programmability, programming models