Energy-Efficient Localised Rollback Via Data Flow Analysis And Frequency Scaling

EUROMPI 2018: PROCEEDINGS OF THE 25TH EUROPEAN MPI USERS' GROUP MEETING(2018)

引用 9|浏览13
暂无评分
摘要
Exascale systems will suffer failures hourly. HPC programmers rely mostly on application-level checkpoint and a global rollback to recover. In recent years, techniques reducing the number of rolling back processes have been implemented via message logging. However, the log-based approaches have weaknesses, such as being dependent on complex modifications within an MPI implementation, and the fact that a full restart may be required in the general case. To address the limitations of all log-based mechanisms, we return to checkpoint-only mechanisms, but advocate data flow rollback (DFR), a fundamentally different approach relying on analysis of the data flow of iterative codes, and the well-known concept of data flow graphs. We demonstrate the benefits of DFR for an MPI stencil code by localising rollback, and then reduce energy consumption by 10-12% on idling nodes via frequency scaling. We also provide large-scale estimates for the energy savings of DFR compared to global rollback, which for stencil codes increase as n2 for a process count n.
更多
查看译文
关键词
Fault Tolerance, Checkpoint/Restart, MPI, Data Flow, Discrete-Event Simulator, Stencil Applications, Frequency Scaling, Energy Efficiency
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要