Analysis of Checkpoint I/O Behavior

International Conference on Computational Science (2020)

Abstract
Checkpointing has gained relevance as scientific applications grow more complex and use many resources over long periods of time. Fault tolerance strategies must therefore account not only for the impact that the application itself has on the HPC system, but also for the impact of the checkpoint. A checkpoint saves information about the application and the system in stable storage so that the application can be restored if necessary. Because a checkpoint behaves like an I/O-intensive application, its storage demands can have a great impact on the application. This paper therefore presents an analysis of the checkpoint's I/O behavior. The number of checkpoints performed in an application is often tied to the maximum overhead that the user is willing to introduce. If we know the maximum overhead the user is willing to pay and the overhead that a checkpoint introduces, we can calculate the number of checkpoints to perform. This overhead depends significantly on the I/O operations. The PIOM-PX tool was used to analyze the spatial and temporal I/O patterns of the checkpoint. Based on this analysis, a model was designed to predict their behavior. This information is used to calculate the number of checkpoints to perform in an application given a maximum overhead predefined by the user. The results help us understand what happens when a checkpoint is created in an HPC system, so that decisions can be adapted to the user's requirements.
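The relationship between the user's overhead budget and the number of checkpoints can be illustrated with a minimal sketch. It is not the model from the paper: it assumes the budget is expressed as a fraction of the application's fault-free runtime and that every checkpoint costs the same fixed time. The function name `max_checkpoints` and all of its parameters are hypothetical, introduced only for illustration.

```python
import math


def max_checkpoints(app_runtime_s: float,
                    max_overhead_fraction: float,
                    checkpoint_cost_s: float) -> int:
    """Return how many checkpoints fit within an overhead budget.

    Assumptions (not taken from the paper):
      * the budget is a fraction of the fault-free runtime,
      * each checkpoint adds the same fixed cost in seconds.
    """
    if app_runtime_s <= 0 or checkpoint_cost_s <= 0:
        raise ValueError("runtime and checkpoint cost must be positive")
    if max_overhead_fraction < 0:
        raise ValueError("overhead fraction must be non-negative")

    # Total time the user is willing to spend on checkpointing.
    budget_s = app_runtime_s * max_overhead_fraction

    # Only whole checkpoints count, so round down.
    return math.floor(budget_s / checkpoint_cost_s)


# Example: a 1-hour run, a 25% overhead budget, 60 s per checkpoint.
n = max_checkpoints(3600, 0.25, 60)
print(n)
```

In practice the per-checkpoint cost would come from a measurement or from an I/O behavior model rather than a constant, since the abstract notes that this overhead depends significantly on the I/O operations.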
Keywords
Checkpoint, Fault tolerance, I/O behavior, PIOM-PX