Modeling Application Resilience In Large-Scale Parallel Execution

Kai Wu, Wenqian Dong, Qiang Guan, Nathan DeBardeleben, Dong Li

PROCEEDINGS OF THE 47TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING (2018)

Abstract

Understanding how resilient an application is to hardware and software errors is critical in high-performance computing. Application-level fault injection is the most common method for evaluating application resilience. However, it can be very expensive when the application runs in parallel at large scale, because fault injection demands substantial hardware resources. In this paper, we introduce a new methodology for evaluating the resilience of applications running at large scale. Instead of injecting errors into the application during large-scale execution, we inject errors during small-scale and serial execution, then use the results to model and predict the fault injection result for large-scale execution. Our models are based on a series of empirical observations that characterize error occurrence and propagation across MPI processes in small-scale execution (including serial execution) and in large-scale execution. Our models achieve high prediction accuracy. Evaluating with four NAS parallel benchmarks and two proxy scientific applications, we demonstrate that when the fault injection result is used to predict for 64 MPI processes, the average prediction error is 8%. When the fault injection result is used to make the same prediction for eight MPI processes, the average prediction error decreases to 7%.
