Evaluating the Resilience of Parallel Applications

2018 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)(2018)

引用 4|浏览88
暂无评分
摘要
Reliability is a significant design constraint for supercomputers and large-scale data centers. Modeling the effects of faults on applications targeted to such systems allows system architects and software designers to provision resilience features, that improve fidelity of results and reduce runtimes. In this paper, we propose mechanisms to improve existing techniques to model the effect of transient faults on realistic applications. First, we extend the existing Program Vulnerability Factor metric to model multi-threaded applications. Then we demonstrate how to measure the multi-threaded PVF of an application in simulation and introduce the ability to account for software detection of hardware faults, differentiating faults that cause detected, uncorrected errors (DUE) from faults that cause silent data corruption (SDC).
更多
查看译文
关键词
parallel applications,supercomputers,large-scale data centers,transient faults,multithreaded applications,multithreaded PVF,software detection,hardware faults,silent data corruption,program vulnerability factor
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要