FINJ: A Fault Injection Tool for HPC Systems

Lecture Notes in Computer Science(2018)

引用 16|浏览83
暂无评分
摘要
We present FINJ, a high-level fault injection tool for High-Performance Computing (HPC) systems, with a focus on the management of complex experiments. FINJ provides support for custom workloads and allows generation of anomalous conditions through the use of fault-triggering executable programs. FINJ can also be integrated seamlessly with most other lower-level fault injection tools, allowing users to create and monitor a variety of highly-complex and diverse fault conditions in HPC systems that would be difficult to recreate in practice. FINJ is suitable for experiments involving many, potentially interacting nodes, making it a very versatile design and evaluation tool.
更多
查看译文
关键词
Exascale systems,Resiliency,Fault detection,Monitoring,Benchmarking,Open-source
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要