Exascale fault tolerance challenge and approaches

2018 IEEE International Reliability Physics Symposium (IRPS), 2018

Abstract
A geometrically increasing transistor count combined with a stagnant faults-per-transistor profile makes it challenging to deliver a minimum acceptable user experience on an Exascale-capable supercomputer. This situation moves fault-tolerant design priorities from the background to the foreground: supercomputer fault tolerance must be a first-class design concern for Exascale systems and beyond. Myriad solutions exist, and they can touch every level of the system: transistor, circuit, micro-architecture, architecture, OS/driver/library, and application. Supporting this global effort requires tools and methodologies precise enough to enable trade-offs against, and optimizations around, power and performance, with accuracy that holds at least 5 years into the future.
Keywords
Exascale, supercomputer, fault tolerance, resiliency, reliability, availability