Physics-Based Checksums for Silent-Error Detection in PDE Solvers.

Euro-Par Workshops (2019)

Abstract
We discuss techniques for efficient local detection of silent data corruption in parallel scientific computations, leveraging physical quantities such as momentum and energy that may be conserved by discretized PDEs. The conserved quantities are analogous to "algorithm-based fault tolerance" checksums for linear algebra but, due to their physical foundation, are applicable to both linear and nonlinear equations and have efficient local updates based on fluxes between subdomains. These physics-based checksums enable precise intermittent detection of errors and recovery by rollback to a checkpoint, with very low overhead when errors are rare. We present applications to both explicit hyperbolic and iterative elliptic (unstructured finite-element) solvers with injected memory bit flips.
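
The flux-based checksum idea can be illustrated with a small sketch. This is not the authors' implementation; it assumes a 1D periodic upwind finite-volume scheme for linear advection, and the names `simulate` and `flip_bit`, the Gaussian initial pulse, the tolerance, and the check interval are all hypothetical choices made for illustration. Each subdomain keeps a running checksum of its conserved quantity (the integral of u), updated only from the fluxes through its two boundary faces. The checksum is compared against the actual sum every few steps, and a mismatch triggers rollback to the last verified checkpoint.

```python
import numpy as np

def flip_bit(u, index, bit):
    """Flip one bit of the float64 stored at u[index], in place (simulated memory fault)."""
    u.view(np.uint64)[index] ^= np.uint64(1) << np.uint64(bit)

def simulate(n_steps=40, fault=None, n_cells=64, n_sub=4, check_every=5, tol=1e-10):
    """Upwind finite-volume advection u_t + c u_x = 0 (c > 0) on a periodic grid.

    fault = (step, cell, bit) injects a single bit flip before that step.
    Returns (final u, list of (check step, flagged subdomains)).
    """
    c, dx = 1.0, 1.0 / n_cells
    dt = 0.5 * dx / c                       # CFL number 0.5
    size = n_cells // n_sub                 # cells per subdomain
    x = (np.arange(n_cells) + 0.5) * dx
    u = np.exp(-100.0 * (x - 0.5) ** 2)     # assumed initial condition

    # Per-subdomain checksum: the conserved quantity sum(u) * dx.
    checksum = u.reshape(n_sub, size).sum(axis=1) * dx
    saved = (0, u.copy(), checksum.copy())  # last verified checkpoint
    detections = []
    k = 0
    while k < n_steps:
        if fault is not None and fault[0] == k:
            flip_bit(u, fault[1], fault[2])
            fault = None                    # inject once; the replay is clean

        flux = c * u                        # flux through each cell's right face
        u = u - (dt / dx) * (flux - np.roll(flux, 1))

        # Local checksum update: only the fluxes at subdomain boundaries matter.
        out = flux[size - 1::size]          # outflow at each subdomain's right face
        checksum = checksum - dt * (out - np.roll(out, 1))
        k += 1

        if k % check_every == 0 or k == n_steps:
            actual = u.reshape(n_sub, size).sum(axis=1) * dx
            ok = np.abs(actual - checksum) <= tol   # a NaN compares False, so it is flagged
            if ok.all():
                saved = (k, u.copy(), checksum.copy())
            else:
                detections.append((k, np.flatnonzero(~ok).tolist()))
                k, u, checksum = saved[0], saved[1].copy(), saved[2].copy()
    return u, detections

clean, clean_log = simulate()
recovered, fault_log = simulate(fault=(7, 33, 51))   # flip the top mantissa bit of cell 33
print(clean_log, fault_log, np.allclose(clean, recovered))
```

Because the checksum is advanced from boundary fluxes rather than recomputed from the state, a corrupted cell changes the actual sum of its own subdomain without changing that subdomain's checksum. Neighboring subdomains stay consistent even if the corrupted value later advects into them, so the mismatch is confined to the subdomain where the flip occurred. Checking only every `check_every` steps keeps the cost low when errors are rare, at the price of replaying a few steps after a detection.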
Keywords
Silent errors, Partial differential equations, Linear algebra, Algorithm-based fault tolerance, Checkpoint/restart