Detecting Scale-Induced Overflow Bugs in Production HPC Codes.

Justs Zarins, Michèle Weiland, Paul Bartholomew, Leigh Lapworth, Mark Parsons

ISC Workshops (2022)

Abstract

Scaling bugs, errors that manifest only in large-scale simulations (large in number of parallel workers or in input size), are critical to detect early in the testing of HPC codes. If missed, these bugs can cause applications to crash at runtime during production runs or, even worse, to continue silently and corrupt results. Either outcome wastes vast amounts of resources, and a crash might not provide any useful debugging information. Laguna et al. [13] presented a method for addressing this problem: scale variables are traced statically throughout an application and potentially overflowing instructions are detected, with further refinement from a few small-scale experiments. However, their algorithm cannot trace several code patterns found in production HPC applications, such as code modularity, and it has not been applied to Fortran applications. We present an extension to their algorithm that addresses these issues, enabling us to find scaling bugs in complex real applications where they could not be found before. The key features that enable this are backward/forward tracing and optimistic GEP (LLVM getelementptr) comparison.
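
To make the notion of a scaling bug concrete, the following minimal sketch shows an index computation that is correct at small scale and overflows a 32-bit signed integer once the number of workers grows. It is an illustration only, not the detection algorithm from the paper or from [13]. The function names, the per-rank element count, and the use of Python to emulate 32-bit wraparound are all assumptions made for this example.

```python
# Illustrative sketch (hypothetical names): a global array offset computed as
# rank * local_n, stored in a 32-bit signed integer as is common in C or
# Fortran codes. Python integers do not overflow, so wraparound is emulated.

INT32_MAX = 2**31 - 1


def wrap_i32(value: int) -> int:
    """Emulate two's-complement wraparound of a 32-bit signed integer."""
    return (value + 2**31) % 2**32 - 2**31


def global_offset_i32(rank: int, local_n: int) -> int:
    """Offset of a rank's first element, as a 32-bit variable would hold it."""
    return wrap_i32(rank * local_n)


def first_overflowing_rank(local_n: int) -> int:
    """Smallest rank whose true offset no longer fits in 32 bits."""
    return INT32_MAX // local_n + 1


if __name__ == "__main__":
    local_n = 1_000_000  # assumed number of elements owned by each rank
    for ranks in (1024, 4096):
        last = ranks - 1
        print(ranks, "ranks: offset of last rank =",
              global_offset_i32(last, local_n))
    print("first overflowing rank:", first_overflowing_rank(local_n))
```

With these assumed values, a test run on 1024 ranks never sees a wrong offset, while a run on 4096 ranks yields a negative offset for the highest ranks. This is the kind of defect that small-scale testing misses and that a static trace of scale variables aims to flag.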

Keywords

Scaling bugs, Correctness, LLVM
