Anthropomorphic diagnosis of runtime hidden behaviors in OpenMP multi-threaded applications

Weidong Wang, Dian Li, Wangda Luo, Yujian Kang, Liqiang Wang

J. Parallel Distributed Comput. (2023)

Extreme-scale computing involves hundreds of millions of threads with multi-level parallelism running on large-scale hierarchical and heterogeneous hardware. Some OpenMP multi-threaded applications increasingly suffer from runtime hidden behaviors owing to shared resource contention as well as software- and hardware-related problems. Such hidden behaviors can result in failures and inefficiencies and are among the main challenges in system resiliency. To minimize their impact, one must quickly and accurately detect and diagnose the hidden behaviors that cause the failures. However, it is difficult to identify hidden behaviors in the dynamic and noisy data collected by OpenMP multi-threaded monitoring infrastructures. This paper presents an anthropomorphic diagnosis framework for hidden behaviors of OpenMP multi-threaded applications. In the framework, we first design injected heartbeat functions for OpenMP multi-threaded applications. Then, we leverage the heartbeat sequences to extract features of hidden behaviors. Finally, we develop a feature learning-based algorithm using heartbeat analysis, namely HSA, to diagnose hidden behaviors. We evaluate the framework on the NAS Parallel Benchmarks (NPB), the EPCC OpenMP micro-benchmark suite, and a Jacobi benchmark. The experimental results demonstrate that our framework successfully identifies 90.3% of the injected hidden behaviors of OpenMP multi-threaded applications while incurring low overhead. © 2023 Elsevier Inc. All rights reserved.
High performance computing, OpenMP, Machine learning, Heartbeat, Hidden behaviors
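
The abstract describes collecting heartbeat sequences from threads and extracting features from them. As a rough illustration of that general idea only, the Python sketch below turns per-thread heartbeat timestamps into inter-beat gaps, computes a few simple features, and flags gaps that are much longer than the median. This is not the paper's HSA algorithm: the function names, the chosen features, and the median-ratio threshold are all hypothetical stand-ins, and the timestamps are made-up sample values rather than measured data.

```python
from statistics import median


def intervals(timestamps):
    """Return the gaps between consecutive heartbeat timestamps."""
    if len(timestamps) < 2:
        raise ValueError("need at least two heartbeats to form an interval")
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]


def extract_features(timestamps):
    """Summarize a heartbeat sequence (hypothetical feature set, not from the paper)."""
    gaps = intervals(timestamps)
    return {
        "count": len(gaps),
        "median_gap": median(gaps),
        "max_gap": max(gaps),
    }


def flag_anomalies(timestamps, ratio=3.0):
    """Return indices of gaps longer than `ratio` times the median gap.

    The ratio test is an assumed, simple stand-in for a learned detector.
    """
    gaps = intervals(timestamps)
    typical = median(gaps)
    return [i for i, gap in enumerate(gaps) if gap > ratio * typical]


# Made-up heartbeat timestamps (seconds) for two threads; thread 1 stalls once.
heartbeats = {
    0: [0.0, 1.0, 2.0, 3.0, 4.0],
    1: [0.0, 1.0, 2.0, 7.0, 8.0],
}

for thread_id, beats in heartbeats.items():
    print(thread_id, extract_features(beats), flag_anomalies(beats))
```

In this toy input, thread 1 has one gap of 5.0 seconds against a median of 1.0, so only that gap is flagged; thread 0 has none. A real monitor would likely need a more robust detector, since the abstract notes that the collected data is dynamic and noisy.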