LLAMP: Assessing Network Latency Tolerance of HPC Applications with Linear Programming
CoRR (2024)
Abstract
The shift towards high-bandwidth networks driven by AI workloads in data
centers and HPC clusters has unintentionally aggravated network latency,
adversely affecting the performance of communication-intensive HPC
applications. As large-scale MPI applications often exhibit significant
differences in their network latency tolerance, it is crucial to accurately
determine the extent of network latency an application can withstand without
significant performance degradation. Current approaches to assessing this
metric often rely on specialized hardware or network simulators, which can be
inflexible and time-consuming. In response, we introduce LLAMP, a novel
toolchain that offers an efficient, analytical approach to evaluating HPC
applications' network latency tolerance using the LogGPS model and linear
programming. LLAMP equips software developers and network architects with
essential insights for optimizing HPC infrastructures and strategically
deploying applications to minimize latency impacts. Through our validation on a
variety of MPI applications like MILC, LULESH, and LAMMPS, we demonstrate our
tool's high accuracy, with relative prediction errors generally below 2%.
Additionally, we include a case study of the ICON weather and climate model to
illustrate LLAMP's broad applicability in evaluating collective algorithms and
network topologies.
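To make the core idea concrete: under a LogGPS-style cost model, an application's runtime can be written as an affine function of the network latency L, and the largest L that keeps the slowdown within a budget can be found as a linear program. The sketch below is purely illustrative and is not LLAMP's actual formulation; all constants (per-message overhead, message count, compute time, 1% slowdown budget) are hypothetical, and `scipy.optimize.linprog` stands in for whatever LP solver the toolchain uses.

```python
# Illustrative sketch (not LLAMP's actual model): find the maximum
# network latency L an application tolerates before runtime exceeds
# a 1% slowdown, using a toy LogGPS-style cost and an LP via scipy.
from scipy.optimize import linprog

o = 2e-6          # per-message overhead (s), hypothetical
L0 = 1e-6         # baseline network latency (s), hypothetical
n_msgs = 50_000   # latency-sensitive messages on the critical path
T_compute = 10.0  # pure computation time (s), hypothetical

T_base = T_compute + n_msgs * (o + L0)  # baseline runtime
budget = 1.01 * T_base                  # allow at most 1% slowdown

# Maximize L subject to  T_compute + n_msgs*(o + L) <= budget
# and L >= L0.  linprog minimizes, so use objective -L.
res = linprog(
    c=[-1.0],
    A_ub=[[n_msgs]],
    b_ub=[budget - T_compute - n_msgs * o],
    bounds=[(L0, None)],
)
latency_tolerance = res.x[0]
print(f"tolerated latency: {latency_tolerance * 1e6:.2f} us")
```

With these toy numbers the LP is trivially one-dimensional, but the same structure scales to per-rank, per-dependency constraints, which is where an LP solver (rather than a closed form) earns its keep.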