Understanding Node Allocation on Leadership-Class Supercomputers with Graph Analytics.

Andy Trinh, Shivam Sheth, Anil Gaihre, Caiwen Ding, Jieyang Chen, Feiyi Wang, David Pugmire, Scott Klasky, Hang Liu, Lipeng Wan

IEEE International Conference on Smart City (2023)

Abstract

As the scale of modern high-performance computing (HPC) systems keeps growing, job scheduling on those systems becomes extremely challenging. In particular, one of the important tasks a job scheduler must fulfill is optimizing node allocation to improve the jobs' execution efficiency. To optimize node allocation, the job scheduling strategy must take the network topology of the HPC system into consideration. However, existing approaches are either designed for specific network topologies (and thus lack generality) or rely on the applications' communication patterns (which are unknown until the application has run on the HPC system). In this paper, we propose a generic topology-aware node allocation strategy based on graph algorithms. Our strategy can reduce intra-job communication overhead and inter-job communication interference by selecting nodes that form a sub-graph with a much smaller diameter. We also propose and study four different initialization strategies for our node allocation algorithm to understand how they affect the node allocation results and speed. We evaluate the proposed methods using 30 days of real job traces collected from the OLCF's Titan supercomputer. Compared to the native job scheduling strategy used on Titan, adopting our approach achieves a 2.5× diameter reduction on average, and for certain jobs the diameter reduction can be up to 8×.
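The idea of choosing nodes that form a small-diameter sub-graph can be illustrated with a minimal Python sketch. This is not the paper's algorithm: it assumes a hypothetical 12-node ring topology, a simple "take the k free nodes closest to a seed" heuristic, and a diameter measured as the largest hop distance between allocated nodes in the full topology. All function names are illustrative.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distance from `source` to every reachable node of the topology."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def allocate_near_seed(adj, free, seed, k):
    """Pick the k free nodes closest to `seed` (illustrative heuristic only)."""
    dist = bfs_distances(adj, seed)
    # Sort by hop distance, breaking ties by node id for determinism.
    candidates = sorted((n for n in free if n in dist), key=lambda n: (dist[n], n))
    if len(candidates) < k:
        raise ValueError("not enough reachable free nodes")
    return candidates[:k]


def allocation_diameter(adj, nodes):
    """Largest hop distance between any two allocated nodes.

    Assumption: distances are measured in the full topology, so shortest
    paths may pass through nodes outside the allocation.
    """
    worst = 0
    for u in nodes:
        dist = bfs_distances(adj, u)
        worst = max(worst, max(dist[v] for v in nodes))
    return worst


def ring(n):
    """Hypothetical toy topology: n nodes connected in a ring."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


topology = ring(12)
free_nodes = set(topology)

scattered = [0, 3, 6, 9]  # a spread-out allocation, for comparison
compact = allocate_near_seed(topology, free_nodes, seed=0, k=4)

print(allocation_diameter(topology, scattered))
print(compact, allocation_diameter(topology, compact))
```

In this toy setting the choice of `seed` plays a role loosely analogous to an initialization strategy: different seeds can yield allocations with different diameters when only part of the machine is free.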

Keywords

HPC, job scheduling, topology-aware allocation, graph analytics
