Optimizing MPI Collectives on Shared Memory Multi-Cores

SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (2023)

Abstract
Message Passing Interface (MPI) programs often suffer performance slowdowns in collective communication operations such as broadcast and reduction. As modern CPUs integrate more processor cores, it is increasingly common to run multiple MPI processes on a shared-memory machine to exploit hardware parallelism. In this context, optimizing MPI collective communication for shared-memory execution is crucial. However, existing MPI collective implementations on shared-memory systems have two primary drawbacks: extensive redundant data movement when performing reduction collectives, and ineffective use of non-temporal instructions for processing streamed data. To address these limitations, this paper proposes two optimization techniques that minimize data movement and improve the use of non-temporal instructions. We evaluated our techniques by integrating them into the OpenMPI library and measuring their performance with micro-benchmarks and real-world applications on two multi-core clusters. Experimental results show that our approach significantly outperforms existing techniques, yielding a 1.2--6.4x performance improvement.
Keywords
MPI, Collective Communication, Memory Access, Optimization