Enabling Distributed and Optimal RDMA Resource Sharing in Large-Scale Data Center Networks: Modeling, Analysis, and Implementation

IEEE/ACM Transactions on Networking (2023)

Abstract
Remote Direct Memory Access (RDMA) suffers from unfairness and performance degradation when multiple applications share RDMA network resources. Hence, an efficient resource scheduling mechanism is urgently needed to optimally allocate RDMA resources among applications. However, traditional Network Utility Maximization (NUM) based solutions are inadequate for RDMA due to three challenges: 1) the standard NUM-oriented algorithm cannot deal with the coupling variables introduced by multiple dependent RDMA operations; 2) the stringent constraint on RDMA on-board resources complicates the standard NUM problem by adding extra optimization dimensions; 3) naively applying traditional NUM algorithms suffers from scalability issues when solving a large-scale RDMA resource scheduling problem. In this paper, we show how to optimally share RDMA resources in large-scale data center networks in a distributed manner. First, we propose Distributed RDMA NUM (DRUM) to model the RDMA resource scheduling problem as a new variation of the NUM problem. Second, we present distributed algorithms that efficiently solve the large-scale, interdependent RDMA resource sharing problem for different RDMA use cases. Theoretical analysis guarantees the convergence and parallelism of the proposed algorithms. Finally, we implement the algorithms as a kernel-level indirection module in a real-world RDMA environment, so as to provide end-to-end resource sharing and performance guarantees. Through extensive evaluations with large-scale simulations and testbed experiments, we show that our method significantly improves applications' performance under resource contention, achieving $1.7$-$3.1\times$ higher throughput; in a dynamic context, the largest performance improvements reach 98.1% and 64.1% in terms of latency and throughput, respectively.
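For readers unfamiliar with the "standard NUM" baseline that the abstract contrasts against, the following is a minimal sketch of the textbook dual-decomposition algorithm for NUM with weighted logarithmic utilities. It is not DRUM and does not model RDMA operations or on-board resources; the function name `solve_num`, the route encoding, the step size, and the iteration count are all illustrative assumptions.

```python
# Textbook NUM via dual decomposition (illustrative baseline, NOT the paper's DRUM).
# Problem: maximize sum_i w_i * log(x_i)
#          subject to, for every link l, sum of x_i over flows using l <= c_l.

def solve_num(routes, capacity, weights, step=0.01, iters=5000):
    """routes[i] lists the link indices used by flow i (hypothetical encoding)."""
    n_links = len(capacity)
    prices = [1.0] * n_links          # one dual variable (price) per link
    rates = [0.0] * len(routes)
    for _ in range(iters):
        # Source update: each flow maximizes w_i*log(x_i) - x_i * path_price,
        # which gives x_i = w_i / path_price. Uses only its own path's prices.
        for i, path in enumerate(routes):
            path_price = sum(prices[l] for l in path)
            rates[i] = weights[i] / max(path_price, 1e-9)  # avoid division by zero
        # Link update: projected gradient step on each price using only local load.
        load = [0.0] * n_links
        for i, path in enumerate(routes):
            for l in path:
                load[l] += rates[i]
        for l in range(n_links):
            prices[l] = max(0.0, prices[l] + step * (load[l] - capacity[l]))
    return rates, prices


if __name__ == "__main__":
    # Two unit-capacity links; flow 0 crosses both, flows 1 and 2 use one each.
    rates, prices = solve_num([[0, 1], [0], [1]], [1.0, 1.0], [1.0, 1.0, 1.0])
    print([round(r, 3) for r in rates], [round(p, 3) for p in prices])
```

On this small topology the proportionally fair optimum is known analytically (rate 1/3 for the two-link flow, 2/3 for each single-link flow), so the sketch can be checked against it. Each flow and each link updates using only local information, which is the property that makes the baseline distributed; the abstract's point is that this simple decomposition does not directly handle coupled RDMA operations or on-board resource constraints.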
Keywords
optimal rdma resource sharing,distributed,large-scale