GreedW: A Flexible and Efficient Decentralized Framework for Distributed Machine Learning

IEEE TRANSACTIONS ON COMPUTERS (2024)

Abstract
With the ever-increasing demand for computing power in deep learning, distributed training techniques have proven effective in meeting this demand. However, existing state-of-the-art distributed training frameworks, such as Parameter Server (PS), Ring-All-Reduce, and their variants, still face significant challenges. In particular, communication bottlenecks can severely limit the efficiency and scalability of distributed training, making it difficult to fully exploit the computing power of large-scale clusters, especially under dynamic and ever-changing network conditions. To address these issues and further maximize cluster utilization, this paper proposes an efficient and dynamic distributed training framework named GreedW. GreedW greatly improves the training efficiency of workers by dynamically constructing an adaptive, customized communication network and adaptively scheduling the workload. Specifically, GreedW employs a greedy strategy to construct, in each iteration, a communication tree with minimum communication cost for gradient transmission, and applies a heterogeneity-aware workload allocation scheme that balances heavy traffic across heterogeneous workers according to the available computing capability of each node, effectively alleviating the network bottleneck. Notably, GreedW dynamically adjusts the job assigned to each worker based on its completion time in each round of model aggregation, so that all workers finish their assignments at roughly the same time; this mitigates the intractable straggler problem and minimizes idle waiting time. Comprehensive experimental evaluations on three training models of different scales (i.e., Mnist-2NN, Mnist-CNN, and TextCNN) for image recognition and natural language processing tasks demonstrate that GreedW outperforms existing state-of-the-art frameworks in training efficiency, system flexibility, and robustness.
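The abstract describes two scheduling ideas: greedily growing a minimum-cost communication tree over the workers each iteration, and rebalancing per-worker workloads from measured completion times. Below is a minimal Python sketch of one plausible reading of those ideas; it uses a Prim-style greedy spanning tree over pairwise link costs and a throughput-proportional reassignment. All names here (greedy_aggregation_tree, rebalance_workload, the costs matrix) are illustrative assumptions, not the authors' actual GreedW algorithm or API.

```python
# Hypothetical sketch of the two scheduling ideas the abstract describes.
import heapq

def greedy_aggregation_tree(costs):
    """Greedily grow a low-cost spanning tree over the workers (Prim-style):
    at each step attach the cheapest not-yet-reached node.
    costs[i][j] is the measured link cost between workers i and j."""
    n = len(costs)
    in_tree = {0}                                    # root the tree at worker 0
    edges = []                                       # chosen (parent, child) links
    frontier = [(costs[0][j], 0, j) for j in range(1, n)]
    heapq.heapify(frontier)
    while len(in_tree) < n:
        c, u, v = heapq.heappop(frontier)
        if v in in_tree:
            continue
        in_tree.add(v)
        edges.append((u, v))
        for w in range(n):
            if w not in in_tree:
                heapq.heappush(frontier, (costs[v][w], v, w))
    return edges

def rebalance_workload(loads, times):
    """Reassign per-worker sample counts so slower workers (larger completion
    times last round) receive proportionally less work, keeping the total
    workload constant, so all workers finish at roughly the same time."""
    speeds = [l / t for l, t in zip(loads, times)]   # effective throughput
    total, share = sum(loads), sum(l / t for l, t in zip(loads, times))
    return [round(total * s / share) for s in speeds]

if __name__ == "__main__":
    costs = [[0, 4, 1], [4, 0, 2], [1, 2, 0]]        # toy 3-worker link costs
    print(greedy_aggregation_tree(costs))            # -> [(0, 2), (2, 1)]
    print(rebalance_workload([100, 100, 100], [2.0, 1.0, 1.0]))  # -> [60, 120, 120]
```

In this toy run, the worker that took twice as long with the same load is judged half as fast and is handed 60 of the 300 total samples next round, while the two faster workers take 120 each; the paper's actual allocation and tree-construction rules may differ.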
Keywords
Training, Costs, Computational modeling, Task analysis, Servers, Adaptation models, Pipelines, Distributed machine learning, heterogeneous cluster, workload allocation