HSP: Hybrid Synchronous Parallelism for Fast Distributed Deep Learning

ICPP 2022

Abstract
In parameter-server-based distributed deep learning systems, workers communicate with the parameter server simultaneously to refine the model parameters, which easily results in severe network contention. To address this problem, the Asynchronous Parallel (ASP) strategy lets each worker update the parameters independently, without synchronization. However, because the parameters become inconsistent across workers, ASP suffers from accuracy loss and slow convergence. In this paper, we propose Hybrid Synchronous Parallelism (HSP), which mitigates communication contention without excessively degrading convergence speed. Specifically, the parameter server pulls gradients from the workers sequentially to eliminate network congestion and synchronizes all up-to-date parameters after each iteration. Meanwhile, HSP cautiously lets idle workers compute with out-of-date weights to maximize the utilization of computing resources. We provide a theoretical analysis of convergence efficiency and implement HSP on a popular deep learning (DL) framework. The test results show that HSP improves the convergence speed of three classical deep learning models by up to 67%.
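The coordination pattern the abstract describes (the server pulls gradients from workers one at a time, and idle workers keep computing on stale weights between pulls) can be illustrated with a minimal single-process sketch. The sketch below is an assumption-laden toy, not the paper's implementation: the names SequentialParameterServer and worker_gradient are hypothetical, and a least-squares problem stands in for a deep model.

import numpy as np

# Single-process sketch of the HSP idea (hypothetical names, not the paper's code).
class SequentialParameterServer:
    """Pulls gradients from workers one at a time, so the link is never
    contended, and hands back up-to-date weights after each pull."""

    def __init__(self, init_weights, lr=0.05):
        self.weights = init_weights.copy()
        self.lr = lr

    def pull_and_apply(self, gradient):
        # Gradients arrive sequentially (one worker per pull), so there is
        # no simultaneous push from all workers and hence no congestion.
        self.weights -= self.lr * gradient
        return self.weights.copy()  # fresh weights for the pulled worker

def worker_gradient(weights, X, y):
    # Toy gradient of the least-squares loss 0.5 * ||X w - y||^2 / n on one shard.
    return X.T @ (X @ weights - y) / len(y)

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
shards = []
for _ in range(4):  # four workers, each holding its own data shard
    X = rng.normal(size=(32, 2))
    shards.append((X, X @ true_w + 0.01 * rng.normal(size=32)))

ps = SequentialParameterServer(np.zeros(2))
local_w = [ps.weights.copy() for _ in shards]  # each worker's (possibly stale) copy

for step in range(50):
    for wid, (X, y) in enumerate(shards):
        # Between pulls a worker keeps computing with out-of-date weights;
        # its copy is refreshed only when the server pulls its gradient.
        grad = worker_gradient(local_w[wid], X, y)
        local_w[wid] = ps.pull_and_apply(grad)

print("estimated weights:", ps.weights)  # approaches [2, -1]

Even though each worker computes on weights that are up to one round stale, the sequentialized updates still converge in this toy setting; the trade-off between that staleness and the removal of contention is what the paper's convergence analysis addresses.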
Keywords
distributed system, deep learning, parameter server