AccTFM: An Effective Intra-Layer Model Parallelization Strategy for Training Large-Scale Transformer-Based Models

IEEE Transactions on Parallel and Distributed Systems (2022)

Abstract
Transformer-based deep neural networks have recently swept the field of natural language processing due to their outstanding performance, and are gradually spreading to other applications such as image and video processing. However, compared with general DNNs, training a sizeable transformer-based model is even more time-consuming and memory-hungry. Existing distributed training strategies for general DNNs are either inappropriate for, or unable to efficiently handle, transformer-based networks. In view of this, we propose an intra-layer model parallelization optimization strategy, AccTFM, which introduces a novel fine-grained pipeline execution and a hybrid communication compression strategy to overcome the synchronization bottleneck. Specifically, on the one hand, it first decouples the inter-layer computation and communication dependencies, and then searches for the optimal partitioning strategy to maximize the overlap of computation and communication. On the other hand, the hybrid communication compression module consists of token-level top-$k$ sparsification and piecewise quantization methods aimed at minimizing communication traffic. Experimental results show that AccTFM accelerates transformer-based DNN training by up to 2.08x compared to state-of-the-art distributed training techniques.
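To make the two compression steps named in the abstract concrete, the following PyTorch sketch illustrates generic token-level top-$k$ sparsification followed by piecewise (segment-wise) uniform quantization of the retained values. It is a minimal illustration, not the paper's implementation: the function names, the 5% keep ratio, the four quantization pieces, and the 8-bit code width are assumptions chosen for the example.

import torch

def token_topk_sparsify(acts: torch.Tensor, k_ratio: float = 0.05):
    # Keep the k largest-magnitude entries of each token's hidden vector.
    # acts: (num_tokens, hidden) tensor to be communicated.
    k = max(1, int(acts.size(-1) * k_ratio))
    _, idx = torch.topk(acts.abs(), k, dim=-1)
    vals = torch.gather(acts, -1, idx)
    return vals, idx                      # retained values and their column indices

def piecewise_quantize(vals: torch.Tensor, num_pieces: int = 4, bits: int = 8):
    # Sort the retained values, split them into equal pieces, and quantize
    # each piece uniformly with its own (min, scale) pair.
    flat = vals.flatten()
    order = torch.argsort(flat)
    pieces = torch.chunk(flat[order], num_pieces)
    levels = 2 ** bits - 1
    codes, params = [], []
    for piece in pieces:
        lo, hi = piece.min(), piece.max()
        scale = torch.clamp(hi - lo, min=1e-12) / levels
        codes.append(torch.round((piece - lo) / scale).to(torch.uint8))
        params.append((lo, scale))
    return codes, params, order

def piecewise_dequantize(codes, params, order, shape):
    # Invert the quantization and restore the original value order.
    rebuilt = torch.cat([c.float() * s + lo for c, (lo, s) in zip(codes, params)])
    out = torch.empty_like(rebuilt)
    out[order] = rebuilt                  # undo the sort
    return out.view(shape)

# Example: compress per-token activations/gradients before a collective operation.
acts = torch.randn(128, 768)              # (tokens, hidden)
vals, idx = token_topk_sparsify(acts)
codes, params, order = piecewise_quantize(vals)
approx_vals = piecewise_dequantize(codes, params, order, vals.shape)

Only the uint8 codes, the per-piece (min, scale) pairs, and the top-$k$ indices would need to be transmitted, which is the source of the traffic reduction the abstract describes.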
Keywords
Communication hiding, deep learning, intra-layer model parallelization, quantization, top-$k$ sparsification