Deep Learning for Supercomputers: Distributed Tensor Layouts Define Distributed Computation

Neural Information Processing Systems (2018)

Abstract
Data-parallelism is the dominant distributed DNN training strategy, due to its universal applicability across a wide range of model and hardware architectures. However, memory constraints prevent its application to training very large models, which have been shown in many domains to produce superior results. Model-parallelism can solve this problem, while also reducing step times during training and inference. Unfortunately, model-parallel algorithms tend to be complicated to discover, describe, and implement, and do not generalize well across model types and hardware types. We solve this problem by introducing a language for simply specifying distributed tensor computations (model-parallel and/or data-parallel) across an n-dimensional mesh of processors by specifying the distributed storage layouts (split and/or replicated) of the tensors. The computation is then compiled into processor-local operations, coupled with collective communication primitives such as Allreduce. Using our new language, we demonstrate very short specifications of a variety of data-parallel and/or model-parallel DNN training algorithms on both a two-layer example model and the Transformer (Vaswani et al., 2017) sequence-to-sequence model. This allows us to train Transformer models with up to 5 billion parameters on up to 256-node clusters, surpassing state-of-the-art results on the WMT14 En-Fr and En-De translation tasks, as well as the one-billion-word language modeling benchmark.
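To make the layout-driven idea concrete, here is a minimal sketch of the same concept expressed with JAX's named-mesh sharding rather than the paper's own language: the axis names "data" and "model", the mesh shape, and the layer function are illustrative assumptions, not the authors' API. Choosing how each tensor dimension maps onto the processor mesh is what selects data-parallel, model-parallel, or hybrid execution; the partitioner then emits processor-local matmuls plus the required collectives (e.g. Allreduce) automatically.

```python
# Sketch only: illustrates layout-defined parallelism with JAX sharding,
# not the paper's specification language.
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Hypothetical 2-D mesh over the available devices: one axis for data
# parallelism, one for model parallelism (sizes depend on the host).
devices = np.array(jax.devices()).reshape(-1, 1)
mesh = Mesh(devices, axis_names=("data", "model"))

batch, d_in, d_hidden = 128, 512, 2048
x = jnp.ones((batch, d_in))
w = jnp.ones((d_in, d_hidden))

# Layouts: split the batch dimension over the "data" axis and the hidden
# dimension over the "model" axis; d_in stays replicated on both axes.
x_sharded = jax.device_put(x, NamedSharding(mesh, P("data", None)))
w_sharded = jax.device_put(w, NamedSharding(mesh, P(None, "model")))

@jax.jit
def layer(x, w):
    # A plain einsum; the compiler lowers it to processor-local matmuls
    # plus whatever collective communication the chosen layouts require.
    return jnp.einsum("bi,ih->bh", x, w)

y = layer(x_sharded, w_sharded)  # result is laid out over ("data", "model")
```

Changing only the two `PartitionSpec`s switches the same computation between pure data-parallelism, pure model-parallelism, or a hybrid, which mirrors the abstract's claim that distributed tensor layouts define the distributed computation.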