Deep Learning for Supercomputers: Distributed Tensor Layouts Define Distributed Computation

Neural Information Processing Systems (2018)

Abstract
Data-parallelism is the dominant distributed DNN training strategy, due to its universal applicability across a wide range of model and hardware architectures. However, memory constraints prevent its application to training very large models, which have been shown in many domains to produce superior results. Model-parallelism can solve this problem, while also reducing step times during training and inference. Unfortunately, model-parallel algorithms tend to be complicated to discover, describe, and implement, and do not generalize well across model types and hardware types. We solve this problem by introducing a language for simply specifying distributed tensor computations (model-parallel and/or data-parallel) across an n-dimensional mesh of processors by specifying the distributed storage layouts (split and/or replicated) of the tensors. The computation is then compiled into processor-local operations, coupled with collective communication primitives such as Allreduce. Using our new language, we demonstrate very short specifications of a variety of data-parallel and/or model-parallel DNN training algorithms on both a two-layer example model and the Transformer (Vaswani et al., 2017) sequence-to-sequence model. This allows us to train Transformer models with up to 5 billion parameters on up to 256-node clusters, surpassing state-of-the-art results on the WMT14 En-Fr and En-De translation tasks, as well as the one-billion-word language modeling benchmark.
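To make the layout-driven idea concrete, here is a minimal sketch of the same concept expressed with JAX's named-mesh sharding rather than the paper's own language: the axis names "data" and "model", the mesh shape, and the layer function are illustrative assumptions, not the authors' API. Choosing how each tensor dimension maps onto the processor mesh is what selects data-parallel, model-parallel, or hybrid execution; the partitioner then emits processor-local matmuls plus the required collectives (e.g. Allreduce) automatically.

```python
# Sketch only: illustrates layout-defined parallelism with JAX sharding,
# not the paper's specification language.
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Hypothetical 2-D mesh over the available devices: one axis for data
# parallelism, one for model parallelism (sizes depend on the host).
devices = np.array(jax.devices()).reshape(-1, 1)
mesh = Mesh(devices, axis_names=("data", "model"))

batch, d_in, d_hidden = 128, 512, 2048
x = jnp.ones((batch, d_in))
w = jnp.ones((d_in, d_hidden))

# Layouts: split the batch dimension over the "data" axis and the hidden
# dimension over the "model" axis; d_in stays replicated on both axes.
x_sharded = jax.device_put(x, NamedSharding(mesh, P("data", None)))
w_sharded = jax.device_put(w, NamedSharding(mesh, P(None, "model")))

@jax.jit
def layer(x, w):
    # A plain einsum; the compiler lowers it to processor-local matmuls
    # plus whatever collective communication the chosen layouts require.
    return jnp.einsum("bi,ih->bh", x, w)

y = layer(x_sharded, w_sharded)  # result is laid out over ("data", "model")
```

Changing only the two `PartitionSpec`s switches the same computation between pure data-parallelism, pure model-parallelism, or a hybrid, which mirrors the abstract's claim that distributed tensor layouts define the distributed computation.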