Training Google Neural Machine Translation on an Intel CPU Cluster

2019 IEEE International Conference on Cluster Computing (CLUSTER), 2019

Abstract
Google's neural machine translation (GNMT) is a state-of-the-art recurrent neural network (RNN/LSTM) based language translation application. It is computationally more demanding than well-studied convolutional neural networks (CNNs). Also, in contrast to CNNs, RNNs heavily mix compute-bound and memory-bound layers, which requires careful tuning on a latency-optimized machine to make optimal use of fast on-die memories for the best single-processor performance. Additionally, due to the massive compute demand, it is essential to distribute the entire workload among several processors and even compute nodes. To the best of our knowledge, this is the first work that attempts to scale this application on an Intel CPU cluster. Our CPU-based GNMT optimization, the first of its kind, achieves this through the following steps: (i) we choose a monolithic long short-term memory (LSTM) cell implementation from the LIBXSMM library (specifically tuned for CPUs) and integrate it into TensorFlow, (ii) we modify the GNMT code to use a fused time-step LSTM op for the encoding stage, (iii) we combine the Horovod and Intel MLSL scaling libraries for improved performance on multiple nodes, and (iv) we extend the bucketing logic that groups sentences of similar length to multiple nodes, achieving load balance across ranks. In summary, we demonstrate that, owing to these changes, we outperform Google's stock CPU-based GNMT implementation by ~2x on a single node and potentially enable more than 25x speedup on a 16-node CPU cluster.
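Steps (iii) and (iv) lend themselves to a short illustration. The sketch below is a minimal, hypothetical example rather than the paper's actual implementation: it assumes Horovod is available, and the function bucket_batches, the bucket_width parameter, and the round-robin split across ranks are illustrative assumptions about how sentences of similar length could be grouped and then distributed so that every rank processes batches of comparable compute cost.

# Hypothetical sketch of length-based bucketing extended across ranks (assumes Horovod).
import horovod.tensorflow as hvd
import numpy as np

hvd.init()

def bucket_batches(sentence_lengths, bucket_width=10, batch_size=128):
    """Group sentence indices into buckets of similar length, then split each
    bucket's batches round-robin across ranks so every rank sees batches of
    comparable length (and hence comparable compute cost)."""
    buckets = {}
    for idx, length in enumerate(sentence_lengths):
        buckets.setdefault(length // bucket_width, []).append(idx)

    local_batches = []
    for _, indices in sorted(buckets.items()):
        batches = [indices[i:i + batch_size]
                   for i in range(0, len(indices), batch_size)]
        # Round-robin assignment keeps per-rank work balanced within each bucket.
        local_batches.extend(batches[hvd.rank()::hvd.size()])
    return local_batches

# Example: 100k sentences with random lengths between 5 and 80 tokens.
lengths = np.random.randint(5, 80, size=100_000)
my_batches = bucket_batches(lengths)
print(f"rank {hvd.rank()}: {len(my_batches)} batches")

Note that this only shows the data-side load balancing; in the paper the multi-node gradient aggregation itself is handled by the combination of Horovod and Intel MLSL.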
Keywords
machine translation, recurrent neural networks, TensorFlow, LIBXSMM, Intel architecture