TensorFlow at Scale: Performance and productivity analysis of distributed training with Horovod, MLSL, and Cray PE ML

Concurrency and Computation: Practice and Experience (2019)

Abstract
Deep learning has proven to be a successful tool for solving a large variety of problems in various scientific fields and beyond. In recent years, both the models and the available datasets have grown larger and more complex, so an increasing amount of computing resources is required to train these models in a reasonable amount of time. Besides being able to use HPC resources, deep learning model developers want flexible frameworks that allow for rapid prototyping. One of the most important of these frameworks is Google TensorFlow, which provides both features, i.e., good performance as well as flexibility. In this paper, we discuss different solutions for scaling the TensorFlow framework to thousands of nodes on contemporary Cray XC supercomputing systems.
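
To make the data-parallel approach discussed in the paper concrete, the following is a minimal sketch of distributed TensorFlow training with Horovod, one of the three solutions compared. It is illustrative only: the model, dataset, and hyperparameters are placeholders and are not taken from the paper. The sketch shows the standard Horovod pattern of initializing one process per worker, scaling the learning rate by the worker count, wrapping the optimizer so gradients are averaged via allreduce, and broadcasting the initial weights from rank 0.

    # Minimal data-parallel training sketch with Horovod on TensorFlow/Keras.
    # Placeholder model and data; not the configuration used in the paper.
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()  # one Horovod process per worker, launched via mpirun/horovodrun

    # Pin each process to its local GPU, if any are visible.
    gpus = tf.config.list_physical_devices('GPU')
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x_train = x_train[..., None].astype('float32') / 255.0

    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])

    # Scale the learning rate by the number of workers, then wrap the
    # optimizer so gradients are averaged across all processes by allreduce.
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
    model.compile(loss='sparse_categorical_crossentropy', optimizer=opt)

    callbacks = [
        # Broadcast initial variables from rank 0 so all workers start
        # from identical weights.
        hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    ]
    model.fit(x_train, y_train, batch_size=64, epochs=1,
              callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)

On an HPC system such as a Cray XC, a script like this would typically be launched with one process per node or per device, e.g. "horovodrun -np 4 python train.py" or the site's MPI launcher; MLSL and Cray PE ML plug into TensorFlow at a similar point but through different communication backends.
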
Keywords
deep learning, performance, scalability