HAL: Computer System for Scalable Deep Learning

Volodymyr Kindratenko, Dawei Mu, Yan Zhan, John Maloney, Sayed Hadi Hashemi, Benjamin Rabe, Ke Xu, Roy Campbell, Jian Peng, William Gropp

PEARC '20: Practice and Experience in Advanced Research Computing, Portland, OR, USA, July 2020

Abstract

We describe the design, deployment, and operation of a computer system built to efficiently run deep learning frameworks. The system consists of 16 IBM POWER9 servers with 4 NVIDIA V100 GPUs each, interconnected with a Mellanox EDR InfiniBand fabric, and a DDN all-flash storage array. The system is tailored towards efficient execution of the IBM Watson Machine Learning enterprise software stack, which combines popular open-source deep learning frameworks. We build a custom management software stack to enable efficient use of the system by a diverse community of users, and we provide guides and recipes for running deep learning workloads at scale utilizing all available GPUs. We demonstrate scaling of PyTorch- and TensorFlow-based deep neural networks to produce state-of-the-art performance results.
