Scalable Reinforcement Learning on Cray XC

Concurrency and Computation: Practice and Experience (2020)

Abstract
Recent advancements in deep learning have made reinforcement learning (RL) applicable to a much broader range of decision-making problems. However, the emergence of reinforcement learning workloads brings multiple challenges to system resource management. RL applications continuously train a deep learning or machine learning model while interacting with uncertain simulation models. This new generation of AI applications imposes significant demands on system resources such as memory, storage, network, and compute. In this paper, we describe a typical RL application workflow and introduce the Ray distributed execution framework developed at the UC Berkeley RISELab. Ray includes the RLlib library for executing distributed reinforcement learning applications. We describe a recipe for deploying the Ray execution framework on Cray XC systems and demonstrate scaling of RLlib algorithms across multiple nodes of the system. We also explore performance characteristics across multiple CPU and GPU node types.
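The workflow the abstract describes, parallel workers interacting with simulators while a driver aggregates their experience for model updates, can be sketched in plain Python. This is an illustrative sketch only, not Ray's or RLlib's actual API: the toy environment, the random stand-in policy, and the thread pool are all assumptions made for a self-contained example, where a framework like Ray would instead distribute worker actors across cluster nodes.

```python
# Sketch of the rollout/learn pattern that distributed RL frameworks
# parallelize. NOT Ray's API: environment, policy, and pool are toy
# stand-ins; real systems place workers as actors on remote nodes.
from concurrent.futures import ThreadPoolExecutor
import random

def rollout(seed, horizon=50):
    """Run one episode in a toy 1-D environment; return transitions."""
    rng = random.Random(seed)
    state, batch = 0.0, []
    for _ in range(horizon):
        action = rng.choice([-1, 1])      # random policy stand-in
        next_state = state + 0.1 * action
        reward = -abs(next_state)         # reward: stay near zero
        batch.append((state, action, reward, next_state))
        state = next_state
    return batch

def train(num_workers=4, iterations=3):
    """Driver loop: gather experience in parallel, then update the model."""
    total_transitions = 0
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for _ in range(iterations):
            batches = pool.map(rollout, range(num_workers))
            experience = [t for b in batches for t in b]
            total_transitions += len(experience)  # model update would go here
    return total_transitions

if __name__ == "__main__":
    print(train())  # num_workers * horizon * iterations transitions
```

The design choice the sketch highlights is the one RL workloads stress: simulation (rollouts) is throughput-bound and embarrassingly parallel, while the learner step is a synchronization point, which is why frameworks like Ray separate the two roles.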
Keywords
deep learning, high-performance computing, reinforcement learning, scaling