# High-Throughput Synchronous Deep RL

NeurIPS 2020


Abstract

Deep reinforcement learning (RL) is computationally demanding and requires processing of many data points. Synchronous methods enjoy training stability while having lower data throughput. In contrast, asynchronous methods achieve high throughput but suffer from stability issues and lower sample efficiency due to 'stale policies'. To com…


Introduction

- Deep reinforcement learning (RL) has been impressively successful on a wide variety of tasks, including playing video games [1, 6, 12, 13, 20, 21, 23, 24, 25, 32] and robotic control [10, 18, 22].
- Synchronous methods suffer from idle time as all actors need to finish experience collection before trainable parameters are updated.
- This is problematic when the time for an environment step varies significantly.
- Reinforcement Learning: An agent interacts with an environment, collecting rewards over discrete time, as formalized by a Markov Decision Process (MDP).
- This policy π maps the current state s_t ∈ S to a probability distribution over the action space A.
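The policy described above can be made concrete with a minimal sketch. The linear-softmax parameterization below is purely illustrative (the paper's agents use deep networks); it only shows how a policy maps a state s_t to a probability distribution over the action space A:

```python
import numpy as np

def softmax_policy(theta, state):
    """Map a state s_t to a probability distribution over the action space A."""
    logits = theta @ state          # one logit per action
    logits -= logits.max()          # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 8))     # 4 actions, 8 state features (illustrative)
s_t = rng.normal(size=8)
pi = softmax_policy(theta, s_t)     # pi(. | s_t)
a_t = rng.choice(4, p=pi)           # sample a_t ~ pi(. | s_t)
```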

Highlights

- Deep reinforcement learning (RL) has been impressively successful on a wide variety of tasks, including playing video games [1, 6, 12, 13, 20, 21, 23, 24, 25, 32] and robotic control [10, 18, 22].
- To counter the often excessive training time, RL frameworks aim for two properties: (1) a high throughput, which ensures that the framework collects data at very high rates, and (2) a high data efficiency, which ensures that the collected data is used effectively.
- Asynchronous methods trade training stability for throughput. We show that this trade-off is not necessary and propose High-Throughput Synchronous RL (HTS-RL), a technique that achieves both high throughput and high training stability.
- We show that High-Throughput Synchronous RL (HTS-RL) speeds up A2C and proximal policy optimization (PPO) on Atari environments [2, 3] and the Google Research Football environment (GFootball) [15].
- To confirm that Claim 1 holds for complex environments such as Atari and GFootball, we study the speedup ratio of HTS-RL to A2C/PPO baselines when the step time variance changes
- In environments with large step time variance, HTS-RL is more than 5× faster than the baselines.

Methods

- The authors aim for the following four features: (1) batch synchronization, which reduces actor idle time; (2) concurrent learning and rollout, which increases throughput; (3) a guaranteed lag of only one step between the behavior and target policy, which ensures training stability; (4) asynchronous interaction between actors and executors during the rollout phase, which increases throughput while ensuring determinism.
- As shown in Fig. 1(e), executors asynchronously grab actions from the action buffer, which stores actions predicted by actors as well as a pointer to the environment.
- The executors store the received state and an environment pointer within the state buffer, from which actors grab this information asynchronously.
- The actors use the grabbed states to predict the corresponding subsequent actions and send those together with the environment pointer back to the action buffer.
- GFootball scenarios evaluated: Empty goal close, Empty goal, Run to score, RSK, PSK, RPSK, 3 vs. 1 w/ keeper, Corner, Counterattack easy, Counterattack hard, and 11 vs. 11 w/ lazy opponents.
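The actor/executor/buffer protocol described above can be sketched with standard queues. This is a schematic with one actor thread and one executor thread (the paper uses multiple actors and executors with real environments); `fake_env_step` and the buffer names are illustrative stand-ins:

```python
import queue
import random
import threading

NUM_ENVS, STEPS = 4, 25
action_buf = queue.Queue()  # (env_id, action) pairs, written by actors
state_buf = queue.Queue()   # (env_id, state) pairs, written by executors

def fake_env_step(env_id, action):
    return random.random()  # stand-in for stepping a real environment

def executor():
    # Executors asynchronously grab actions (plus the environment pointer)
    # from the action buffer and apply them to that environment.
    while True:
        item = action_buf.get()
        if item is None:        # shutdown signal
            break
        env_id, action = item
        state_buf.put((env_id, fake_env_step(env_id, action)))

def actor():
    # Actors grab states asynchronously, predict the next action, and send
    # it back to the action buffer together with the environment pointer.
    for _ in range(NUM_ENVS * STEPS):
        env_id, state = state_buf.get()
        action = random.randrange(4)   # stand-in for sampling from pi(state)
        action_buf.put((env_id, action))

for env_id in range(NUM_ENVS):         # seed one initial state per environment
    state_buf.put((env_id, 0.0))
t = threading.Thread(target=executor)
t.start()
actor()
action_buf.put(None)                   # stop the executor after the rollout
t.join()
```

Because actors and executors communicate only through the two buffers, neither role ever blocks waiting for a specific peer, which is what removes idle time during rollout.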

Results

- To confirm that Claim 1 holds for complex environments such as Atari and GFootball, the authors study the speedup ratio of HTS-RL to A2C/PPO baselines when the step time variance changes.
- As shown in Fig. 4(left), in environments with small variance, HTS-RL is around 1.5× faster than the baselines.
- In environments with large step time variance, HTS-RL is more than 5× faster than the baselines.
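The variance effect reported above can be illustrated with a toy simulation (not the paper's experiment): in a fully synchronous step, every actor waits for the slowest one, so the per-step wall-clock time is the maximum over actors' step times, and the overhead grows with step-time variance. All numbers below are assumed for illustration:

```python
import random

def simulate(num_actors=16, steps=1000, jitter=0.0, seed=0):
    """Relative cost of fully synchronous stepping vs. the mean step time."""
    rng = random.Random(seed)
    wall, work = 0.0, 0.0
    for _ in range(steps):
        # Each actor's environment step takes 1 time unit plus uniform jitter.
        times = [1.0 + rng.uniform(0, jitter) for _ in range(num_actors)]
        wall += max(times)               # synchronous: wait for the slowest
        work += sum(times) / num_actors  # average per-actor step time
    return wall / work                   # >= 1; grows with the variance

low_var = simulate(jitter=0.1)   # small step-time variance
high_var = simulate(jitter=4.0)  # large step-time variance
```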

Conclusion

- The authors develop High-Throughput Synchronous RL (HTS-RL).
- It achieves a high throughput while maintaining data efficiency.
- To achieve this, HTS-RL performs batch synchronization together with concurrent rollout and learning.
- HTS-RL avoids the 'stale-policy' issue.
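The one-step-lag guarantee mentioned above can be stated as a tiny invariant sketch (an assumed scheme based on the stated guarantee, not the authors' code): rollouts at iteration k always use the parameters from iteration k−1, so staleness is fixed at exactly one update rather than unbounded as in asynchronous training:

```python
# "params" is a version counter standing in for network weights; "behavior"
# is the snapshot handed to the actors for rollout.
params = 0
behavior = params

lags = []
for k in range(5):
    rollout_version = behavior    # actors collect data with theta_{k-1} ...
    params += 1                   # ... while the learner computes theta_k
    lags.append(params - rollout_version)
    behavior = params             # swap snapshots before the next iteration
```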

Tables

- Table 1: Atari experiments (final-time metric): average evaluation rewards achieved given limited training time.
- Table 2: GFootball results (required-time metric): required time in minutes to achieve a score of 0.4 / 0.8.
- Table 3: Average game score after 8M steps of multi-agent training with raw image input on '3 vs. 1 with keeper' (final metric).
- Table 4: Varying the number of actors on '3 vs. 1 with keeper'; average score over 100 evaluation episodes.
- Table 5: Varying the synchronization interval on '3 vs. 1 with keeper'; average score over 100 evaluation episodes.

Related work

- In the following we briefly review work on asynchronous and synchronous reinforcement learning.
- Asynchronous reinforcement learning: Asynchronous advantage actor-critic (A3C) [25] is an asynchronous multi-process variant of the advantage actor-critic algorithm [30]. A3C runs on a single machine and does not employ GPUs. As illustrated in Fig. 1(a), in A3C each process is an actor-learner pair which updates the trainable parameters asynchronously. Specifically, in each process the actor collects data by interacting with the environment for a number of steps. The learner uses the collected data to compute a gradient, which is then applied to a shared model accessible by all processes. One major advantage of A3C: throughput increases almost linearly with the number of processes, as no synchronization is used. See Fig. 2(a) for an illustration of the timing.
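The A3C pattern just described can be sketched schematically. The snippet uses Python threads and a stand-in "gradient" to show the structure only (A3C itself uses CPU processes and real policy gradients); all names are illustrative:

```python
import threading
import numpy as np

# Each "process" pairs an actor (collects experience) with a learner
# (computes a gradient); gradients are applied to one shared model with
# no synchronization between workers.
shared = np.zeros(4)              # shared trainable parameters

def actor_learner(worker_id, steps):
    rng = np.random.default_rng(worker_id)
    for _ in range(steps):
        grad = rng.normal(size=4) * 0.01   # stand-in for a policy gradient
        shared[:] -= 0.1 * grad            # unsynchronized update of the shared model

workers = [threading.Thread(target=actor_learner, args=(i, 100))
           for i in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

The lack of synchronization is exactly what yields the near-linear throughput scaling noted above, and also what causes the 'stale-policy' problem: a worker's gradient may be computed against parameters that other workers have since overwritten.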

Funding

- This work is supported in part by NSF under Grants #1718221, #2008387 and MRI #1725729, NIFA award 2020-67021-32799, UIUC, Samsung, Amazon, 3M, Cisco Systems Inc. (Gift Award CG 1377144), and a Google PhD Fellowship to RY.

Study subjects and analysis

bootstrap samples: 10000

All experiments are repeated for five runs with different random seeds. The plots in Fig. 5 show the mean over the five runs and the 95% confidence interval obtained with the Facebook Bootstrapped implementation using 10,000 bootstrap samples. For the Atari experiments, we follow the conventional 'no-op' procedure: at the beginning of each evaluation episode, the agents perform up to 30 'no-op' actions.
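The protocol above (mean over five seeds with a 10,000-resample 95% confidence interval) corresponds to a standard percentile bootstrap. The sketch below is a generic implementation, not the Facebook Bootstrapped library, and the `runs` values are made up for illustration:

```python
import numpy as np

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of a few runs."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Resample the runs with replacement and record each resample's mean.
    resamples = rng.choice(scores, size=(n_boot, scores.size), replace=True)
    means = resamples.mean(axis=1)
    # The (alpha/2, 1 - alpha/2) quantiles bound the 95% interval.
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

runs = [0.71, 0.64, 0.69, 0.75, 0.66]   # hypothetical final scores, five seeds
lo_ci, hi_ci = bootstrap_ci(runs)
```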

Reference

- M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons, and J. Kautz. Reinforcement learning through asynchronous advantage actor-critic on a GPU. In Proc. ICLR, 2017.
- M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. JAIR, 2013.
- G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym, 2016.
- C. Colas, O. Sigaud, and P.-Y. Oudeyer. GEP-PG: Decoupling exploration and exploitation in deep reinforcement learning algorithms. In Proc. ICML, 2018.
- L. de Haan and A. Ferreira. Extreme Value Theory: An Introduction. Springer, 2006.
- P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, Y. Wu, and P. Zhokhov. OpenAI baselines, 2017.
- L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In Proc. ICML, 2018.
- L. Espeholt, R. Marinier, P. Stanczyk, K. Wang, and M. Michalski. SEED RL: Scalable and efficient deep-RL with accelerated central inference. arXiv, 2020.
- R. A. Fisher and L. H. C. Tippett. Limiting forms of the frequency distribution of the largest or smallest member of a sample. Math. Proc. Cambridge Philos. Soc., 1928.
- S. Gu, E. Holly, T. Lillicrap, and S. Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In Proc. ICRA, 2017.
- P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. Deep reinforcement learning that matters. In Proc. AAAI, 2017.
- U. Jain, L. Weihs, E. Kolve, M. Rastegari, S. Lazebnik, A. Farhadi, A. Schwing, and A. Kembhavi. Two body problem: Collaborative visual task completion. In Proc. CVPR, 2019.
- U. Jain∗, L. Weihs∗, E. Kolve, A. Farhadi, S. Lazebnik, A. Kembhavi, and A. G. Schwing. A Cordial Sync: Going Beyond Marginal Policies For Multi-Agent Embodied Tasks. In Proc. ECCV, 2020. ∗ equal contribution.
- I. Kostrikov. Pytorch implementations of reinforcement learning algorithms, 2018.
- K. Kurach, A. Raichuk, P. Stanczyk, M. Zajac, O. Bachem, L. Espeholt, C. Riquelme, D. Vincent, M. Michalski, O. Bousquet, and S. Gelly. Google research football: A novel reinforcement learning environment. arXiv, 2019.
- H. Küttler, N. Nardelli, T. Lavril, M. Selvatici, V. Sivakumar, T. Rocktäschel, and E. Grefenstette. TorchBeast: A PyTorch platform for distributed RL. arXiv, 2019.
- J. Langford, A. Smola, and M. Zinkevich. Slow learners are fast. arXiv preprint arXiv:0911.0491, 2009.
- S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. In Proc. ICRA, 2015.
- Y. Li, I.-J. Liu, Y. Yuan, D. Chen, A. Schwing, and J. Huang. Accelerating distributed reinforcement learning with in-switch computing. In Proc. ISCA, 2019.
- I.-J. Liu, J. Peng, and A. Schwing. Knowledge flow: Improve upon your teachers. In Proc. ICLR, 2019.
- I.-J. Liu, R. A. Yeh, and A. G. Schwing. Pic: Permutation invariant critic for multi-agent deep reinforcement learning. In Proc. CoRL, 2019.
- J. Luo, E. Solowjow, C. Wen, J. A. Ojea, A. M. Agogino, A. Tamar, and P. Abbeel. Reinforcement learning on variable impedance controller for high-precision robotic assembly. In Proc. ICRA, 2019.
- V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing atari with deep reinforcement learning. In NeurIPS Deep Learning Workshop, 2013.
- V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 2015.
- V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proc. ICML, 2016.
- J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. Trust region policy optimization. In Proc. ICML, 2015.
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv, 2017.
- D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis. Mastering the game of go without human knowledge. Nature, 2017.
- A. Stooke and P. Abbeel. rlpyt: A research code base for deep reinforcement learning in PyTorch. arXiv, 2019.
- R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 2018.
- R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Proc. NeurIPS, 2000.
- O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. P. Agapiou, M. Jaderberg, A. S. Vezhnevets, R. Leblond, T. Pohlen, V. Dalibard, D. Budden, Y. Sulsky, J. Molloy, T. L. Paine, C. Gulcehre, Z. Wang, T. Pfaff, Y. Wu, R. Ring, D. Yogatama, D. Wünsch, K. McKinney, O. Smith, T. Schaul, T. Lillicrap, K. Kavukcuoglu, D. Hassabis, C. Apps, and D. Silver. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 2019.
- Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas. Sample efficient actor-critic with experience replay. In Proc. ICLR, 2017.
- E. Wijmans, A. Kadian, A. Morcos, S. Lee, I. Essa, D. Parikh, M. Savva, and D. Batra. DD-PPO: Learning near-perfect pointgoal navigators from 2.5 billion frames. In Proc. ICLR, 2020.
- Y. Wu, E. Mansimov, S. Liao, R. Grosse, and J. Ba. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Proc. NeurIPS, 2017.
