High-Throughput Synchronous Deep RL

NeurIPS 2020


Abstract

Deep reinforcement learning (RL) is computationally demanding and requires processing of many data points. Synchronous methods enjoy training stability while having lower data throughput. In contrast, asynchronous methods achieve high throughput but suffer from stability issues and lower sample efficiency due to `stale policies.' To combine the advantages of both, we propose High-Throughput Synchronous RL (HTS-RL), which achieves high throughput while maintaining training stability and data efficiency.

Introduction
  • Deep reinforcement learning (RL) has been impressively successful on a wide variety of tasks, including playing video games [1, 6, 12, 13, 20, 21, 23, 24, 25, 32] and robotic control [10, 18, 22].
  • Synchronous methods suffer from idle time as all actors need to finish experience collection before trainable parameters are updated.
  • This is problematic when the time for an environment step varies significantly.
  • Reinforcement Learning: An agent interacts with an environment, collecting rewards over discrete time, as formalized by a Markov Decision Process (MDP).
  • The policy π maps the current state s_t ∈ S to a probability distribution over the action space A; a minimal sketch of this standard formalization follows below.
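For reference, here is a minimal LaTeX sketch of the standard discounted-MDP formalization referenced above. The notation follows common usage (e.g., Sutton and Barto [30]) and is not copied from the paper's own definitions:

```latex
% Assumes \usepackage{amsmath, amssymb}.
% An MDP is a tuple (S, A, P, r, \gamma); the policy maps states to action distributions:
\[
  a_t \sim \pi_\theta(\cdot \mid s_t), \qquad s_t \in S, \; a_t \in A .
\]
% The agent collects rewards over discrete time and maximizes the expected discounted return:
\[
  J(\theta) \;=\; \mathbb{E}_{\pi_\theta}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \right],
  \qquad \gamma \in [0, 1).
\]
```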
Highlights
  • Deep reinforcement learning (RL) has been impressively successful on a wide variety of tasks, including playing video games [1, 6, 12, 13, 20, 21, 23, 24, 25, 32] and robotic control [10, 18, 22].
  • To counter the often excessive training time, RL frameworks aim for two properties: (1) high throughput, which ensures that the framework collects data at very high rates, and (2) high sample efficiency, so that few interactions suffice to learn a good policy.
  • Asynchronous methods trade training stability for throughput. We show that this trade-off is not necessary and propose High-Throughput Synchronous RL (HTS-RL), a technique which achieves both high throughput and high training stability.
  • We show that HTS-RL speeds up A2C and proximal policy optimization (PPO) on Atari environments [2, 3] and the Google Research Football environment (GFootball) [15].
  • To confirm that Claim 1 holds for complex environments such as Atari and GFootball, we study the speedup ratio of HTS-RL to A2C/PPO baselines when the step time variance changes
  • In environments with large step time variance, HTS-RL is more than 5× faster than the baselines; a toy simulation of the underlying idle-time effect follows below.
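To make the idle-time argument behind these speedups concrete, here is a small, self-contained Python simulation. It is not from the paper; the Gaussian step-time model, actor count, and all numbers are assumptions chosen only to illustrate how synchronous waiting (every actor waits for the slowest environment step) grows with step-time variance:

```python
import numpy as np

def synchronous_overhead(num_actors=16, num_steps=1000, mean=1.0, std=0.0, seed=0):
    """Ratio of synchronous wall time (wait for the slowest actor at every step)
    to the ideal wall time (no waiting), for i.i.d. Gaussian step times."""
    rng = np.random.default_rng(seed)
    # step_times[t, i] = time actor i needs for environment step t (clipped to stay positive).
    step_times = np.clip(rng.normal(mean, std, size=(num_steps, num_actors)), 0.01, None)
    synchronous = step_times.max(axis=1).sum()   # every step costs as much as the slowest actor
    ideal = step_times.mean(axis=1).sum()        # average cost if nobody had to wait
    return synchronous / ideal

for std in (0.0, 0.1, 0.5, 1.0):
    print(f"step-time std {std:.1f}: sync/ideal wall-time ratio "
          f"{synchronous_overhead(std=std):.2f}")
```

With zero variance the ratio is 1.0 by construction; as the variance grows, a fully synchronous scheme spends an increasing fraction of its wall time waiting for the slowest actor, which is the gap HTS-RL targets.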
Methods
  • The authors aim for the following four features: (1) batch synchronization which reduces actor idle time, (2) learning and rollout take place concurrently which increases throughput, (3) guaranteed lag of only one step between the behavior and target policy which ensures stability of training, (4) asynchronous interaction between actors and executors at the rollout phase to increase throughput while ensuring determinism.
  • As shown in Fig. 1(e), executors asynchronously grab actions from the action buffer, which stores actions predicted by actors as well as a pointer to the environment.
  • The executors store the received state and an environment pointer within the state buffer, from which actors grab this information asynchronously.
  • The actors use the grabbed states to predict the corresponding subsequent actions and send those, together with the environment pointer, back to the action buffer; a minimal sketch of this buffer protocol appears after this list.
  • The evaluated GFootball scenarios include: Empty goal close, Empty goal, Run to score, RSK, PSK, RPSK, 3 vs. 1 with keeper, Corner, Counterattack easy, Counterattack hard, and 11 vs. 11 with lazy opponents.
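The action-buffer/state-buffer hand-off described above can be sketched with standard multiprocessing queues. This is a minimal illustration under assumptions: one actor and one executor (the paper uses several of each, plus batch synchronization and a concurrent learner), a dummy counter environment, and a random stand-in policy; names such as `action_buffer`, `state_buffer`, and `env_id` are hypothetical.

```python
import multiprocessing as mp
import random

NUM_ENVS, NUM_STEPS = 4, 20   # toy sizes (assumed)

def executor(action_buffer, state_buffer):
    """Executor: asynchronously grabs (env_id, action) pairs from the action buffer,
    steps the (dummy) environment, and stores (env_id, next_state) in the state buffer."""
    envs = {i: 0 for i in range(NUM_ENVS)}        # dummy environments: the state is a counter
    while True:
        item = action_buffer.get()
        if item is None:                          # sentinel: rollout finished
            break
        env_id, action = item
        envs[env_id] += action                    # stand-in for env.step(action)
        state_buffer.put((env_id, envs[env_id]))

def actor(action_buffer, state_buffer):
    """Actor: grabs (env_id, state) pairs from the state buffer, predicts the next
    action (random here), and sends it back together with the environment pointer."""
    for _ in range(NUM_ENVS * NUM_STEPS):
        env_id, state = state_buffer.get()
        action = random.choice([0, 1])            # stand-in for policy(state)
        action_buffer.put((env_id, action))
    action_buffer.put(None)                       # tell the executor to stop

if __name__ == "__main__":
    action_buffer, state_buffer = mp.Queue(), mp.Queue()
    for env_id in range(NUM_ENVS):                # seed the pipeline with initial states
        state_buffer.put((env_id, 0))
    workers = [mp.Process(target=executor, args=(action_buffer, state_buffer)),
               mp.Process(target=actor, args=(action_buffer, state_buffer))]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print("rollout finished")
```

Because each item carries an environment pointer, the executor and actor never need to run in lock-step on a particular environment, which is what keeps the rollout phase asynchronous yet deterministic in the described design.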
Results
  • To confirm that Claim 1 holds for complex environments such as Atari and GFootball, the authors study the speedup ratio of HTS-RL to A2C/PPO baselines when the step time variance changes.
  • As shown in Fig. 4(left), in environments with small variance, HTS-RL is around 1.5× faster than the baselines.
  • In environments with large step time variance, HTS-RL is more than 5× faster than the baselines.
Conclusion
  • The authors develop High-Throughput Synchronous RL (HTS-RL).
  • It achieves a high throughput while maintaining data efficiency.
  • To achieve this, HTS-RL performs batch synchronization together with concurrent rollout and learning.
  • HTS-RL avoids the ‘stale-policy’ issue of asynchronous methods.
Tables
  • Table 1: Atari experiments, final-time metric: average evaluation reward achieved within a limited training-time budget.
  • Table 2: GFootball results, required-time metric: time in minutes needed to reach a given score (time to reach score 0.4 / time to reach score 0.8).
  • Table 3: Average game score after 8M steps of multi-agent training with raw image input on ‘3 vs. 1 with keeper’ (final metric).
  • Table 4: Effect of the number of actors on ‘3 vs. 1 with keeper’; average score over 100 evaluation episodes.
  • Table 5: Effect of the synchronization interval on ‘3 vs. 1 with keeper’; average score over 100 evaluation episodes.
Related work
  • In the following we briefly review work on asynchronous and synchronous reinforcement learning. Asynchronous reinforcement learning: Asynchronous advantage actor-critic (A3C) [25] is an asynchronous multi-process variant of the advantage actor-critic algorithm [30]. A3C runs on a single machine and does not employ GPUs. As illustrated in Fig. 1(a), in A3C, each process is an actor-learner pair which updates the trainable parameters asynchronously. Specifically, in each process the actor collects data by interacting with the environment for a number of steps. The learner uses the collected data to compute the gradient. Then the gradient is applied to a shared model which is accessible by all processes. One major advantage of A3C: throughput increases almost linearly with the number of processes as no synchronization is used. See Fig. 2(a) for an illustration of the timing.
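As a rough illustration of this A3C-style update, the following PyTorch sketch applies Hogwild-style asynchronous gradient updates to a shared model. It is a sketch under assumptions, not the paper's or A3C's actual implementation: the tiny linear model, dummy data, squared-output loss, and hyperparameters are placeholders for the actor-critic network, collected rollouts, and actor-critic loss.

```python
import torch
import torch.multiprocessing as mp

def worker(shared_model, rank, num_updates=100):
    """One actor-learner process: compute a gradient on its own data and apply it
    directly to the shared parameters, without waiting for the other workers."""
    torch.manual_seed(rank)
    local_model = torch.nn.Linear(4, 2)                          # placeholder for the policy/value net
    optimizer = torch.optim.SGD(shared_model.parameters(), lr=1e-2)
    for _ in range(num_updates):
        local_model.load_state_dict(shared_model.state_dict())   # pull the latest shared weights
        x = torch.randn(8, 4)                                     # stand-in for collected experience
        loss = local_model(x).pow(2).mean()                       # stand-in for the actor-critic loss
        local_model.zero_grad()
        loss.backward()
        # Copy the local gradient onto the shared parameters and update them asynchronously.
        for shared_p, local_p in zip(shared_model.parameters(), local_model.parameters()):
            shared_p.grad = local_p.grad.clone()
        optimizer.step()

if __name__ == "__main__":
    shared_model = torch.nn.Linear(4, 2)
    shared_model.share_memory()                                   # make parameters visible to all processes
    workers = [mp.Process(target=worker, args=(shared_model, rank)) for rank in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print("asynchronous updates finished")
```

Because each worker pulls weights and pushes gradients on its own schedule, the gradient it applies may have been computed with parameters that are already out of date; this is exactly the ‘stale policy’ effect that the synchronous design discussed in this paper is meant to avoid.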
Funding
  • This work is supported in part by NSF under Grant # 1718221, 2008387 and MRI #1725729, NIFA award 2020-67021-32799, UIUC, Samsung, Amazon, 3M, Cisco Systems Inc. (Gift Award CG 1377144), and a Google PhD Fellowship to RY
Study subjects and analysis
Bootstrap samples: 10,000
All experiments are repeated for five runs with different random seeds. All plots in Fig. 5 show the mean of the five runs and the 95% confidence interval obtained with the Facebook Bootstrapped implementation using 10,000 bootstrap samples. For the Atari experiments, we follow the conventional ‘no-op’ procedure, i.e., at the beginning of each evaluation episode the agents perform up to 30 ‘no-op’ actions.
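For illustration, the confidence intervals above can be reproduced in spirit with a plain NumPy percentile bootstrap. This mirrors what a bootstrap library computes but does not use the Facebook Bootstrapped API; the five example scores below are made up:

```python
import numpy as np

def bootstrap_mean_ci(scores, num_samples=10_000, alpha=0.05, seed=0):
    """95% percentile-bootstrap confidence interval for the mean of a few runs."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Resample the runs with replacement and record the mean of each resample.
    idx = rng.integers(0, len(scores), size=(num_samples, len(scores)))
    means = scores[idx].mean(axis=1)
    low, high = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (low, high)

# Hypothetical evaluation scores from five runs with different random seeds.
mean, (low, high) = bootstrap_mean_ci([0.62, 0.71, 0.66, 0.58, 0.69])
print(f"mean {mean:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```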

References
  • M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons, and J. Kautz. Reinforcement learning through asynchronous advantage actor-critic on a GPU. In Proc. ICLR, 2017.
  • M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. JAIR, 2013.
  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym, 2016.
  • C. Colas, O. Sigaud, and P.-Y. Oudeyer. GEP-PG: Decoupling exploration and exploitation in deep reinforcement learning algorithms. In Proc. ICML, 2018.
  • L. de Haan and A. Ferreira. Extreme Value Theory: An Introduction. Springer, 2006.
  • P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, Y. Wu, and P. Zhokhov. OpenAI Baselines, 2017.
  • L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In Proc. ICML, 2018.
  • L. Espeholt, R. Marinier, P. Stanczyk, K. Wang, and M. Michalski. SEED RL: Scalable and efficient deep-RL with accelerated central inference. arXiv, 2020.
  • R. A. Fisher and L. H. C. Tippett. Limiting forms of the frequency distribution of the largest or smallest member of a sample. Math. Proc. Cambridge Philos. Soc., 1928.
  • S. Gu, E. Holly, T. Lillicrap, and S. Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In Proc. ICRA, 2017.
  • P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. Deep reinforcement learning that matters. In Proc. AAAI, 2017.
  • U. Jain, L. Weihs, E. Kolve, M. Rastegari, S. Lazebnik, A. Farhadi, A. Schwing, and A. Kembhavi. Two body problem: Collaborative visual task completion. In Proc. CVPR, 2019.
  • U. Jain*, L. Weihs*, E. Kolve, A. Farhadi, S. Lazebnik, A. Kembhavi, and A. G. Schwing. A Cordial Sync: Going beyond marginal policies for multi-agent embodied tasks. In Proc. ECCV, 2020. (* equal contribution)
  • I. Kostrikov. PyTorch implementations of reinforcement learning algorithms, 2018.
  • K. Kurach, A. Raichuk, P. Stanczyk, M. Zajac, O. Bachem, L. Espeholt, C. Riquelme, D. Vincent, M. Michalski, O. Bousquet, and S. Gelly. Google Research Football: A novel reinforcement learning environment. arXiv, 2019.
  • H. Küttler, N. Nardelli, T. Lavril, M. Selvatici, V. Sivakumar, T. Rocktäschel, and E. Grefenstette. TorchBeast: A PyTorch platform for distributed RL. arXiv, 2019.
  • J. Langford, A. Smola, and M. Zinkevich. Slow learners are fast. arXiv preprint arXiv:0911.0491, 2009.
  • S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. In Proc. ICRA, 2015.
  • Y. Li, I.-J. Liu, Y. Yuan, D. Chen, A. Schwing, and J. Huang. Accelerating distributed reinforcement learning with in-switch computing. In Proc. ISCA, 2019.
  • I.-J. Liu, J. Peng, and A. Schwing. Knowledge flow: Improve upon your teachers. In Proc. ICLR, 2019.
  • I.-J. Liu, R. A. Yeh, and A. G. Schwing. PIC: Permutation invariant critic for multi-agent deep reinforcement learning. In Proc. CoRL, 2019.
  • J. Luo, E. Solowjow, C. Wen, J. A. Ojea, A. M. Agogino, A. Tamar, and P. Abbeel. Reinforcement learning on variable impedance controller for high-precision robotic assembly. In Proc. ICRA, 2019.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. In NeurIPS Deep Learning Workshop, 2013.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 2015.
  • V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proc. ICML, 2016.
  • J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. Trust region policy optimization. In Proc. ICML, 2015.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv, 2017.
  • D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis. Mastering the game of Go without human knowledge. Nature, 2017.
  • A. Stooke and P. Abbeel. rlpyt: A research code base for deep reinforcement learning in PyTorch. arXiv, 2019.
  • R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 2018.
  • R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Proc. NeurIPS, 2000.
  • O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. P. Agapiou, M. Jaderberg, A. S. Vezhnevets, R. Leblond, T. Pohlen, V. Dalibard, D. Budden, Y. Sulsky, J. Molloy, T. L. Paine, C. Gulcehre, Z. Wang, T. Pfaff, Y. Wu, R. Ring, D. Yogatama, D. Wünsch, K. McKinney, O. Smith, T. Schaul, T. Lillicrap, K. Kavukcuoglu, D. Hassabis, C. Apps, and D. Silver. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 2019.
  • Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas. Sample efficient actor-critic with experience replay. In Proc. ICLR, 2017.
  • E. Wijmans, A. Kadian, A. Morcos, S. Lee, I. Essa, D. Parikh, M. Savva, and D. Batra. DD-PPO: Learning near-perfect PointGoal navigators from 2.5 billion frames. In Proc. ICLR, 2020.
  • Y. Wu, E. Mansimov, S. Liao, R. Grosse, and J. Ba. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Proc. NeurIPS, 2017.