Constrained Policy Improvement for Efficient Reinforcement Learning

IJCAI 2020, pp. 2863-2871.

DOI: https://doi.org/10.24963/ijcai.2020/396

Abstract:

We propose a policy improvement algorithm for Reinforcement Learning (RL) termed Rerouted Behavior Improvement (RBI). RBI is designed to take into account the evaluation errors of the Q-function. Such errors are common in RL when learning the Q-value from finite experience data. Greedy policies or even constrained policy optimization algorithms...

Introduction
  • While Deep Reinforcement Learning (DRL) is the backbone of many Artificial Intelligence breakthroughs [Silver et al, 2017; OpenAI, 2018 accessed May 2020], factors such as safety and data efficiency may inhibit deployment of RL systems to real-world tasks.
  • An ε-greedy policy improvement is known to have a higher regret than other methods such as Upper Confidence Bound (UCB) [Auer et al, 2002] (a minimal comparison sketch follows this list).
  • The latter is much more challenging to adapt to a deep learning framework [Bellemare et al, 2016].
  • This transition from the countable state spaces of bandit and grid-world problems to the uncountable state space of a DRL framework calls for efficient improvement methods that fit into existing deep learning frameworks.
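
The sketch promised above contrasts ε-greedy with UCB1 action selection [Auer et al, 2002] on a small multi-armed bandit. The Bernoulli reward model, the arm means, the exploration rate and all function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def eps_greedy(counts, values, eps=0.1):
    """epsilon-greedy: explore uniformly with probability eps, otherwise exploit."""
    if rng.random() < eps:
        return int(rng.integers(len(values)))
    return int(np.argmax(values))

def ucb1(counts, values, t):
    """UCB1 [Auer et al, 2002]: an optimism bonus that shrinks as an arm is sampled."""
    bonus = np.sqrt(2.0 * np.log(max(t, 1)) / np.maximum(counts, 1e-8))
    bonus[counts == 0] = np.inf           # make sure every arm is tried at least once
    return int(np.argmax(values + bonus))

def run(select, true_means, steps=20000):
    """Play one Bernoulli bandit and return the cumulative expected regret."""
    k = len(true_means)
    counts, values, regret = np.zeros(k), np.zeros(k), 0.0
    for t in range(1, steps + 1):
        a = select(counts, values, t) if select is ucb1 else select(counts, values)
        r = float(rng.random() < true_means[a])      # Bernoulli reward
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]     # incremental mean estimate
        regret += max(true_means) - true_means[a]
    return regret

means = [0.1, 0.5, 0.9]
print("eps-greedy cumulative regret:", run(eps_greedy, means))
print("UCB1 cumulative regret:      ", run(ucb1, means))
```

With these (arbitrary) arm means, UCB1 typically ends with noticeably lower cumulative regret: its exploration bonus decays as arms are sampled, while ε-greedy keeps paying a fixed exploration tax on every step.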
Highlights
  • While Deep Reinforcement Learning (DRL) is the backbone of many Artificial Intelligence breakthroughs [Silver et al, 2017; OpenAI, 2018 accessed May 2020], factors such as safety and data efficiency may inhibit deployment of Reinforcement Learning systems to real-world tasks
  • An ε-greedy policy improvement is known to have a higher regret than other methods such as Upper Confidence Bound (UCB) [Auer et al, 2002]
  • Rerouted Behavior Improvement is designed for safe learning from a batch of experience, yet we show that it increases data efficiency with respect to a greedy step and other constraints such as the Total Variation (TV) [Kakade and Langford, 2002] and PPO [Schulman et al, 2017]
  • We introduced a constrained policy improvement method, termed Rerouted Behavior Improvement, designed to generate safe policy updates in the presence of common estimation errors of the Q-function
  • In Learning from Observation settings, where one has no access to new samples and off-policy Reinforcement Learning fails, the Rerouted Behavior Improvement update can safely improve upon naive behavioral cloning
  • We found that the Rerouted Behavior Improvement updates are more data-efficient than greedy and other constrained policies when training Reinforcement Learning agents
Methods
  • Table 1 reports final scores on MsPacman, Qbert, Montezuma's Revenge and Space Invaders (SI) for Humans, Behavioral cloning, RBI(0.5, 1.5), RBI(0.25, 1.75), RBI(0, 2), TV(0.25), PPO(0.5) and DQfD. First, the authors found that behavioral cloning, i.e., merely playing with the calculated average behavior β, generally yielded good results, with the exception of MsPacman, which is known to be a harder game.
  • For Qbert, the behavioral score was much better than the average score, and the authors assume that this is because good episodes tend to be longer in Qbert.
  • Unlike Reroute(0.5, 1.5), a TV-constrained update obtained lower performance than behavioral cloning in all games.
  • At first glance this may seem surprising, but it is expected after analyzing Eq (2): a TV constraint allows large relative changes to the probabilities of rarely sampled actions, exactly where the Q-value estimates are least reliable (a toy example follows this list).
  • Reroute(0.5, 1.5) always increased the behavioral score and provided the best overall performance.
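
The toy example referenced above: a single state with four actions, where the rarely sampled action's Q-value happens to be overestimated. The constraint forms used here (a total-variation budget δ for TV, per-action bounds c_min·β(a) ≤ π(a) ≤ c_max·β(a) for reroute) and all numbers are assumptions made for illustration; this is not the paper's Eq (2) itself.

```python
import numpy as np

beta   = np.array([0.70, 0.20, 0.09, 0.01])   # behavior policy; action 3 is rare
q_true = np.array([1.00, 0.80, 0.50, 0.20])   # true action values
q_hat  = q_true.copy()
q_hat[3] = 1.30   # the rarely sampled action's value happens to be overestimated

def tv_step(beta, q, delta=0.25):
    """Greedy step under a total-variation budget: move up to delta probability
    mass from the lowest-ranked actions onto the argmax of the noisy estimate."""
    pi, best, budget = beta.copy(), int(np.argmax(q)), delta
    for a in np.argsort(q):                    # take mass from the worst actions first
        if a == best or budget <= 0:
            continue
        take = min(pi[a], budget)
        pi[a] -= take; pi[best] += take; budget -= take
    return pi

def reroute_step(beta, q, c_min=0.5, c_max=1.5):
    """Greedy step under per-action bounds c_min*beta(a) <= pi(a) <= c_max*beta(a)."""
    pi, free = c_min * beta, 1.0 - c_min
    for a in np.argsort(q)[::-1]:              # hand the spare mass to the best actions
        add = min(c_max * beta[a] - pi[a], free)
        pi[a] += add; free -= add
    return pi

for name, pi in [("behavior",          beta),
                 ("TV(0.25)",          tv_step(beta, q_hat)),
                 ("reroute(0.5, 1.5)", reroute_step(beta, q_hat))]:
    print(f"{name:18s} true value = {pi @ q_true:.3f}")
```

Under these numbers the TV step moves its entire 0.25 budget onto the overestimated rare action and its true value drops below the behavioral score, while reroute(0.5, 1.5) can raise that action's probability by at most 50%, so it still improves on behavioral cloning.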
Conclusion
  • The authors introduced a constrained policy improvement method, termed RBI, designed to generate safe policy updates in the presence of common estimation errors of the Q-function.
  • In Learning from Observation settings, where one has no access to new samples and off-policy RL fails, the RBI update can safely improve upon naive behavioral cloning.
  • To train parametrized policies with the RBI updates, the authors designed a two-phase method: an actor solves a non-parametrized constrained optimization problem (Eq (4)) while a learner imitates the actor’s policy with a parametrized network (sketched below).
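
A minimal sketch of such a two-phase update, assuming the reroute constraint takes the form c_min·β(a|s) ≤ π(a|s) ≤ c_max·β(a|s) (the parameter pairs in Table 1 are read that way); the network architecture, tensor shapes and random stand-in batch are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_ACTIONS, OBS_DIM = 6, 32                       # toy sizes (assumed)

def reroute_target(beta, q, c_min=0.5, c_max=1.5):
    """Actor phase: per state, maximize sum_a pi(a) * q(a) subject to
    c_min*beta(a) <= pi(a) <= c_max*beta(a). The maximizer hands the spare
    (1 - c_min) probability mass to the highest-ranked actions first."""
    order = torch.argsort(q, dim=-1, descending=True)
    beta_sorted = torch.gather(beta, -1, order)
    room = (c_max - c_min) * beta_sorted              # extra mass each action may absorb
    taken_before = torch.cumsum(room, dim=-1) - room
    spare = 1.0 - c_min                               # mass left after the lower bounds
    alloc = torch.minimum(torch.clamp(spare - taken_before, min=0.0), room)
    pi_sorted = c_min * beta_sorted + alloc
    return torch.zeros_like(beta).scatter_(-1, order, pi_sorted)

class PolicyNet(nn.Module):
    """Learner phase: a parametrized policy that imitates the actor's target."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, N_ACTIONS))
    def forward(self, obs):
        return F.log_softmax(self.net(obs), dim=-1)

policy = PolicyNet()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# One learner update on a random stand-in batch (replay data in a real agent).
obs  = torch.randn(128, OBS_DIM)
beta = torch.softmax(torch.randn(128, N_ACTIONS), dim=-1)   # estimated behavior policy
q    = torch.randn(128, N_ACTIONS)                          # learner's Q estimates

target = reroute_target(beta, q)                   # non-parametrized improved policy
loss = -(target * policy(obs)).sum(dim=-1).mean()  # cross-entropy imitation loss
opt.zero_grad(); loss.backward(); opt.step()
```

The returned target sums to one per state and never scales any β(a|s) outside the [c_min, c_max] band; the cross-entropy loss then pulls the parametrized policy toward this constrained target rather than toward a raw greedy policy.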
Tables
  • Table 1: Learning to play Atari from a dataset of human players (final scores).
  • Table 2: Final scores.
Related work
  • Many different algorithms have been suggested to address the problems of efficiency and safety in RL. For safety, [Kakade and Langford, 2002; Pirotta et al, 2013a] introduced the concept of constrained policy optimization in RL for guaranteed monotonic improvement. TRPO [Schulman et al, 2015] adapted it to NN-parametrized policies, and its successor PPO [Schulman et al, 2017] established better empirical results with a much simpler algorithm. More recent constrained policy iteration methods include Smoothing Policies [Papini et al, 2019] and optimization via Importance Sampling [Metelli et al, 2018]. However, these algorithms assume that the Q-function is known, and the safety issue arises due to the step size in the gradient optimization. While they provide improvement guarantees when the Q-value is known, they do not address the problem of imperfect Q-value approximation.
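
For reference, the PPO constraint mentioned above is usually implemented as the clipped surrogate objective of [Schulman et al, 2017]; the sketch below is that standard form, with the 0.5 in PPO(0.5) read as the clipping range, which is an assumption about the paper's notation.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantage, clip_eps=0.5):
    """Clipped surrogate of [Schulman et al, 2017]: the probability ratio
    pi_new(a|s) / pi_old(a|s) only contributes gradient while it stays
    inside the band [1 - clip_eps, 1 + clip_eps]."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()   # negated: minimized by the optimizer

# Toy usage with hand-picked numbers.
logp_old = torch.log(torch.tensor([0.20, 0.50, 0.30]))
logp_new = torch.log(torch.tensor([0.40, 0.40, 0.20]))
advantage = torch.tensor([1.0, -0.5, 0.3])
print(ppo_clip_loss(logp_new, logp_old, advantage))
```

Note the contrast with RBI: the clipping acts on the ratio to the previous parametrized policy and still trusts the advantage estimate itself, whereas the reroute constraint is defined relative to the empirical behavior policy β precisely because value estimates of rarely sampled actions are unreliable.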
Funding
  • This work was supported by the Israel Innovation Authority.
References
  • [Argall et al., 2009] Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.
  • [Auer et al., 2002] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
  • [Bellemare et al., 2016] Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479, 2016.
  • [Dann and Brunskill, 2015] Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pages 2818–2826, 2015.
  • [Fruit et al., 2017] Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Emma Brunskill. Regret minimization in MDPs with options without prior knowledge. In Advances in Neural Information Processing Systems, pages 3166–3176, 2017.
  • [Fujimoto et al., 2018] Scott Fujimoto, Herke van Hoof, and Dave Meger. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.
  • [García and Fernández, 2015] Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
  • [Haarnoja et al., 2018] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
  • [Hasselt, 2010] Hado V Hasselt. Double Q-learning. In Advances in Neural Information Processing Systems, pages 2613–2621, 2010.
  • [Hessel et al., 2017] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. arXiv preprint arXiv:1710.02298, 2017.
  • [Hester et al., 2018] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, et al. Deep Q-learning from demonstrations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [Horgan et al., 2018] Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado Van Hasselt, and David Silver. Distributed prioritized experience replay. arXiv preprint arXiv:1803.00933, 2018.
  • [Kakade and Langford, 2002] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In ICML, volume 2, pages 267–274, 2002.
  • [Kearns and Singh, 2000] Michael J Kearns and Satinder P Singh. Bias-variance error bounds for temporal difference updates. In COLT, pages 142–147, 2000.
  • [Krause and Ong, 2011] Andreas Krause and Cheng S Ong. Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems, pages 2447–2455, 2011.
  • [Kurin et al., 2017] Vitaly Kurin, Sebastian Nowozin, Katja Hofmann, Lucas Beyer, and Bastian Leibe. The Atari grand challenge dataset. arXiv preprint arXiv:1705.10998, 2017.
  • [Metelli et al., 2018] Alberto Maria Metelli, Matteo Papini, Francesco Faccio, and Marcello Restelli. Policy optimization via importance sampling. In Advances in Neural Information Processing Systems, pages 5442–5454, 2018.
  • [Mnih et al., 2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • [OpenAI, 2018 accessed May 2020] OpenAI. OpenAI Five. https://blog.openai.com/openai-five/, 2018, accessed May 2020.
  • [Papini et al., 2019] Matteo Papini, Matteo Pirotta, and Marcello Restelli. Smoothing policies and safe policy gradients. arXiv preprint arXiv:1905.03231, 2019.
  • [Pirotta et al., 2013a] Matteo Pirotta, Marcello Restelli, and Luca Bascetta. Adaptive step-size for policy gradient methods. In Advances in Neural Information Processing Systems, pages 1394–1402, 2013.
  • [Pirotta et al., 2013b] Matteo Pirotta, Marcello Restelli, Alessio Pecorino, and Daniele Calandriello. Safe policy iteration. In International Conference on Machine Learning, pages 307–315, 2013.
  • [Puterman, 2014] Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
  • [Schulman et al., 2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
  • [Schulman et al., 2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • [Shalev-Shwartz et al., 2016] Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295, 2016.
  • [Silver et al., 2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
  • [Sutton and Barto, 2017] RS Sutton and AG Barto. Reinforcement Learning: An Introduction (complete draft), 2017.
  • [Sutton et al., 2000] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.
  • [Thomas et al., 2015] Philip Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High confidence policy improvement. In International Conference on Machine Learning, pages 2380–2388, 2015.
  • [Van Hasselt et al., 2016] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In AAAI, volume 2, page 5, Phoenix, AZ, 2016.
  • [Vanderbei and others, 2015] Robert J Vanderbei et al. Linear Programming. Springer, 2015.
  • [Vuong et al., 2019] Quan Vuong, Yiming Zhang, and Keith W. Ross. Supervised policy update. In International Conference on Learning Representations, 2019.
  • [Wang et al., 2015] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.
  • [Watkins and Dayan, 1992] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.