Combining Q-Learning and Search with Amortized Value Estimates

ICLR, 2020.

Keywords: model-based RL, Q-learning, MCTS, search

Abstract:

We introduce "Search with Amortized Value Estimates" (SAVE), an approach for combining model-free Q-learning with model-based Monte-Carlo Tree Search (MCTS). In SAVE, a learned prior over state-action values is used to guide MCTS, which estimates an improved set of state-action values. The new Q-estimates are then used in combination with real experience to update the prior, amortizing the value computation performed by search.

Introduction
  • Model-based methods have been at the heart of reinforcement learning (RL) since its inception (Bellman, 1957), and have recently seen a resurgence in the era of deep learning, with powerful function approximators inspiring a variety of effective new approaches (Silver et al, 2018; Chua et al, 2018; Hamrick, 2019; Wang et al, 2019).
  • The authors propose a new method called “Search with Amortized Value Estimates” (SAVE) which uses a combination of real experience as well as the results of past searches to improve overall performance and reduce planning cost.
  • SAVE uses MCTS guided by the learned prior to produce effective behavior, even with very small search budgets and in environments with tens of thousands of possible actions per state—settings which are very challenging for traditional planners.
Highlights
  • Model-based methods have been at the heart of reinforcement learning (RL) since its inception (Bellman, 1957), and have recently seen a resurgence in the era of deep learning, with powerful function approximators inspiring a variety of effective new approaches (Silver et al, 2018; Chua et al, 2018; Hamrick, 2019; Wang et al, 2019)
  • We explore preserving the value estimates that were computed by search by amortizing them via a neural network and using this network to guide future search, resulting in an approach which works well even with very small search budgets
  • We propose a new method called “Search with Amortized Value Estimates” (SAVE) which uses a combination of real experience as well as the results of past searches to improve overall performance and reduce planning cost
  • The Q-learning component of SAVE only learns about actions which have been selected via search, and rarely sees highly suboptimal actions, resulting in a poorly approximated Q-function
  • With a search budget of 10, SAVE effectively sees 10 times as many transitions as a model-free agent trained on the same number of environment interactions
Results
  • These three changes provide a mechanism for incorporating Q-based prior knowledge into MCTS: SAVE acts as if it has visited every state-action pair once, with the estimated values being given by Qθ.
  • The amortization loss does make SAVE more sensitive to off-policy experience, as the values of QMCTS stored in the replay buffer will become less useful and potentially misleading as Qθ improves; the authors did not find this to be an issue in practice.
  • The authors demonstrate through a new Tightrope environment that SAVE performs well in settings where count-based policy approaches struggle, as discussed in Section 2.2.
  • In Section 2.2, the authors hypothesized that approaches which use count-based policy learning rather than value-based learning (e.g. Anthony et al, 2017; Silver et al, 2018) may suffer in environments with large branching factors, many suboptimal actions, and small search budgets.
  • The Q-learning component of SAVE only learns about actions which have been selected via search, and rarely sees highly suboptimal actions, resulting in a poorly approximated Q-function.
  • It is only by leveraging search during training time and incorporating an amortization loss that the authors see a synergistic result: using SAVE results in higher rewards across all tasks, strongly outperforming the other agents (see the loss sketch after this list).
  • Ablation experiments: In the past two sections, the authors compared SAVE to alternatives which do not include an amortization loss, or which use count-based policy learning rather than value-based learning.
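As a concrete illustration of the objective discussed in these bullets, the snippet below sketches a 1-step TD loss combined with an amortization term that regresses the learned Q-function toward the Q-values produced by search. This is a minimal sketch: the squared-error form of the amortization term and the names q_net, target_net, and beta are assumptions for illustration; the summary above only states that such a loss exists, not its exact form.

```python
import tensorflow as tf

def save_style_loss(q_net, target_net, batch, beta=1.0, gamma=0.99):
    """Sketch: 1-step Q-learning loss plus an amortization loss that pulls
    Q_theta toward the search estimates Q_MCTS (squared error is assumed)."""
    s, a, r, s_next, done, q_mcts = batch  # q_mcts: search Q-estimates for all actions in s

    # Standard 1-step TD target computed with the target network.
    q_next = tf.reduce_max(target_net(s_next), axis=-1)
    td_target = r + gamma * (1.0 - done) * tf.stop_gradient(q_next)

    q_all = q_net(s)                                           # [batch, num_actions]
    q_taken = tf.reduce_sum(q_all * tf.one_hot(a, q_all.shape[-1]), axis=-1)
    td_loss = tf.reduce_mean(tf.square(td_target - q_taken))

    # Amortization loss: fit Q_theta to the Q-values estimated by MCTS.
    amortization_loss = tf.reduce_mean(tf.square(q_all - q_mcts))

    return td_loss + beta * amortization_loss
```

Setting beta to zero recovers a purely model-free Q-learning update, which is roughly what the no-amortization ablations above compare against.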
Conclusion
  • With a search budget of 10, SAVE effectively sees 10 times as many transitions as a model-free agent trained on the same number of environment interactions.
  • SAVE leverages MCTS to infer a set of Q-values, and uses a combination of real experience plus the estimated Q-values to fit a Q-function, amortizing the value computation of previous searches (see the training-loop sketch after this list).
  • SAVE can be used to achieve high levels of reward with only very small search budgets, which the authors demonstrate across four distinct domains: Tightrope, Construction (Bapst et al, 2019), Marble Run, and Atari (Bellemare et al, 2013; Kapturowski et al, 2018).
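The conclusion bullets describe SAVE as a loop: search with the current Q-function as a prior, act, store the search-derived Q-values alongside the real transition, and then fit the Q-function on both. The sketch below summarizes that loop under the descriptions above; run_mcts and update are hypothetical callables standing in for the MCTS and learning components discussed in this summary, and the environment interface is simplified.

```python
import random
from typing import Callable, List, Tuple

def save_training_loop(env, q_net, run_mcts: Callable, update: Callable,
                       num_episodes: int = 1000, search_budget: int = 10,
                       batch_size: int = 16) -> None:
    """High-level sketch of the SAVE loop described above (not the authors'
    exact algorithm)."""
    replay: List[Tuple] = []
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            # MCTS guided by the learned prior Q_theta; per the bullets above,
            # even a small budget (e.g. 10 simulations) suffices.
            q_mcts, pi_mcts = run_mcts(s, q_net, budget=search_budget)
            a = random.choices(range(len(pi_mcts)), weights=pi_mcts, k=1)[0]
            s_next, r, done = env.step(a)
            # Store the real transition together with the search Q-estimates,
            # so their computation can be amortized into Q_theta.
            replay.append((s, a, r, s_next, done, q_mcts))
            s = s_next
        # Fit Q_theta with a Q-learning loss plus an amortization loss toward
        # q_mcts (see the loss sketch in the Results section).
        if len(replay) >= batch_size:
            update(q_net, random.sample(replay, k=batch_size))
```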
Summary
  • Model-based methods have been at the heart of reinforcement learning (RL) since its inception (Bellman, 1957), and have recently seen a resurgence in the era of deep learning, with powerful function approximators inspiring a variety of effective new approaches (Silver et al, 2018; Chua et al, 2018; Hamrick, 2019; Wang et al, 2019).
  • The authors propose a new method called “Search with Amortized Value Estimates” (SAVE) which uses a combination of real experience as well as the results of past searches to improve overall performance and reduce planning cost.
  • SAVE uses MCTS guided by the learned prior to produce effective behavior, even with very small search budgets and in environments with tens of thousands of possible actions per state—settings which are very challenging for traditional planners.
  • These three changes provide a mechanism for incorporating Q-based prior knowledge into MCTS: SAVE acts as if it has visited every state-action pair once, with the estimated values being given by Qθ.
  • The amortization loss does make SAVE more sensitive to off-policy experience, as the values of QMCTS stored in the replay buffer will become less useful and potentially misleading as Qθ improves; the authors did not find this to be an issue in practice.
  • The authors demonstrate through a new Tightrope environment that SAVE performs well in settings where count-based policy approaches struggle, as discussed in Section 2.2.
  • In Section 2.2, the authors hypothesized that approaches which use count-based policy learning rather than value-based learning (e.g. Anthony et al, 2017; Silver et al, 2018) may suffer in environments with large branching factors, many suboptimal actions, and small search budgets.
  • The Q-learning component of SAVE only learns about actions which have been selected via search, and rarely sees highly suboptimal actions, resulting in a poorly approximated Q-function.
  • It is only by leveraging search during training time and incorporating an amortization loss that the authors see a synergistic result: using SAVE results in higher rewards across all tasks, strongly outperforming the other agents.
  • Ablation experiments: In the past two sections, the authors compared SAVE to alternatives which do not include an amortization loss, or which use count-based policy learning rather than value-based learning.
  • With a search budget of 10, SAVE effectively sees 10 times as many transitions as a model-free agent trained on the same number of environment interactions.
  • SAVE leverages MCTS to infer a set of Q-values, and uses a combination of real experience plus the estimated Q-values to fit a Q-function, amortizing the value computation of previous searches.
  • SAVE can be used to achieve high levels of reward with only very small search budgets, which the authors demonstrate across four distinct domains: Tightrope, Construction (Bapst et al, 2019), Marble Run, and Atari (Bellemare et al, 2013; Kapturowski et al, 2018).
Related work
Funding
  • Demonstrates SAVE by incorporating it into agents that perform challenging physical reasoning tasks and play Atari games
  • Explores preserving the value estimates that were computed by search by amortizing them via a neural network and using this network to guide future search, resulting in an approach which works well even with very small search budgets
  • Proposes a new method called “Search with Amortized Value Estimates” which uses a combination of real experience as well as the results of past searches to improve overall performance and reduce planning cost
  • Focuses on the single-player setting, but notes that the formulation of MCTS is similar for two-player settings
Reference
  • Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283, 2016.
  • Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. In Advances in Neural Information Processing Systems, pp. 5360–5370, 2017.
  • Thomas Anthony, Robert Nishihara, Philipp Moritz, Tim Salimans, and John Schulman. Policy gradient search: Online planning and expert iteration without search trees. arXiv preprint arXiv:1904.03646, 2019.
  • Kamyar Azizzadenesheli, Brandon Yang, Weitang Liu, Emma Brunskill, Zachary C Lipton, and Animashree Anandkumar. Surprising negative results for generative adversarial tree search. arXiv preprint arXiv:1806.05780, pp. 1–25, 2018.
  • Victor Bapst, Alvaro Sanchez-Gonzalez, Carl Doersch, Kimberly L Stachenfeld, Pushmeet Kohli, Peter W Battaglia, and Jessica B Hamrick. Structured agents for physical construction. In Proceedings of the International Conference on Machine Learning (ICML), 2019.
  • Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, pp. 1–38, 2018.
  • Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
  • Richard Bellman. Dynamic Programming. Princeton University Press, 1957.
  • Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, and Honglak Lee. Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, pp. 8224–8234, 2018.
  • Lars Buesing, Théophane Weber, Sébastien Racanière, S. M. Ali Eslami, Danilo Rezende, David P. Reichert, Fabio Viola, Frederic Besse, Karol Gregor, Demis Hassabis, and Daan Wierstra. Learning and querying fast generative models for reinforcement learning. arXiv preprint arXiv:1802.03006, pp. 1–15, 2018.
  • Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), 2018.
  • Gregory Farquhar, Tim Rocktäschel, Maximilian Igl, and Shimon Whiteson. TreeQN and ATreeC: Differentiable tree planning for deep reinforcement learning. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), 2018.
  • Vladimir Feinberg, Alvin Wan, Ion Stoica, Michael I. Jordan, Joseph E. Gonzalez, and Sergey Levine. Model-based value expansion for efficient model-free reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), 2018.
  • Sylvain Gelly and David Silver. Combining online and offline knowledge in UCT. In Proceedings of the 24th International Conference on Machine Learning, pp. 273–280. ACM, 2007.
  • Sylvain Gelly and David Silver. Monte-Carlo tree search and rapid action value estimation in computer Go. Artificial Intelligence, 175(11):1856–1875, 2011.
  • Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep Q-learning with model-based acceleration. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), 2016.
  • Arthur Guez, Théophane Weber, Ioannis Antonoglou, Karen Simonyan, Oriol Vinyals, Daan Wierstra, Rémi Munos, and David Silver. Learning to search with MCTSnets. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), 2018.
  • Arthur Guez, Mehdi Mirza, Karol Gregor, Rishabh Kabra, Sébastien Racanière, Théophane Weber, David Raposo, Adam Santoro, Laurent Orseau, Tom Eccles, et al. An investigation of model-free planning. arXiv preprint arXiv:1901.03559, 2019.
  • Xiaoxiao Guo, Satinder Singh, Honglak Lee, Richard L Lewis, and Xiaoshi Wang. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In Advances in Neural Information Processing Systems, pp. 3338–3346, 2014.
  • Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
  • Jessica B Hamrick. Analogues of mental simulation and imagination in deep learning. Current Opinion in Behavioral Sciences, 29:8–16, 2019.
  • Jessica B. Hamrick, Andrew J. Ballard, Razvan Pascanu, Oriol Vinyals, Nicolas Heess, and Peter W. Battaglia. Metacontrol for adaptive imagination-based optimization. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), 2017.
  • Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, et al. Model-based reinforcement learning for Atari. arXiv preprint arXiv:1903.00374, 2019.
  • Gabriel Kalweit and Joschka Boedecker. Uncertainty-driven imagination for continuous deep reinforcement learning. In Proceedings of the 1st Conference on Robot Learning (CoRL 2017), 2017.
  • Steven Kapturowski, Georg Ostrovski, John Quan, Rémi Munos, and Will Dabney. Recurrent experience replay in distributed reinforcement learning. In Proceedings of the International Conference on Learning Representations, 2018.
  • Peter Karkus, David Hsu, and Wee Sun Lee. QMDP-Net: Deep learning for planning under partial observability. In Proceedings of the 31st Conference on Neural Information Processing Systems (NeurIPS 2017), 2017.
  • Bilal Kartal, Pablo Hernandez-Leal, and Matthew E Taylor. Action guidance with MCTS for deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, volume 15, pp. 153–159, 2019.
  • Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-ensemble trust-region policy optimization. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), 2018.
  • Kendall Lowrey, Aravind Rajeswaran, Sham Kakade, Emanuel Todorov, and Igor Mordatch. Plan online, learn offline: Efficient learning and exploration via model-based control. arXiv preprint arXiv:1811.01848, 2018.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
  • Junhyuk Oh, Satinder Singh, and Honglak Lee. Value prediction network. In Proceedings of the 31st Conference on Neural Information Processing Systems (NeurIPS 2017), 2017.
  • Razvan Pascanu, Yujia Li, Oriol Vinyals, Nicolas Heess, Lars Buesing, Sébastien Racanière, David Reichert, Théophane Weber, Daan Wierstra, and Peter Battaglia. Learning model-based planning from scratch. arXiv preprint arXiv:1707.06170, pp. 1–13, 2017.
  • Malcolm Reynolds, Gabriel Barth-Maron, Frederic Besse, Diego de Las Casas, Andreas Fidjeland, Tim Green, Adria Puigdomenech, Sébastien Racanière, Jack Rae, and Fabio Viola. Open sourcing Sonnet - a new library for constructing neural networks. https://deepmind.com/blog/open-sourcing-sonnet/, 2017.
  • Christopher D Rosin. Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61(3):203–230, 2011.
  • Marwin HS Segler, Mike Preuss, and Mark P Waller. Planning chemical syntheses with deep neural networks and symbolic AI. Nature, 555(7698):604, 2018.
  • Iulian Vlad Serban, Chinnadhurai Sankar, Michael Pieper, Joelle Pineau, and Yoshua Bengio. The bottleneck simulator: A model-based deep reinforcement learning approach. arXiv preprint arXiv:1807.04723, 2018.
  • Yelong Shen, Jianshu Chen, Po-Sen Huang, Yuqing Guo, and Jianfeng Gao. M-Walk: Learning to walk over graphs using Monte Carlo tree search. In Advances in Neural Information Processing Systems, pp. 6786–6797, 2018.
  • David Silver, Richard S Sutton, and Martin Müller. Sample-based learning and search with permanent and transient memories. In Proceedings of the 25th International Conference on Machine Learning, pp. 968–975. ACM, 2008.
  • David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–489, 2016.
  • David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550:354–359, 2017a.
  • David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto, and Thomas Degris. The predictron: End-to-end learning and planning. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), 2017b.
  • David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters Chess, Shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.
  • Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal planning networks. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), 2018.
  • Richard S Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.
  • Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the 7th International Conference on Machine Learning (ICML 1990), 1990.
  • Richard S. Sutton and Andrew G. Barto. Reinforcement Learning. MIT Press, 2nd edition, 2018.
Experimental details
In all experiments except Tabular Tightrope (see Section B.2) and Atari (see Appendix E), we use a distributed training setup with 1 GPU learner and 64 CPU actors. Our setup was implemented using TensorFlow (Abadi et al., 2016) and Sonnet (Reynolds et al., 2017), and gradient descent was performed using the Adam optimizer (Kingma & Ba, 2014) with the TensorFlow default parameter settings (except the learning rate).
Except in Atari (see Appendix E), we used a 1-step implementation of Q-learning with the standard setup of experience replay and a target network (Mnih et al., 2015). We controlled the rate of experience processed by the learner such that the average number of times each transition was replayed (the "replay ratio") was kept constant. For all experiments, we used a batch size of 16, a learning rate of 0.0002, a replay size of 4000 transitions (with a minimum history of 100 transitions), a replay ratio of 4, and updated the target network every 100 learning steps. These settings are collected in the configuration sketch below.
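As a quick reference, the hyperparameters above can be gathered into a single configuration object; this is just a convenience sketch, and the field names are illustrative rather than taken from the paper's code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QLearningConfig:
    """Hyperparameters listed above for the 1-step Q-learning setup
    (field names are illustrative)."""
    batch_size: int = 16
    learning_rate: float = 2e-4          # 0.0002
    replay_capacity: int = 4000          # transitions
    min_replay_history: int = 100        # transitions before learning starts
    replay_ratio: int = 4                # avg. times each transition is replayed
    target_update_period: int = 100      # learner steps between target syncs
    n_step: int = 1                      # 1-step Q-learning (except Atari)

config = QLearningConfig()
```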
We used a variant of epsilon-greedy exploration described by Bapst et al. (2019) in which epsilon is changed adaptively over the course of an episode such that it is lower earlier in the episode and higher later in the episode, with an average value of ε over the whole episode. We annealed the average value of ε from 1 to 0.01 over 1e4 episodes, as illustrated by the sketch below.
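The exact within-episode schedule from Bapst et al. (2019) is not reproduced in this summary, so the snippet below is a minimal sketch assuming a linear within-episode ramp whose mean equals the annealed target ε; only the annealing of the average value from 1 to 0.01 over 1e4 episodes is taken from the text.

```python
import numpy as np

def average_epsilon(episode: int, start: float = 1.0, end: float = 0.01,
                    anneal_episodes: int = 10_000) -> float:
    """Linearly anneal the per-episode average epsilon from `start` to `end`."""
    frac = min(episode / anneal_episodes, 1.0)
    return start + frac * (end - start)

def within_episode_epsilon(step: int, episode_length: int, eps_mean: float) -> float:
    """Assumed schedule: epsilon ramps linearly from low to high within the
    episode so that its episode-average is eps_mean (before clipping). The
    linear form is an assumption; Bapst et al. (2019) describe the general idea."""
    frac = step / max(episode_length - 1, 1)
    # Ramp from 0.5 * eps_mean at the first step to 1.5 * eps_mean at the last.
    return float(np.clip(eps_mean * (0.5 + frac), 0.0, 1.0))
```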
The PUCT search policy is based on that described by Silver et al. (2017a) and Silver et al. (2018). Specifically, we choose actions during search according to Equation 1, selecting at iteration k the action that maximizes

  Q_k(s, a) + c · π(s, a) · √(Σ_a N_k(s, a)) / (N_k(s, a) + 1),

where c is an exploration constant, π(s, a) is the prior policy, and N_k(s, a) is the total number of times action a had been taken from state s at iteration k of the search. Like Silver et al. (2017a; 2018), we add Dirichlet noise to the prior policy: π(s, a) = (1 − ε) · π_θ(s, a) + ε · η, where η ∼ Dir(1/n_actions). In our experiments we set ε = 0.25 and c = 2. During training, after search is complete, we sample an action to execute in the environment from π_MCTS(s_0, a) = N_K(s_0, a) / Σ_a N_K(s_0, a). At test time, we select the action which has the maximum visit count (with random tie-breaking). A sketch of this selection rule appears below.
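To make the selection rule concrete, here is a small numpy sketch of one PUCT selection step with Dirichlet noise on the prior, and of action selection from visit counts after search. It is written from the description above as a minimal sketch, not the authors' implementation; the function and argument names are illustrative.

```python
import numpy as np

def puct_select(q, n, prior, c=2.0, eps=0.25, add_noise=False,
                rng=np.random.default_rng()):
    """One PUCT selection step, following the description above.

    q:     array [num_actions] of current Q-estimates Q_k(s, a)
    n:     array [num_actions] of visit counts N_k(s, a)
    prior: array [num_actions] giving the prior policy pi_theta(s, a)
    """
    pi = prior
    if add_noise:
        # Dirichlet noise on the prior, as in Silver et al. (2017a; 2018):
        # pi = (1 - eps) * pi_theta + eps * eta, with eta ~ Dir(1 / num_actions)
        eta = rng.dirichlet(np.full(len(prior), 1.0 / len(prior)))
        pi = (1.0 - eps) * prior + eps * eta

    # Exploration bonus: c * pi(s, a) * sqrt(sum_a N(s, a)) / (N(s, a) + 1)
    bonus = c * pi * np.sqrt(n.sum()) / (n + 1.0)
    return int(np.argmax(q + bonus))

def act_from_visit_counts(n, training=True, rng=np.random.default_rng()):
    """After search: sample from pi_MCTS = N_K / sum_a N_K during training,
    or pick the most-visited action (random tie-breaking) at test time."""
    if training:
        return int(rng.choice(len(n), p=n / n.sum()))
    best = np.flatnonzero(n == n.max())
    return int(rng.choice(best))
```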