How to Combine Tree-Search Methods in Reinforcement Learning

Yonathan Efroni
Gal Dalal
Bruno Scherrer
Shie Mannor

AAAI Conference on Artificial Intelligence (AAAI), 2019.

Abstract:

Finite-horizon lookahead policies are abundantly used in Reinforcement Learning and demonstrate impressive empirical success. Usually, the lookahead policies are implemented with specific planning methods such as Monte Carlo Tree Search (e.g., in AlphaZero). Referring to the planning problem as tree search, a reasonable practice in these implementations …

Introduction
  • A significant portion of the Reinforcement Learning (RL) literature regards Policy Iteration (PI) methods.
  • Relying on recent advances in the analysis of multiple-step lookahead policies (Efroni et al. 2018a; 2018c), the authors study the convergence of a PI scheme whose improvement stage is h-step greedy with respect to (w.r.t.) the value function, for h > 1 (the improvement step is sketched after this list).
  • Calculating such policies can be done via Dynamic Programming (DP) or other planning methods such as tree search.
  • The authors isolate a sufficient convergence condition, which they refer to as h-greedy consistency, and relate it to the relevant 1-step greedy literature.
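For reference, the h-step greedy improvement step can be written with standard Bellman operators. The notation below is a sketch following the conventions of this line of work (Efroni et al. 2018a), not a verbatim reproduction of the paper's definitions.

```latex
% Bellman operators of a discounted MDP with reward r, transitions P and discount \gamma:
\[
(T_{\pi} v)(s) = r\big(s,\pi(s)\big) + \gamma \sum_{s'} P\big(s' \mid s,\pi(s)\big)\, v(s'),
\qquad
(T v)(s) = \max_{a}\Big[\, r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, v(s') \Big].
\]
% A policy \pi_h is h-step greedy w.r.t. v if it attains the h-step lookahead optimum:
\[
T_{\pi_h} T^{\,h-1} v \;=\; T^{h} v ,
\]
% i.e., \pi_h(s) is the first action of an optimal h-horizon plan from s that uses v
% as the value at the leaves; h = 1 recovers the usual 1-step greedy policy.
```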
Highlights
  • A significant portion of the Reinforcement Learning (RL) literature regards Policy Iteration (PI) methods
  • For the policy improvement stage, theoretical analysis was mostly reserved for policies that are 1-step greedy, while recent prominent implementations of multiple-step greedy policies exhibited promising empirical behavior (Silver et al. 2017b; 2017a).
  • Relying on recent advances in the analysis of multiple-step lookahead policies (Efroni et al. 2018a; 2018c), we study the convergence of a Policy Iteration scheme whose improvement stage is h-step greedy with respect to (w.r.t.) the value function, for h > 1.
  • Calculating such policies can be done via Dynamic Programming (DP) or other planning methods such as tree search.
  • We show that even when partial policy evaluation is performed and noise is added to it, along with a noisy policy improvement stage, the above Policy Iteration scheme converges with a γ^h contraction coefficient (a simplified, noiseless instantiation of this scheme is sketched after this list).
  • Due to the intimate relation between h-Policy Iteration and state-of-the-art Reinforcement Learning algorithms (e.g., Silver et al. 2017b), we believe the consequences of the presented results could lead to better algorithms in the future.
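To make the scheme above concrete, here is a minimal Python sketch of an exact, noiseless h-step greedy Policy Iteration on a tabular MDP. All function names, array shapes, and the use of exact policy evaluation are illustrative assumptions; the paper itself analyzes partial (m-step or λ-weighted) evaluation with noise, which this sketch omits.

```python
import numpy as np

def T(v, P, r, gamma):
    """Bellman optimality operator: (Tv)(s) = max_a [ r(s,a) + gamma * sum_s' P(s'|s,a) v(s') ].
    P has shape (A, S, S); r has shape (S, A)."""
    q = r + gamma * np.einsum("ast,t->sa", P, v)
    return q.max(axis=1)

def h_greedy_policy(v, P, r, gamma, h):
    """First action of an optimal h-horizon plan that uses v at the leaves,
    i.e. a policy pi_h satisfying T_{pi_h} T^{h-1} v = T^h v."""
    w = v
    for _ in range(h - 1):            # compute T^{h-1} v by backward induction
        w = T(w, P, r, gamma)
    q = r + gamma * np.einsum("ast,t->sa", P, w)
    return q.argmax(axis=1)           # deterministic h-step greedy policy

def evaluate(pi, P, r, gamma):
    """Exact policy evaluation: solve (I - gamma * P_pi) v = r_pi."""
    S = r.shape[0]
    P_pi = P[pi, np.arange(S), :]
    r_pi = r[np.arange(S), pi]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def h_pi(P, r, gamma, h, iters=50):
    """Noiseless h-PI sketch: alternate h-step greedy improvement with exact evaluation."""
    v = np.zeros(r.shape[0])
    for _ in range(iters):
        pi = h_greedy_policy(v, P, r, gamma, h)
        v = evaluate(pi, P, r, gamma)
    return pi, v
```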
Methods
  • The authors empirically study NC-hm-PI (Section 5) and hm-PI (Section 6) in the exact and approximate cases.
  • The authors conducted the simulations on a simple N × N deterministic grid-world problem with γ = 0.97, as was done in (Efroni et al. 2018a).
  • The authors ran the algorithms and counted the total number of calls to the simulator.
  • Each such “call” takes a state-action pair (s, a) as input, and returns the current reward and state.
  • This count quantifies the total running time of the algorithm, rather than the total number of iterations (a sketch of such a call-counting simulator follows this list).
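Below is a sketch of the cost measure described above: a simulator wrapper that answers (s, a) queries and counts them. The N × N grid dynamics, goal location, and reward below are illustrative placeholders, not the exact environment from the paper; the point is that algorithms are compared by `sim.calls` rather than by iteration count.

```python
class CountingSimulator:
    """Deterministic grid-world simulator that counts (s, a) queries."""

    ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

    def __init__(self, N, goal=(0, 0)):
        self.N = N
        self.goal = goal
        self.calls = 0          # total number of simulator calls -- the cost measure

    def step(self, s, a):
        """Takes a state-action pair (s, a); returns (reward, next_state)."""
        self.calls += 1
        dr, dc = self.ACTIONS[a]
        nr = min(max(s[0] + dr, 0), self.N - 1)
        nc = min(max(s[1] + dc, 0), self.N - 1)
        reward = 1.0 if (nr, nc) == self.goal else 0.0
        return reward, (nr, nc)

# Example: a single 1-step lookahead from state (2, 3) costs one call per action.
sim = CountingSimulator(N=5)
for a in sim.ACTIONS:
    sim.step((2, 3), a)
print(sim.calls)  # 4
```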
Conclusion
  • Summary and Future Work

    In this work, the authors formulated, analyzed and tested two approaches for relaxing the evaluation stage of h-PI, a multiple-step greedy PI scheme.
  • The first approach backs up v, while the second backs up T^{h−1}v or T^{π_h} T^{h−1}v (the two update rules are sketched after this list).
  • Although the first might seem like the natural choice, the authors showed it performs significantly worse than the second, especially when combined with short-horizon evaluation, i.e., small m or λ.
  • While the authors established the non-contracting nature of the algorithms in Section 5, they did not prove that these algorithms necessarily fail to converge.
  • The authors believe that further analysis of the non-contracting algorithms is intriguing, especially given their empirical converging behavior in the noiseless case.
  • Understanding when the non-contracting algorithms perform well is of value, since their update rules are much simpler and easier to implement than the contracting ones.
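In the notation of the earlier sketch, and omitting the noise and approximation terms the paper analyzes, the two evaluation updates contrasted here can be written as follows. This is a reconstruction from the description above, not a verbatim quote of the paper's update rules.

```latex
% First (value-backup) choice, non-contracting in general:
\[
v_{k+1} = (T_{\pi_{k+1}})^{m}\, v_k ,
\]
% Second choice, which backs up the h-step lookahead value at the root's descendants:
\[
v_{k+1} = (T_{\pi_{k+1}})^{m}\, T^{\,h-1} v_k ,
\]
% where \pi_{k+1} is h-step greedy w.r.t. v_k (the T^{\pi_h} T^{h-1} v variant applies one
% additional policy-evaluation step). Propagating T^{h-1} v_k, the return of the optimal
% tree path, back toward the root is what yields the \gamma^h contraction.
```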
References
  • Baxter, J.; Tridgell, A.; and Weaver, L. 1999. TDLeaf(lambda): Combining temporal difference learning with game-tree search. arXiv preprint cs/9901001.
  • Bertsekas, D. P., and Ioffe, S. 1996. Temporal differences-based policy iteration and applications in neuro-dynamic programming.
  • Bertsekas, D. P., and Tsitsiklis, J. N. 1995. Neuro-dynamic programming: An overview. In Proceedings of the 34th IEEE Conference on Decision and Control, volume 1. IEEE.
  • Bertsekas, D. P. 2011. Approximate policy iteration: A survey and some new methods. Journal of Control Theory and Applications 9(3):310–335.
  • Browne, C. B.; Powley, E.; Whitehouse, D.; Lucas, S. M.; Cowling, P. I.; Rohlfshagen, P.; Tavener, S.; Perez, D.; Samothrakis, S.; and Colton, S. 2012. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games 4(1):1–43.
  • Efroni, Y.; Dalal, G.; Scherrer, B.; and Mannor, S. 2018a. Beyond the one-step greedy approach in reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, 1386–1395.
  • Efroni, Y.; Dalal, G.; Scherrer, B.; and Mannor, S. 2018b. How to combine tree-search methods in reinforcement learning. arXiv preprint arXiv:1809.01843.
  • Efroni, Y.; Dalal, G.; Scherrer, B.; and Mannor, S. 2018c. Multiple-step greedy policies in online and approximate reinforcement learning. arXiv preprint arXiv:1805.07956.
  • Jiang, D.; Ekwedike, E.; and Liu, H. 2018. Feedback-based tree search for reinforcement learning. In Dy, J., and Krause, A., eds., Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, 2284–2293.
  • Jin, C.; Allen-Zhu, Z.; Bubeck, S.; and Jordan, M. I. 2018. Is Q-learning provably efficient? arXiv preprint arXiv:1807.03765.
  • Lai, M. 2015. Giraffe: Using deep reinforcement learning to play chess. arXiv preprint arXiv:1509.01549.
  • Lanctot, M.; Winands, M. H.; Pepels, T.; and Sturtevant, N. R. 2014. Monte Carlo tree search with heuristic evaluations using implicit minimax backups. arXiv preprint arXiv:1406.0486.
  • Lesner, B., and Scherrer, B. 2015. Non-stationary approximate modified policy iteration. In International Conference on Machine Learning, 1567–1575.
  • Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 1928–1937.
  • Munos, R. 2014. From bandits to Monte-Carlo tree search: The optimistic principle applied to optimization and planning. Technical report. 130 pages.
  • Negenborn, R. R.; De Schutter, B.; Wiering, M. A.; and Hellendoorn, H. 2005. Learning-based model predictive control for Markov decision processes. Delft Center for Systems and Control Technical Report 04-021.
  • Puterman, M. L., and Shin, M. C. 1978. Modified policy iteration algorithms for discounted Markov decision problems. Management Science 24(11):1127–1137.
  • Puterman, M. L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
  • Scherrer, B. 2013. Performance bounds for lambda policy iteration and application to the game of Tetris. Journal of Machine Learning Research 14:1175–1221.
  • Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al. 2017a. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815.
  • Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. 2017b. Mastering the game of Go without human knowledge. Nature 550(7676):354.
  • Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction.
  • Tamar, A.; Thomas, G.; Zhang, T.; Levine, S.; and Abbeel, P. 2017. Learning from the hindsight plan – episodic MPC improvement. In 2017 IEEE International Conference on Robotics and Automation (ICRA), 336–343. IEEE.
  • Veness, J.; Silver, D.; Blair, A.; and Uther, W. 2009. Bootstrapping from game tree search. In Advances in Neural Information Processing Systems, 1937–1945.
Best Paper of AAAI, 2019