# How to Combine Tree-Search Methods in Reinforcement Learning

National Conference on Artificial Intelligence (AAAI), 2019.

Abstract:

Finite-horizon lookahead policies are abundantly used in Reinforcement Learning and demonstrate impressive empirical success. Usually, the lookahead policies are implemented with specific planning methods such as Monte Carlo Tree Search (e.g. in AlphaZero). Referring to the planning problem as tree search, a reasonable practice in these …


Introduction

- A significant portion of the Reinforcement Learning (RL) literature regards Policy Iteration (PI) methods.
- Relying on recent advances in the analysis of multiple-step lookahead policies (Efroni et al 2018a; 2018c), the authors study the convergence of a PI scheme whose improvement stage is h-step greedy with respect to (w.r.t.) the value function, for h > 1
- Calculating such policies can be done via Dynamic Programming (DP) or other planning methods such as tree search.
- The authors isolate a sufficient convergence condition, which they refer to as h-greedy consistency, and relate it to the previous literature on 1-step greedy policies
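The h-step greedy improvement step above can be sketched as an exhaustive depth-h lookahead. The following is a minimal illustration on a hypothetical two-state deterministic MDP (the model, action names, and value estimate are invented for the example, not taken from the paper):

```python
# Minimal sketch: computing an h-step greedy action by exhaustive
# depth-h lookahead over a toy tabular deterministic MDP.

GAMMA = 0.97

# Hypothetical toy MDP: transitions[state][action] = (reward, next_state)
transitions = {
    0: {"stay": (0.0, 0), "go": (1.0, 1)},
    1: {"stay": (2.0, 1), "go": (0.0, 0)},
}

def lookahead_value(state, depth, v):
    """Max over action sequences of the depth-step discounted return,
    bootstrapped with the value estimate v at the leaves."""
    if depth == 0:
        return v[state]
    return max(
        reward + GAMMA * lookahead_value(nxt, depth - 1, v)
        for reward, nxt in transitions[state].values()
    )

def h_greedy_action(state, h, v):
    """First action of an optimal h-step lookahead plan, i.e. the h-step
    greedy policy w.r.t. v; h = 1 recovers the usual 1-step greedy policy."""
    return max(
        transitions[state],
        key=lambda a: transitions[state][a][0]
        + GAMMA * lookahead_value(transitions[state][a][1], h - 1, v),
    )

v = {0: 0.0, 1: 0.0}          # crude value estimate at the leaves
print(h_greedy_action(0, 3, v))  # prints "go"
```

In practice this exhaustive recursion is replaced by finite-horizon DP or a tree-search method such as MCTS, which is exactly the substitution the paper studies.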

Highlights

- A significant portion of the Reinforcement Learning (RL) literature regards Policy Iteration (PI) methods
- For the policy improvement stage, theoretical analysis was mostly reserved for policies that are 1-step greedy, while recent prominent implementations of multiple-step greedy policies exhibited promising empirical behavior (Silver et al 2017b; 2017a)
- Relying on recent advances in the analysis of multiple-step lookahead policies (Efroni et al 2018a; 2018c), we study the convergence of a Policy Iteration scheme whose improvement stage is h-step greedy with respect to (w.r.t.) the value function, for h > 1
- Calculating such policies can be done via Dynamic Programming (DP) or other planning methods such as tree search
- We show that even when partial policy evaluation is performed and noise is added to it, along with a noisy policy improvement stage, the above Policy Iteration scheme converges with a γ^h contraction coefficient
- Due to the intimate relation between h-Policy Iteration and state-of-the-art Reinforcement Learning algorithms (e.g., (Silver et al 2017b)), we believe the consequences of the presented results could lead to better algorithms in the future
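The γ^h contraction claimed above can be checked numerically: applying the Bellman optimality operator h times shrinks the sup-norm distance between any two value functions by at least γ^h. A sketch on the same kind of hypothetical toy MDP (invented for illustration):

```python
# Numeric check on a toy, hypothetical MDP: T^h (h applications of the
# Bellman optimality operator T) is a gamma^h contraction in sup-norm,
# the source of the faster convergence of h-step greedy Policy Iteration.

GAMMA = 0.97

transitions = {
    0: {"stay": (0.0, 0), "go": (1.0, 1)},
    1: {"stay": (2.0, 1), "go": (0.0, 0)},
}

def bellman(v):
    """One application of the Bellman optimality operator T."""
    return {
        s: max(r + GAMMA * v[nxt] for r, nxt in acts.values())
        for s, acts in transitions.items()
    }

def apply_h(v, h):
    for _ in range(h):
        v = bellman(v)
    return v

def sup_dist(u, v):
    return max(abs(u[s] - v[s]) for s in u)

u, v = {0: 5.0, 1: -3.0}, {0: 0.0, 1: 0.0}
for h in (1, 2, 5):
    ratio = sup_dist(apply_h(u, h), apply_h(v, h)) / sup_dist(u, v)
    assert ratio <= GAMMA ** h + 1e-12   # gamma^h contraction holds
    print(h, round(ratio, 4))
```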

Methods

- The authors empirically study NC-hm-PI (Section 5) and hm-PI (Section 6) in the exact and approximate cases.
- The authors conducted the simulations on a simple N × N deterministic grid-world problem with γ = 0.97, as was done in (Efroni et al 2018a).
- The authors ran the algorithms and counted the total number of calls to the simulator.
- Each such “call” takes a state-action pair (s, a) as input, and returns the current reward and state.
- This count quantifies the total running time of the algorithm, rather than the total number of iterations
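The cost measure described above can be sketched by wrapping the simulator so every (state, action) query is counted. This is a toy stand-in (a hypothetical 4-state chain, not the paper's N × N grid world), just to show how depth-h lookahead translates into simulator calls:

```python
# Sketch of the experiments' cost measure: count simulator queries and
# use the count, not the iteration count, as the running-time proxy.

calls = 0

def simulator(state, action):
    """Hypothetical deterministic simulator: (state, action) -> (reward, next_state)."""
    global calls
    calls += 1
    if action == "go":
        return 1.0, (state + 1) % 4
    return 0.0, state

ACTIONS = ("stay", "go")
GAMMA = 0.97

def lookahead(state, depth, v):
    """Exhaustive depth-limited lookahead; every expanded edge is one call."""
    if depth == 0:
        return v[state]
    return max(
        r + GAMMA * lookahead(s2, depth - 1, v)
        for r, s2 in (simulator(state, a) for a in ACTIONS)
    )

v = {s: 0.0 for s in range(4)}
lookahead(0, 3, v)
# A depth-3 exhaustive search over |A| = 2 actions issues
# 2 + 4 + 8 = 14 simulator calls from a single root state.
print(calls)  # prints 14
```

This makes explicit why deeper lookahead (larger h) trades fewer PI iterations against exponentially more simulator calls per improvement step.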

Conclusion

**Summary and Future Work**

In this work, the authors formulated, analyzed, and tested two approaches for relaxing the evaluation stage of h-PI, a multiple-step greedy PI scheme.

- The first approach backs up v, while the second backs up T^(h−1) v or T_(π_h) T^(h−1) v.
- Although the first might seem like the natural choice, the authors showed it performs significantly worse than the second, especially when combined with short-horizon evaluation, i.e., small m or λ.
- While the authors established the non-contracting nature of the algorithms in Section 5, they did not prove that these algorithms necessarily fail to converge.
- The authors believe that further analysis of the non-contracting algorithms is intriguing, especially given their empirical converging behavior in the noiseless case.
- Understanding when the non-contracting algorithms perform well is of value, since their update rules are much simpler and easier to implement than the contracting ones
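The two evaluation-stage backups contrasted above can be sketched in a few lines. This is a hedged illustration on a hypothetical toy tabular MDP, not the paper's implementation: after computing the h-step greedy policy π_h w.r.t. v, the first variant runs m steps of T_(π_h) starting from v itself, while the second starts from the lookahead values T^(h−1) v:

```python
# Toy sketch of one h-PI iteration with an m-step partial evaluation,
# comparing the two backup variants: start from v (first) vs. from
# T^{h-1} v, the values already computed during the lookahead (second).

GAMMA = 0.97
transitions = {
    0: {"stay": (0.0, 0), "go": (1.0, 1)},
    1: {"stay": (2.0, 1), "go": (0.0, 0)},
}

def bellman_opt(v):
    return {s: max(r + GAMMA * v[n] for r, n in acts.values())
            for s, acts in transitions.items()}

def greedy_policy(v):
    return {s: max(acts, key=lambda a: acts[a][0] + GAMMA * v[acts[a][1]])
            for s, acts in transitions.items()}

def bellman_pi(v, pi):
    return {s: transitions[s][pi[s]][0] + GAMMA * v[transitions[s][pi[s]][1]]
            for s in transitions}

def h_pi_iteration(v, h, m, back_up_lookahead):
    """One h-PI iteration with an m-step partial evaluation stage."""
    t_hm1_v = v
    for _ in range(h - 1):
        t_hm1_v = bellman_opt(t_hm1_v)       # T^{h-1} v
    pi_h = greedy_policy(t_hm1_v)            # h-step greedy policy w.r.t. v
    u = t_hm1_v if back_up_lookahead else v  # second vs. first variant
    for _ in range(m):
        u = bellman_pi(u, pi_h)
    return u
```

In this toy model the second variant produces uniformly larger (closer-to-optimal) values after one iteration with small m, consistent with the gap the experiments report for short-horizon evaluation.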


Reference

- Baxter, J.; Tridgell, A.; and Weaver, L. 1999. TDLeaf(lambda): Combining temporal difference learning with game-tree search. arXiv preprint cs/9901001.
- Bertsekas, D. P., and Ioffe, S. 1996. Temporal differencesbased policy iteration and applications in neuro-dynamic programming.
- Bertsekas, D. P., and Tsitsiklis, J. N. 1995. Neuro-dynamic programming: an overview. In Decision and Control, 1995., Proceedings of the 34th IEEE Conference on, volume 1. IEEE.
- Bertsekas, D. P. 2011. Approximate policy iteration: A survey and some new methods. Journal of Control Theory and Applications 9(3):310–335.
- Browne, C. B.; Powley, E.; Whitehouse, D.; Lucas, S. M.; Cowling, P. I.; Rohlfshagen, P.; Tavener, S.; Perez, D.; Samothrakis, S.; and Colton, S. 2012. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games 4(1):1–43.
- Efroni, Y.; Dalal, G.; Scherrer, B.; and Mannor, S. 2018a. Beyond the one-step greedy approach in reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, 1386–1395.
- Efroni, Y.; Dalal, G.; Scherrer, B.; and Mannor, S. 2018b. How to combine tree-search methods in reinforcement learning. arXiv preprint arXiv:1809.01843.
- Efroni, Y.; Dalal, G.; Scherrer, B.; and Mannor, S. 2018c. Multiple-step greedy policies in online and approximate reinforcement learning. arXiv preprint arXiv:1805.07956.
- Jiang, D.; Ekwedike, E.; and Liu, H. 2018. Feedback-based tree search for reinforcement learning. In Dy, J., and Krause, A., eds., Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, 2284–2293.
- Jin, C.; Allen-Zhu, Z.; Bubeck, S.; and Jordan, M. I. 2018. Is Q-learning provably efficient? arXiv preprint arXiv:1807.03765.
- Lai, M. 2015. Giraffe: Using deep reinforcement learning to play chess. arXiv preprint arXiv:1509.01549.
- Lanctot, M.; Winands, M. H.; Pepels, T.; and Sturtevant, N. R. 2014. Monte Carlo tree search with heuristic evaluations using implicit minimax backups. arXiv preprint arXiv:1406.0486.
- Lesner, B., and Scherrer, B. 2015. Non-stationary approximate modified policy iteration. In International Conference on Machine Learning, 1567–1575.
- Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, 1928–1937.
- Munos, R. 2014. From bandits to monte-carlo tree search: The optimistic principle applied to optimization and planning. Technical report. 130 pages.
- Negenborn, R. R.; De Schutter, B.; Wiering, M. A.; and Hellendoorn, H. 2005. Learning-based model predictive control for markov decision processes. Delft Center for Systems and Control Technical Report 04-021.
- Puterman, M. L., and Shin, M. C. 1978. Modified policy iteration algorithms for discounted Markov decision problems. Management Science 24(11):1127–1137.
- Puterman, M. L. 1994. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
- Scherrer, B. 2013. Performance Bounds for Lambda Policy Iteration and Application to the Game of Tetris. Journal of Machine Learning Research 14:1175–1221.
- Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al. 2017a. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815.
- Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. 2017b. Mastering the game of Go without human knowledge. Nature 550(7676):354.
- Sutton, R. S.; Barto, A. G.; et al. 1998. Reinforcement learning: An introduction.
- Tamar, A.; Thomas, G.; Zhang, T.; Levine, S.; and Abbeel, P. 2017. Learning from the hindsight plan–episodic mpc improvement. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, 336–343. IEEE.
- Veness, J.; Silver, D.; Blair, A.; and Uther, W. 2009. Bootstrapping from game tree search. In Advances in neural information processing systems, 1937–1945.


Best Paper

Best Paper of AAAI, 2019
