# Memory-Augmented Monte Carlo Tree Search

AAAI, 2018.

Abstract:

This paper proposes and evaluates Memory-Augmented Monte Carlo Tree Search (M-MCTS), which provides a new approach to exploit generalization in online real-time search. The key idea of M-MCTS is to incorporate MCTS with a memory structure, where each entry contains information of a particular state. This memory is used to generate an appr…

Introduction

- The key idea of Monte Carlo Tree Search (MCTS) is to construct a search tree of states evaluated by fast Monte Carlo simulations (Coulom 2006).
- Starting from a given game state, many thousands of games are simulated by randomized self-play until an outcome is observed.
- The state value is estimated as the mean outcome of the simulations.
- With large state spaces, the accuracy of value estimation cannot be effectively guaranteed, since the mean value estimation is likely to have high variance under relatively limited search time.
- Inaccurate estimation can mislead the construction of the search tree and severely degrade the performance of the program.
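As a minimal illustration of this estimation scheme (a sketch, not the authors' implementation), the following Python snippet evaluates a state as the mean outcome of random self-play rollouts; the game interface (`legal_moves`, `apply_move`, `is_terminal`, `outcome`) is hypothetical:

```python
import random

def rollout(state, legal_moves, apply_move, is_terminal, outcome, rng=random):
    """Play one game of uniformly random self-play to the end; return the result."""
    while not is_terminal(state):
        state = apply_move(state, rng.choice(legal_moves(state)))
    return outcome(state)

def mc_value(state, n_sims, **game):
    """Estimate the state value as the mean outcome of n_sims random rollouts."""
    return sum(rollout(state, **game) for _ in range(n_sims)) / n_sims
```

With a limited simulation budget this mean is a high-variance estimate, which is exactly the weakness the paper's memory-based generalization targets.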

Highlights

- The key idea of Monte Carlo Tree Search (MCTS) is to construct a search tree of states evaluated by fast Monte Carlo simulations (Coulom 2006)
- We first study how the parameters M and τ can affect the performance of Memory-Augmented Monte Carlo Tree Search, since these two parameters together control the degree of generalization
- We believe the reason is that in this setting Memory-Augmented Monte Carlo Tree Search only focuses on the closest neighbours for generalization, but does not do enough exploration
- The feature representation used in Memory-Augmented Monte Carlo Tree Search reuses a neural network designed for move prediction
- We plan to explore approaches that incorporate feature representation learning with Memory-Augmented Monte Carlo Tree Search in an end-to-end fashion, similar to (Pritzel et al 2017; Graves et al 2016)
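As these bullets describe, M and τ together control the degree of generalization: how many memory entries contribute, and how sharply similarity is weighted. A minimal Python sketch of such a memory-based value approximation follows; the cosine-similarity addressing and the `memory_value` helper are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def memory_value(phi_s, mem_feats, mem_values, M=50, tau=0.1):
    """Approximate V(s) from the M most similar memory entries, weighted by a
    softmax over similarity with temperature tau (smaller tau -> sharper focus
    on the closest neighbours; larger M admits less similar states)."""
    sims = mem_feats @ phi_s / (
        np.linalg.norm(mem_feats, axis=1) * np.linalg.norm(phi_s) + 1e-12)
    top = np.argsort(sims)[-M:]                      # M nearest neighbours
    w = np.exp((sims[top] - sims[top].max()) / tau)  # numerically stable softmax
    w /= w.sum()
    return float(w @ mem_values[top])
```

With a small τ the weight mass concentrates on the closest neighbours; raising M past the number of genuinely similar states mixes in less related values, matching the behaviour reported in the Results section.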

Methods

- The authors' implementation applies a deep convolutional neural network (DCNN) from (Clark and Storkey 2015), which is trained for move prediction by professional game records
- It has 8 layers in total, including one convolutional layer with 64 7 × 7 filters, two convolutional layers with 64 5 × 5 filters, two layers with 48 5 × 5 filters, two layers with 32 5 × 5 filters, and one fully connected layer.
- The feature hashing dimension is set to 4096, which gives φ(s) ∈ ℝ^4096
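The hashing step can be sketched in the spirit of feature hashing (Weinberger et al. 2009): each raw feature is mapped to one of 4096 buckets by a hash, with a signed contribution. The `feature_hash` helper below is a hypothetical illustration, not the authors' code:

```python
import hashlib
import numpy as np

def feature_hash(raw_features, dim=4096):
    """Hash a sparse dict of named features into a fixed dim-dimensional vector:
    the bucket index comes from the hash value, the sign from a higher hash bit."""
    phi = np.zeros(dim)
    for name, value in raw_features.items():
        h = int(hashlib.md5(name.encode()).hexdigest(), 16)
        idx = h % dim                          # bucket in [0, dim)
        sign = 1.0 if (h >> 64) & 1 else -1.0  # signed hashing reduces bias
        phi[idx] += sign * value
    return phi
```

The fixed 4096-dimensional output keeps the memory lookup cheap regardless of how many raw DCNN features describe a state.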

Results

- The authors first study how the parameters M and τ can affect the performance of M-MCTS, since these two parameters together control the degree of generalization.
- The best result the authors have is from the setting {M = 50, τ = 0.1}, which achieves a 71% win rate against the baseline with 10,000 simulations per move.
- For M = 20 and M = 50, the performance of M-MCTS scales well with the number of simulations per move with τ = 1 and τ = 0.1.
- For M = 100, M-MCTS does not perform well under any setting of τ, since larger M increases the chance of including less similar states.
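These observations can be reproduced qualitatively with the softmax weighting itself; the snippet below (an illustration with made-up similarity scores, not the paper's data) shows how τ controls whether the closest neighbour dominates:

```python
import numpy as np

def softmax_weights(sims, tau):
    """Softmax weighting over similarity scores with temperature tau."""
    w = np.exp((sims - sims.max()) / tau)
    return w / w.sum()

# One close neighbour (similarity 0.9) and four distant ones (0.1):
sims = np.array([0.9, 0.1, 0.1, 0.1, 0.1])
sharp = softmax_weights(sims, tau=0.1)  # close neighbour dominates
flat = softmax_weights(sims, tau=1.0)   # distant states keep substantial weight
```

A small τ makes the estimate behave like a nearest-neighbour lookup, while a large τ averages over many entries, including dissimilar ones.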

Conclusion

- In this paper, the authors present an efficient approach to exploit online generalization during real-time search.
- Memory-Augmented Monte Carlo Tree Search (M-MCTS) combines the original MCTS algorithm with a memory framework to provide a memory-based online value approximation.
- The authors demonstrate that this can improve the performance of MCTS in both theory and practice.
- The authors plan to explore approaches that incorporate feature representation learning with M-MCTS in an end-to-end fashion, similar to (Pritzel et al 2017; Graves et al 2016)

Related work

- The idea of utilizing information from similar states has been studied previously in game solvers. (Kawano 1996) provided a technique where proofs of similar positions are reused for proving other nodes in a game tree. (Kishimoto and Muller 2004) applied this idea to provide an efficient solution to the Graph History Interaction problem, used for solving the games of Checkers and Go.

Memory architectures for neural networks and reinforcement learning have recently been described in Memory Networks (Weston, Chopra, and Bordes 2015), Differentiable Neural Computers (Graves et al. 2016), Matching Networks (Vinyals et al. 2016) and Neural Episodic Control (NEC) (Pritzel et al. 2017). The work most similar to our M-MCTS algorithm is NEC, which applies a memory framework to provide action value function approximation in reinforcement learning. Its memory architecture and addressing method are similar to ours. In contrast to their work, we provide theoretical analysis of how the memory can affect value estimation. Furthermore, to the best of our knowledge, this work is the first to apply a memory architecture in MCTS.

Funding

- This research was supported by NSERC, the Natural Sciences and Engineering Research Council of Canada

References

- Boucheron, S.; Lugosi, G.; and Massart, P. 2013. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press.
- Charikar, M. S. 2002. Similarity estimation techniques from rounding algorithms. In Proceedings of the thirty-fourth annual ACM symposium on Theory of computing, 380–388. ACM.
- Childs, B. E.; Brodeur, J. H.; and Kocsis, L. 2008. Transpositions and move groups in Monte Carlo tree search. In IEEE Symposium On Computational Intelligence and Games, 2008., 389–395.
- Clark, C., and Storkey, A. J. 2015. Training deep convolutional neural networks to play Go. In Bach, F. R., and Blei, D. M., eds., Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6–11 July 2015, volume 37 of JMLR Proceedings, 1766–1774. JMLR.org.
- Coulom, R. 2006. Efficient selectivity and backup operators in Monte-Carlo tree search. In van den Herik, J.; Ciancarini, P.; and Donkers, H., eds., Proceedings of the 5th International Conference on Computer and Games, volume 4630/2007 of Lecture Notes in Computer Science, 72–83.
- Enzenberger, M., and Muller, M. 2008-2017. Fuego. http://fuego.sourceforge.net.
- Friedman, J.; Hastie, T.; and Tibshirani, R. 2001. The elements of statistical learning, volume 1. Springer series in statistics, Springer, Berlin.
- Gelly, S., and Silver, D. 2007. Combining online and offline knowledge in UCT. In ICML ’07: Proceedings of the 24th international conference on Machine learning, 273– 280. ACM.
- Gelly, S., and Silver, D. 2011. Monte-Carlo Tree Search and Rapid Action Value Estimation in computer Go. Artificial Intelligence 175(11):1856–1875.
- Graves, A.; Wayne, G.; Reynolds, M.; Harley, T.; Danihelka, I.; Grabska-Barwinska, A.; Colmenarejo, S. G.; Grefenstette, E.; Ramalho, T.; Agapiou, J.; et al. 2016. Hybrid computing using a neural network with dynamic external memory. Nature 538(7626):471–476.
- Haarnoja, T.; Tang, H.; Abbeel, P.; and Levine, S. 2017. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, Australia, 6–11 August 2017.
- Kawano, Y. 1996. Using similar positions to search game trees. In Nowakowski, R. J., ed., Games of No Chance, volume 29 of MSRI Publications, 193–202. Cambridge University Press.
- Kishimoto, A., and Muller, M. 2004. A general solution to the graph history interaction problem. In Nineteenth National Conference on Artificial Intelligence (AAAI 2004), 644–649.
- Kocsis, L., and Szepesvari, C. 2006. Bandit based Monte-Carlo planning. In Furnkranz, J.; Scheffer, T.; and Spiliopoulou, M., eds., Machine Learning: ECML 2006, volume 4212 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg. 282–293.
- Muller, M. 2002. Computer Go. Artificial Intelligence 134(1–2):145–179.
- Nachum, O.; Norouzi, M.; Xu, K.; and Schuurmans, D. 2017. Bridging the gap between value and policy based reinforcement learning. arXiv preprint arXiv:1702.08892.
- Pritzel, A.; Uria, B.; Srinivasan, S.; Puigdomenech, A.; Vinyals, O.; Hassabis, D.; Wierstra, D.; and Blundell, C. 2017. Neural episodic control. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, Australia, 6–11 August 2017.
- Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489.
- Silver, D.; Sutton, R.; and Muller, M. 2012. Temporal-difference search in computer Go. Machine Learning 87(2):183–219.
- Srinivasan, S.; Talvitie, E.; Bowling, M. H.; and Szepesvari, C. 2015. Improving exploration in UCT using local manifolds. In AAAI, 3386–3392.
- Tian, Y., and Zhu, Y. 2015. Better computer Go player with neural network and long-term prediction. In International Conference on Learning Representations.
- Vinyals, O.; Blundell, C.; Lillicrap, T.; Wierstra, D.; et al. 2016. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, 3630–3638.
- Weinberger, K.; Dasgupta, A.; Langford, J.; Smola, A.; and Attenberg, J. 2009. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, 1113–1120. ACM.
- Weston, J.; Chopra, S.; and Bordes, A. 2015. Memory networks. In International Conference on Learning Representations.
- Ziebart, B. D. 2010. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Ph.D. diss., Carnegie Mellon University.

Best Paper of AAAI, 2018