Mastering the game of Go without human knowledge

Julian Schrittwieser
Thomas Hubert
Lucas Baker
Matthew Lai
Adrian Bolton

Nature 550, 354–359, 2017.

DOI: https://doi.org/10.1038/nature24270

Abstract

A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks …

Introduction
  • Much progress towards artificial intelligence has been made using supervised learning systems that are trained to replicate the decisions of human experts [1,2,3,4].
  • There has been rapid progress towards this goal, using deep neural networks trained by reinforcement learning.
  • These systems have outperformed humans in computer games such as Atari [6,7] and 3D virtual environments [8,9,10].
  • The policy network was trained initially by supervised learning to accurately predict human expert moves, and was subsequently refined by policy-gradient reinforcement learning.
  • A subsequent version, which the authors refer to as AlphaGo Lee, used a similar approach, and defeated Lee Sedol, the winner of 18 international titles, in March 2016
Highlights
  • Much progress towards artificial intelligence has been made using supervised learning systems that are trained to replicate the decisions of human experts [1,2,3,4]
  • There has been rapid progress towards this goal, using deep neural networks trained by reinforcement learning
  • Our results comprehensively demonstrate that a pure reinforcement learning approach is fully feasible, even in the most challenging of domains: it is possible to train to superhuman level, without human examples or guidance, given no knowledge of the domain beyond basic rules
  • In the space of a few days, starting tabula rasa, AlphaGo Zero was able to rediscover much of this Go knowledge, as well as novel strategies that provide new insights into the oldest of games
  • Elo ratings were computed from the results of a 5-second-per-move tournament between AlphaGo Zero, AlphaGo Master, AlphaGo Lee, and AlphaGo Fan
  • The Elo ratings of AlphaGo Fan, Crazy Stone, Pachi and GnuGo were anchored to the tournament values from prior work 12, and correspond to the players reported in that work
Methods
  • Reinforcement learning. Policy iteration [20,21] is a classic algorithm that generates a sequence of improving policies, by alternating between policy evaluation – estimating the value function of the current policy – and policy improvement – using the current value function to generate a better policy.
  • Many rollouts are executed for each action; the action with the maximum mean value provides a positive training example, while all other actions provide negative training examples; a policy is trained to classify actions as positive or negative, and used in subsequent rollouts.
  • This may be viewed as a precursor to the policy component of AlphaGo Zero’s training algorithm when τ → 0
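As a concrete illustration of the rollout-based scheme above, the following minimal sketch implements classification-based policy iteration. The environment interface (legal_actions, step, rollout) and the policy object with a fit method are hypothetical stand-ins, not the paper's implementation; the greedy positive/negative labelling corresponds to the τ → 0 limit mentioned in the last point.

    # Minimal sketch of classification-based policy iteration.
    # `env` and `policy` are hypothetical interfaces, not the paper's code.
    def evaluate_actions(env, state, policy, n_rollouts=32):
        """Policy evaluation: estimate each legal action by its mean rollout value."""
        values = {}
        for action in env.legal_actions(state):
            returns = [env.rollout(env.step(state, action), policy)
                       for _ in range(n_rollouts)]
            values[action] = sum(returns) / n_rollouts
        return values

    def improvement_targets(values):
        """Policy improvement: the max-value action is a positive example,
        all other actions are negative examples (the tau -> 0 limit)."""
        best = max(values, key=values.get)
        return [(action, 1.0 if action == best else 0.0) for action in values]

    def classification_policy_iteration(env, policy, states, n_iterations=10):
        """Alternate evaluation and improvement, retraining the policy each round."""
        for _ in range(n_iterations):
            dataset = []
            for state in states:
                values = evaluate_actions(env, state, policy)
                dataset.extend((state, a, y) for a, y in improvement_targets(values))
            policy.fit(dataset)  # train a classifier on positive/negative actions
        return policy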
Results
  • The authors evaluated the relative strength of AlphaGo Zero (Figures 3a and 6) by measuring the Elo rating of each player.
  • Elo ratings were computed from the results of a 5-second-per-move tournament between AlphaGo Zero, AlphaGo Master, AlphaGo Lee, and AlphaGo Fan. The raw neural network from AlphaGo Zero was included in the tournament.
  • The Elo ratings of AlphaGo Fan, Crazy Stone, Pachi and GnuGo were anchored to the tournament values from prior work 12, and correspond to the players reported in that work.
  • The results of the matches of AlphaGo Fan against Fan Hui and AlphaGo Lee against Lee Sedol were included to ground the scale to human references, as otherwise the Elo ratings of AlphaGo are unrealistically high due to self-play bias
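To make the rating computation concrete, the sketch below fits Elo ratings to a list of game results by gradient ascent on the Elo log-likelihood, holding anchored players (here AlphaGo Fan, Crazy Stone, Pachi and GnuGo) fixed at their prior-work values. The function names, learning rate and step count are illustrative assumptions; the paper's own rating computation may differ in detail, and only the underlying Elo model is standard.

    # Minimal sketch of Elo fitting with anchored players (illustrative only).
    def expected_score(r_a, r_b):
        """Elo model: probability that a player rated r_a beats one rated r_b."""
        return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

    def fit_elo(games, anchors, lr=50.0, n_steps=5000):
        """games: list of (winner, loser) name pairs.
        anchors: dict mapping anchored players to fixed ratings (never updated).
        The constant ln(10)/400 of the exact gradient is absorbed into lr."""
        players = {p for game in games for p in game}
        ratings = {p: anchors.get(p, 0.0) for p in players}
        for _ in range(n_steps):
            grads = {p: 0.0 for p in players}
            for winner, loser in games:
                surprise = 1.0 - expected_score(ratings[winner], ratings[loser])
                grads[winner] += surprise   # under-rated winner moves up
                grads[loser] -= surprise    # over-rated loser moves down
            for p in players:
                if p not in anchors:        # anchored ratings stay fixed
                    ratings[p] += lr * grads[p] / max(1, len(games))
        return ratings

    # Hypothetical usage, anchoring AlphaGo Fan to its previously published rating:
    # ratings = fit_elo(games, anchors={"AlphaGo Fan": rating_from_prior_work})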
Conclusion
  • The authors' results comprehensively demonstrate that a pure reinforcement learning approach is fully feasible, even in the most challenging of domains: it is possible to train to superhuman level, without human examples or guidance, given no knowledge of the domain beyond basic rules.
  • A pure reinforcement learning approach requires just a few more hours to train, and achieves much better asymptotic performance, compared to training on human expert data
  • Using this approach, AlphaGo Zero defeated the strongest previous versions of AlphaGo, which were trained from human data using handcrafted features, by a large margin
  • In the space of a few days, starting tabula rasa, AlphaGo Zero was able to rediscover much of this Go knowledge, as well as novel strategies that provide new insights into the oldest of games
Tables
  • Table 1: Move prediction accuracy. Percentage accuracies of move prediction for neural networks trained by reinforcement learning (i.e. AlphaGo Zero) or supervised learning, respectively. For supervised learning, the network was trained for 3 days on KGS data (amateur games); comparative results are also shown from Silver et al. 12. For reinforcement learning, the 20 block network was trained for 3 days and the 40 block network was trained for 40 days. Networks were also evaluated on a validation set based on professional games from the GoKifu data set
  • Table 2: Game outcome prediction error. Mean squared error on game outcome predictions for neural networks trained by reinforcement learning (i.e. AlphaGo Zero) or supervised learning, respectively. For supervised learning, the network was trained for 3 days on KGS data (amateur games); comparative results are also shown from Silver et al. 12. For reinforcement learning, the 20 block network was trained for 3 days and the 40 block network was trained for 40 days. Networks were also evaluated on a validation set based on professional games from the GoKifu data set
  • Table 3: Learning rate schedule. Learning rate used during reinforcement learning and supervised learning experiments, measured in thousands of steps (mini-batch updates)
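For concreteness, the following sketch shows how the quantities summarised in these tables could be computed: top-1 move-prediction accuracy (Table 1), mean squared error of game-outcome predictions (Table 2) and a piecewise-constant learning-rate schedule (Table 3). The array shapes, the 362-way move encoding (361 board points plus pass) and the example schedule values are assumptions for illustration, not the paper's exact conventions.

    # Minimal sketches of the evaluation metrics and schedule behind Tables 1-3.
    import numpy as np

    def move_prediction_accuracy(policy_probs, expert_moves):
        """Top-1 accuracy. policy_probs: (N, 362) predicted move distributions
        (assumed encoding: 361 board points plus pass); expert_moves: (N,) indices."""
        predictions = np.argmax(policy_probs, axis=1)
        return float(np.mean(predictions == np.asarray(expert_moves)))

    def outcome_mse(value_predictions, outcomes):
        """Mean squared error between predicted values in [-1, 1] and actual
        game outcomes z in {-1, +1}, taken from the current player's perspective."""
        v = np.asarray(value_predictions, dtype=float)
        z = np.asarray(outcomes, dtype=float)
        return float(np.mean((v - z) ** 2))

    def learning_rate(step, boundaries=(400_000, 600_000), rates=(1e-2, 1e-3, 1e-4)):
        """Piecewise-constant schedule keyed on the optimisation step; the boundary
        and rate values here are placeholders, not Table 3 itself."""
        for boundary, rate in zip(boundaries, rates):
            if step < boundary:
                return rate
        return rates[-1]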
References
  • 1. Friedman, J., Hastie, T. & Tibshirani, R. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer-Verlag, 2009).
  • 2. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
  • 3. Krizhevsky, A., Sutskever, I. & Hinton, G. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097–1105 (2012).
  • 4. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 770–778 (2016).
  • 5. Hayes-Roth, F., Waterman, D. & Lenat, D. Building expert systems (Addison-Wesley, 1984).
  • 6. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
  • 7. Guo, X., Singh, S. P., Lee, H., Lewis, R. L. & Wang, X. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In Advances in Neural Information Processing Systems, 3338–3346 (2014).
  • 8. Mnih, V. et al. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 1928–1937 (2016).
  • 9. Jaderberg, M. et al. Reinforcement learning with unsupervised auxiliary tasks. International Conference on Learning Representations (2017).
  • 10. Dosovitskiy, A. & Koltun, V. Learning to act by predicting the future. In International Conference on Learning Representations (2017).
  • 11. Mandziuk, J. Computational intelligence in mind games. In Challenges for Computational Intelligence, 407–442 (2007).
  • 12. Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).
  • 13. Coulom, R. Efficient selectivity and backup operators in Monte-Carlo tree search. In International Conference on Computers and Games, 72–83 (2006).
  • 14. Kocsis, L. & Szepesvari, C. Bandit based Monte-Carlo planning. In 15th European Conference on Machine Learning, 282–293 (2006).
  • 15. Browne, C. et al. A survey of Monte-Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games 4, 1–43 (2012).
  • 16. Fukushima, K. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics 36, 193–202 (1980).
  • 17. LeCun, Y. & Bengio, Y. Convolutional networks for images, speech, and time series. In Arbib, M. (ed.) The Handbook of Brain Theory and Neural Networks, chap. 3, 276–278 (MIT Press, 1995).
  • 18. Ioffe, S. & Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 448–456 (2015).
  • 19. Hahnloser, R. H. R., Sarpeshkar, R., Mahowald, M. A., Douglas, R. J. & Seung, H. S. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature 405, 947–951 (2000).
  • 20. Howard, R. Dynamic Programming and Markov Processes (MIT Press, 1960).
  • 21. Sutton, R. & Barto, A. Reinforcement Learning: an Introduction (MIT Press, 1998).
  • 22. Bertsekas, D. P. Approximate policy iteration: a survey and some new methods. Journal of Control Theory and Applications 9, 310–335 (2011).
  • 23. Scherrer, B. Approximate policy iteration schemes: A comparison. In International Conference on Machine Learning, 1314–1322 (2014).
  • 24. Rosin, C. D. Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence 61, 203–230 (2011).
  • 25. Coulom, R. Whole-history rating: A Bayesian rating system for players of time-varying strength. In International Conference on Computers and Games, 113–124 (2008).
  • 26. Laurent, G. J., Matignon, L. & Le Fort-Piat, N. The world of independent learners is not Markovian. International Journal of Knowledge-Based and Intelligent Engineering Systems 15, 55–64 (2011).
  • 27. Foerster, J. N. et al. Stabilising experience replay for deep multi-agent reinforcement learning. In International Conference on Machine Learning (2017).
  • 28. Heinrich, J. & Silver, D. Deep reinforcement learning from self-play in imperfect-information games. In NIPS Deep Reinforcement Learning Workshop (2016).
  • 29. Jouppi, N. P., Young, C., Patil, N. et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA ’17, 1–12 (ACM, 2017).
  • 30. Maddison, C. J., Huang, A., Sutskever, I. & Silver, D. Move evaluation in Go using deep convolutional neural networks. In International Conference on Learning Representations (2015).
  • 31. Clark, C. & Storkey, A. J. Training deep convolutional neural networks to play Go. In International Conference on Machine Learning, 1766–1774 (2015).
  • 32. Tian, Y. & Zhu, Y. Better computer Go player with neural network and long-term prediction. In International Conference on Learning Representations (2016).
  • 33. Cazenave, T. Residual networks for computer Go. IEEE Transactions on Computational Intelligence and AI in Games (2017).
  • 34. AlphaGo Master online series of games (2017). URL: https://deepmind.com/research/alphago/match-archive/master.
  • AlphaGo versions. We compare the following versions of AlphaGo:
  • 1. AlphaGo Fan is the previously published program 12 that played against Fan Hui in October 2015. This program was distributed over many machines using 176 GPUs.
  • 2. AlphaGo Lee is the program that defeated Lee Sedol 4–1 in March 2016. It was previously unpublished but is similar in most regards to AlphaGo Fan 12.
  • 3. AlphaGo Master is the program that defeated top human players by 60–0 in January 2017 34. It was previously unpublished but uses the same neural network architecture, reinforcement learning algorithm, and MCTS algorithm as described in this paper. However, it uses the same handcrafted features and rollouts as AlphaGo Lee 12 and training was initialised by supervised learning from human data.
  • 4. AlphaGo Zero is the program described in this paper. It learns from self-play reinforcement learning, starting from random initial weights, without using rollouts, with no human supervision, and using only the raw board history as input features. It uses just a single machine in the Google Cloud with 4 TPUs (AlphaGo Zero could also be distributed but we chose to use the simplest possible search algorithm).
  • Domain knowledge. 1. AlphaGo Zero is provided with perfect knowledge of the game rules. These are used during MCTS, to simulate the positions resulting from a sequence of moves, and to score any simulations that reach a terminal state. Games terminate when both players pass, or after 19 · 19 · 2 = 722 moves. In addition, the player is provided with the set of legal moves in each position.
  • 2. AlphaGo Zero uses Tromp-Taylor scoring 66 during MCTS simulations and self-play training. This is because human scores (Chinese, Japanese or Korean rules) are not well-defined if the game terminates before territorial boundaries are resolved. However, all tournament and evaluation games were scored using Chinese rules (see the scoring sketch at the end of this section).
  • 3. The input features describing the position are structured as a 19 × 19 image; i.e. the neural network architecture is matched to the grid-structure of the board.
  • 4. The rules of Go are invariant under rotation and reflection; this knowledge has been used in AlphaGo Zero both by augmenting the data set during training to include rotations and reflections of each position, and by sampling random rotations or reflections of the position during MCTS (see Search Algorithm; a sketch of these transforms follows this list). Aside from komi, the rules of Go are also invariant to colour transposition; this knowledge is exploited by representing the board from the perspective of the current player (see Neural network architecture).
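A minimal sketch of the eight-fold symmetry augmentation from point 4 is given below, assuming a (channels, 19, 19) feature-plane layout and a 19 × 19 map of move probabilities; the pass move, which has no board coordinate, is assumed to be handled separately. This is an illustration, not the paper's data pipeline.

    # Minimal sketch of dihedral (rotation/reflection) augmentation for Go inputs.
    import numpy as np

    def dihedral_symmetries(planes):
        """Yield all 8 symmetries of feature planes shaped (channels, 19, 19)."""
        for k in range(4):
            rotated = np.rot90(planes, k, axes=(1, 2))
            yield rotated                       # pure rotation
            yield np.flip(rotated, axis=2)      # rotation followed by reflection

    def random_symmetry(planes, move_probs, rng=np.random):
        """Apply one random symmetry jointly to the input planes and to the
        19x19 map of move probabilities (pass probability handled elsewhere)."""
        k, flip = rng.randint(4), rng.randint(2)
        planes = np.rot90(planes, k, axes=(1, 2))
        move_probs = np.rot90(move_probs, k, axes=(0, 1))
        if flip:
            planes = np.flip(planes, axis=2)
            move_probs = np.flip(move_probs, axis=1)
        return planes, move_probs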
  • We measured the head-to-head performance of AlphaGo Zero against AlphaGo Lee, and of the 40-block instance of AlphaGo Zero against AlphaGo Master, using the same player and match conditions as were used against Lee Sedol in Seoul, 2016. Each player received 2 hours of thinking time plus 3 byoyomi periods of 60 seconds per move. All games were scored using Chinese rules with a komi of 7.5 points.
  • Data Availability. The datasets used for validation and testing are the GoKifu dataset (available from http://gokifu.com/) and the KGS dataset (available from https://u-go.net/gamerecords/).
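The Tromp-Taylor scoring referenced in point 2 can be sketched as follows: a player's area score is the number of their stones plus the number of empty points that reach only stones of their colour. The board encoding (0 empty, 1 black, 2 white) is an assumption for illustration, and komi would be added to White's total by the caller.

    # Minimal sketch of Tromp-Taylor area scoring (illustrative board encoding).
    def tromp_taylor_score(board, size=19):
        """Return (black_points, white_points) for a position given as a
        size x size grid with 0 = empty, 1 = black, 2 = white."""
        black = sum(row.count(1) for row in board)
        white = sum(row.count(2) for row in board)
        seen = [[False] * size for _ in range(size)]
        for x in range(size):
            for y in range(size):
                if board[x][y] != 0 or seen[x][y]:
                    continue
                # Flood-fill this empty region, recording which colours border it.
                region, colours, stack = [], set(), [(x, y)]
                seen[x][y] = True
                while stack:
                    cx, cy = stack.pop()
                    region.append((cx, cy))
                    for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                        if 0 <= nx < size and 0 <= ny < size:
                            if board[nx][ny] == 0 and not seen[nx][ny]:
                                seen[nx][ny] = True
                                stack.append((nx, ny))
                            elif board[nx][ny] != 0:
                                colours.add(board[nx][ny])
                if colours == {1}:
                    black += len(region)    # empty region reaches only black
                elif colours == {2}:
                    white += len(region)    # empty region reaches only white
        return black, white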
  • 35. Barto, A. G. & Duff, M. Monte Carlo matrix inversion and reinforcement learning. Advances in Neural Information Processing Systems 687–694 (1994).
  • 36. Singh, S. P. & Sutton, R. S. Reinforcement learning with replacing eligibility traces. Machine Learning 22, 123–158 (1996).
  • 37. Lagoudakis, M. G. & Parr, R. Reinforcement learning as classification: Leveraging modern classifiers. In International Conference on Machine Learning, 424–431 (2003).
  • 38. Scherrer, B., Ghavamzadeh, M., Gabillon, V., Lesner, B. & Geist, M. Approximate modified policy iteration and its application to the game of Tetris. Journal of Machine Learning Research 16, 1629–1676 (2015).
  • 39. Littman, M. L. Markov games as a framework for multi-agent reinforcement learning. In International Conference on Machine Learning, 157–163 (1994).
  • 40. Enzenberger, M. The integration of a priori knowledge into a Go playing neural network (1996). URL: http://www.cgl.ucsf.edu/go/Programs/neurogo-html/neurogo.html.
  • 41. Enzenberger, M. Evaluation in Go by a neural network using soft segmentation. In Advances in Computer Games Conference, 97–108 (2003).
  • 42. Sutton, R. Learning to predict by the method of temporal differences. Machine Learning 3, 9–44 (1988).
  • 43. Schraudolph, N. N., Dayan, P. & Sejnowski, T. J. Temporal difference learning of position evaluation in the game of Go. Advances in Neural Information Processing Systems 817–824 (1994).
  • 44. Silver, D., Sutton, R. & Muller, M. Temporal-difference search in computer Go. Machine Learning 87, 183–219 (2012).
  • 45. Silver, D. Reinforcement Learning and Simulation-Based Search in Computer Go. Ph.D. thesis, University of Alberta, Edmonton, Canada (2009).
  • 46. Gelly, S. & Silver, D. Monte-Carlo tree search and rapid action value estimation in computer Go. Artificial Intelligence 175, 1856–1875 (2011).
  • 47. Coulom, R. Computing Elo ratings of move patterns in the game of Go. International Computer Games Association Journal 30, 198–208 (2007).
  • 48. Gelly, S., Wang, Y., Munos, R. & Teytaud, O. Modification of UCT with patterns in Monte-Carlo Go. Tech. Rep. 6062, INRIA (2006).
  • 49. Baxter, J., Tridgell, A. & Weaver, L. Learning to play chess using temporal differences. Machine Learning 40, 243–263 (2000).
  • 50. Veness, J., Silver, D., Blair, A. & Uther, W. Bootstrapping from game tree search. In Advances in Neural Information Processing Systems, 1937–1945 (2009).
  • 51. Lai, M. Giraffe: Using Deep Reinforcement Learning to Play Chess. Master’s thesis, Imperial College London (2015).
  • 52. Schaeffer, J., Hlynka, M. & Jussila, V. Temporal difference learning applied to a high-performance game-playing program. In International Joint Conference on Artificial Intelligence, 529–534 (2001).
  • 53. Tesauro, G. TD-gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation 6, 215–219 (1994).
  • 54. Buro, M. From simple features to sophisticated evaluation functions. In International Conference on Computers and Games, 126–145 (1999).
  • 55. Sheppard, B. World-championship-caliber Scrabble. Artificial Intelligence 134, 241–275 (2002).
  • 56. Moravčík, M. et al. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science (2017).
  • 57. Tesauro, G. & Galperin, G. On-line policy improvement using Monte-Carlo search. In Advances in Neural Information Processing, 1068–1074 (1996).
  • 58. Tesauro, G. Neurogammon: a neural-network backgammon program. In International Joint Conference on Neural Networks, vol. 3, 33–39 (1990).
  • 59. Samuel, A. L. Some studies in machine learning using the game of checkers II - recent progress. IBM Journal of Research and Development 11, 601–617 (1967).
  • 60. Kober, J., Bagnell, J. A. & Peters, J. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research 32, 1238–1274 (2013).
  • 61. Zhang, W. & Dietterich, T. G. A reinforcement learning approach to job-shop scheduling. In International Joint Conference on Artificial Intelligence, 1114–1120 (1995).
  • 62. Cazenave, T., Balbo, F. & Pinson, S. Using a Monte-Carlo approach for bus regulation. In International IEEE Conference on Intelligent Transportation Systems, 1–6 (2009).
  • 63. Evans, R. & Gao, J. DeepMind AI reduces Google data centre cooling bill by 40% (2016). URL: https://deepmind.com/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-40/.
  • 64. Abe, N. et al. Empirical comparison of various reinforcement learning strategies for sequential targeted marketing. In IEEE International Conference on Data Mining, 3–10 (2002).
  • 65. Silver, D., Newnham, L., Barker, D., Weller, S. & McFall, J. Concurrent reinforcement learning from customer interactions. In International Conference on Machine Learning, 924–932 (2013).
  • 66. Tromp, J. Tromp-Taylor rules (1995). URL: http://tromp.github.io/go.html.
  • 67. Muller, M. Computer Go. Artificial Intelligence 134, 145–179 (2002).
  • 68. Shahriari, B., Swersky, K., Wang, Z., Adams, R. P. & de Freitas, N. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE 104, 148–175 (2016).
  • 69. Segal, R. B. On the scalability of parallel UCT. Computers and Games 6515, 36–47 (2011).