Mastering the game of Go with deep neural networks and tree search
Nature, Volume 529, Issue 7587, 2016, Pages 484-489.
Effective move selection and position evaluation functions for Go, based on deep neural networks that are trained by a novel combination of supervised and reinforcement learning
The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses 'value networks' to evaluate board positions and 'policy networks' to select moves...
- The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves.
- We use these neural networks to reduce the effective depth and breadth of the search tree: evaluating positions using a value network, and sampling actions using a policy network.
- We train a value network vθ that predicts the winner of games played by the RL policy network against itself.
- The SL policy network is trained on randomly sampled state-action pairs (s, a), using stochastic gradient ascent to maximize the likelihood of the human move a selected in state s.
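The supervised-learning step can be sketched as follows. This is a minimal illustration with a linear softmax policy standing in for the paper's 13-layer convolutional network; the function names and the linear parameterization are illustrative, not the paper's implementation:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over move logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def sl_policy_update(theta, features, action, lr=0.01):
    """One stochastic-gradient-ascent step on log p_sigma(a|s)
    for a toy linear softmax policy over moves."""
    logits = features @ theta          # shape: (num_moves,)
    probs = softmax(logits)
    # Gradient of log p(a|s) w.r.t. the logits: one_hot(a) - probs.
    grad_logits = -probs
    grad_logits[action] += 1.0
    # Chain rule through the linear layer (outer product with features).
    theta += lr * np.outer(features, grad_logits)
    return theta
```

Repeated updates on the same (s, a) pair raise the probability assigned to the human move a, which is exactly what maximizing the log-likelihood does.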
- We evaluated the performance of the RL policy network in game play, sampling each move a_t ~ p_ρ(·|s_t) from its output probability distribution over actions.
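Sampling a move from the policy's output distribution can be sketched as below; `sample_move` is a hypothetical helper, not an identifier from the paper:

```python
import numpy as np

def sample_move(policy_probs, rng):
    """Draw a move index a_t ~ p_rho(.|s_t) from the policy's
    probability distribution over legal actions."""
    return rng.choice(len(policy_probs), p=policy_probs)
```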
- Reinforcement learning of value networks: the final stage of the training pipeline focuses on position evaluation, estimating a value function v^p(s) that predicts the outcome from position s of games played by using policy p for both players28–30: v^p(s) = E[z_t | s_t = s, a_{t...T} ~ p].
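The value-network regression can be sketched as stochastic gradient descent on the squared error between the value estimate and the game outcome z. Again a toy linear estimator stands in for the paper's v_θ network, and the names are illustrative:

```python
import numpy as np

def value_update(theta, features, z, lr=0.01):
    """One SGD step reducing the mean squared error between a toy
    linear value estimate v_theta(s) = theta . phi(s) and the
    self-play game outcome z."""
    v = features @ theta
    # Gradient of 0.5 * (v - z)^2 with respect to theta.
    theta -= lr * (v - z) * features
    return theta
```

Iterating this update drives v_θ(s) toward the expected outcome of self-play games from s, which is the regression target the highlight describes.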
- The leaf node is evaluated in two very different ways: first, by the value network vθ; and second, by the outcome zL of a random rollout played out until terminal step T using the fast rollout policy pπ; these evaluations are combined, using a mixing parameter λ, into a leaf evaluation V
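The mixed leaf evaluation described above is a simple convex combination, V(s_L) = (1 − λ)·v_θ(s_L) + λ·z_L, sketched here (the function name is illustrative):

```python
def leaf_value(v_theta, z_rollout, lam=0.5):
    """Combine the value-network estimate v_theta(s_L) and the
    fast-rollout outcome z_L with mixing parameter lambda:
    V(s_L) = (1 - lambda) * v_theta(s_L) + lambda * z_L."""
    return (1.0 - lam) * v_theta + lam * z_rollout
```

With λ = 0, search relies purely on the value network; with λ = 1, purely on rollouts; λ = 0.5 weights them equally.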
- Even without rollouts, AlphaGo exceeded the performance of all other Go programs, demonstrating that value networks provide a viable alternative to Monte Carlo evaluation in Go. The mixed evaluation (λ = 0.5) performed best, winning ≥95% of games against other variants.
- Discussion In this work we have developed a Go program, based on a combination of deep neural networks and tree search, that plays at the level of the strongest human players, thereby achieving one of artificial intelligence’s “grand challenges”31–33.
- For the first time, we have developed effective move selection and position evaluation functions for Go, based on deep neural networks that are trained by a novel combination of supervised and reinforcement learning.
- During the match against Fan Hui, AlphaGo evaluated thousands of times fewer positions than Deep Blue did in its chess match against Kasparov4; compensating by selecting those positions more intelligently, using the policy network, and evaluating them more precisely, using the value network—an approach that is perhaps closer to how humans play.
- By combining tree search with policy and value networks, AlphaGo has reached a professional level in Go, providing hope that human-level performance can be achieved in other seemingly intractable artificial intelligence domains.
- We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0.
- When played head-to-head, the RL policy network won more than 80% of games against the SL policy network