Value targets in off-policy AlphaZero: a new greedy backup

Neural Computing & Applications (2021)

Abstract
This article presents and evaluates a family of AlphaZero value targets, subsuming previous variants and introducing AlphaZero with greedy backups (A0GB). Current state-of-the-art algorithms for playing board games use sample-based planning, such as Monte Carlo Tree Search (MCTS), combined with deep neural networks (NN) to approximate the value function. These algorithms, of which AlphaZero is a prominent example, are computationally extremely expensive to train due to their reliance on many neural network evaluations, which limits their practical performance. We improve the training process of AlphaZero by using more effective training targets for the neural network. We introduce a three-dimensional space to describe a family of training targets, covering the original AlphaZero training target as well as the soft-Z and A0C variants from the literature. We demonstrate that A0GB, using a specific new value target from this family, is able to find the optimal policy in a small tabular domain, whereas the original AlphaZero target fails to do so. In addition, we show that soft-Z, A0C, and A0GB achieve better performance and faster training than the original AlphaZero target on two benchmark board games (Connect-Four and Breakthrough). Finally, we juxtapose tabular learning with neural network-based value function approximation in Tic-Tac-Toe and compare the effects of the learning targets therein.
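
The abstract leaves the precise definitions of the targets to the paper itself. As a minimal illustrative sketch (not the authors' code: the tree structure, the exact form of each target, and all function names here are assumptions, and the A0C variant and two-player minimax sign alternation are omitted for brevity), the variants can be read as different ways of turning the statistics of a finished MCTS search into a scalar regression target for the value network:

```python
# Hypothetical sketch of AlphaZero value-target variants; Node, target_z,
# target_soft_z, and target_greedy_backup are illustrative names, not the
# paper's notation.
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Node:
    visits: int = 0
    value_sum: float = 0.0  # sum of values backed up through this node
    children: Dict[int, "Node"] = field(default_factory=dict)

    @property
    def q(self) -> float:
        """Mean search value of this node (0 if never visited)."""
        return self.value_sum / self.visits if self.visits else 0.0


def target_z(game_outcome: float) -> float:
    # Original AlphaZero target: the final self-play game outcome z.
    # Exploratory (off-policy) moves played later in the game leak into z.
    return game_outcome


def target_soft_z(root: Node) -> float:
    # Soft-Z-style target (assumed form): the root's mean search value,
    # i.e. a visit-weighted average over all simulations from this state.
    return root.q


def target_greedy_backup(root: Node) -> float:
    # A0GB-style target (assumed form): descend the tree greedily along
    # the most-visited child and back up only that principal line's value.
    node = root
    while node.children:
        node = max(node.children.values(), key=lambda c: c.visits)
    return node.q


# Toy usage: a root whose better move (action 1) received more visits.
root = Node(visits=10, value_sum=2.0)
root.children = {0: Node(visits=3, value_sum=-1.5),
                 1: Node(visits=7, value_sum=4.2)}
print(target_z(1.0), target_soft_z(root), target_greedy_backup(root))
```

The contrast this sketch is meant to highlight, consistent with the title's "off-policy" framing: the game outcome z averages over whatever exploratory moves self-play happened to make, whereas a greedy backup propagates value only along the maximizing line of play.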
Keywords
Reinforcement learning, Sample-based planning, AlphaZero, MCTS