Combining Deep Reinforcement Learning and Search for Imperfect-Information Games

NeurIPS 2020.

Abstract:

The combination of deep reinforcement learning and search at both training and test time is a powerful paradigm that has led to a number of successes in single-agent settings and perfect-information games, best exemplified by the success of AlphaZero. However, algorithms of this form have been unable to cope with imperfect-information games.
Introduction
  • Combining reinforcement learning with search at both training and test time (RL+Search) has led to a number of major successes in AI in recent years.
  • Existing RL+Search algorithms do not work in imperfect-information games because they make a number of assumptions that no longer hold in these settings.
  • An example of this is illustrated in Figure 1a, which shows a modified form of Rock-Paper-Scissors in which the winner receives two points when either player chooses Scissors [16].
  • Search as used in perfect-information games breaks down in these settings because players do not know which node of the game tree they are in.
Highlights
  • Combining reinforcement learning with search at both training and test time (RL+Search) has led to a number of major successes in AI in recent years
  • This paper introduces ReBeL (Recursive Belief-based Learning), a general RL+Search algorithm that converges to a Nash equilibrium in two-player zero-sum games
  • Just as one can compute an optimal policy in perfect-information games via search by learning a value function for world states, we show that one can compute an optimal policy in imperfect-information games via search by learning a value function V : B → ℝ, where B is the continuous space of public belief states (PBSs); a minimal sketch of such a value network follows this list.
  • We evaluate our techniques on turn endgame hold’em (TEH), a variant of no-limit Texas hold’em in which both players automatically check/call for the first two of the four betting rounds in the game
  • We present ReBeL, an algorithm that generalizes the paradigm of self-play reinforcement learning and search to imperfect-information games
  • We prove that ReBeL converges to an approximate Nash equilibrium in two-player zero-sum games and demonstrate that it produces superhuman performance in the benchmark game of heads-up no-limit Texas hold’em.
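As a rough illustration of the third bullet above, the sketch below shows a value network that maps a public belief state to per-infostate values. The encoding (public-state features concatenated with each player's belief vector over private infostates), the layer sizes, and the class name PBSValueNet are illustrative assumptions, not the architecture reported in the paper.

```python
# Hypothetical sketch of a value network over public belief states (PBSs).
# A PBS is encoded here as public-state features concatenated with each player's
# probability distribution over their private infostates; the class name, layer
# sizes, and output layout are illustrative, not the paper's exact architecture.
import torch
import torch.nn as nn

class PBSValueNet(nn.Module):
    def __init__(self, n_public_features: int, n_infostates: int, hidden: int = 256):
        super().__init__()
        in_dim = n_public_features + 2 * n_infostates   # public features + 2 belief vectors
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, 2 * n_infostates),        # one value per infostate per player
        )

    def forward(self, public_feats, beliefs_p1, beliefs_p2):
        x = torch.cat([public_feats, beliefs_p1, beliefs_p2], dim=-1)
        return self.net(x)

# Toy usage: 10 public features, 6 private infostates per player.
net = PBSValueNet(n_public_features=10, n_infostates=6)
values = net(torch.rand(1, 10),
             torch.softmax(torch.rand(1, 6), dim=-1),
             torch.softmax(torch.rand(1, 6), dim=-1))
print(values.shape)  # torch.Size([1, 12])
```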
Methods
  • The authors measure the exploitability of a policy π∗, defined as Σ_{i∈N} max_π v_i(π, π∗_{−i}) / |N| (a worked example on the modified Rock-Paper-Scissors game follows this list).
  • All CFR experiments use alternating-updates Linear CFR [14].
  • All FP experiments use alternating-updates Linear Optimistic FP, which is a novel variant the authors present in Appendix D.
  • The authors evaluate on the benchmark imperfect-information games of heads-up no-limit Texas hold’em poker (HUNL) and Liar’s Dice.
  • The rules for both games are provided in Appendix C.
  • The authors evaluate the techniques on turn endgame hold’em (TEH), a variant of no-limit Texas hold’em in which both players automatically check/call for the first two of the four betting rounds in the game
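To make the exploitability measure defined above concrete, here is a small worked example on the modified Rock-Paper-Scissors game of Figure 1a, in which the winner scores 2 instead of 1 whenever either player throws Scissors. The NumPy code and the helper name exploitability are illustrative, not taken from the authors' implementation.

```python
# Worked example of the exploitability measure on the modified Rock-Paper-Scissors
# game of Figure 1a, where the winner scores 2 instead of 1 whenever either player
# throws Scissors. Illustrative code, not from the authors' implementation.
import numpy as np

# Row player's payoffs (rows/cols ordered Rock, Paper, Scissors); the game is
# zero-sum, so the column player's payoff is the negation.
A = np.array([[ 0, -1,  2],
              [ 1,  0, -2],
              [-2,  2,  0]])

def exploitability(x, y, A):
    """Sum over players of the best-response value against the profile (x, y),
    divided by the number of players; zero exactly at a Nash equilibrium."""
    br_row = np.max(A @ y)       # best response of player 1 to y
    br_col = np.max(-(x @ A))    # best response of player 2 to x
    return (br_row + br_col) / 2

nash = np.array([0.4, 0.4, 0.2])           # equilibrium of the modified game
uniform = np.ones(3) / 3
print(exploitability(nash, nash, A))       # 0.0
print(exploitability(uniform, uniform, A)) # ~0.333: uniform play is exploitable
```

At the equilibrium (0.4, 0.4, 0.2) the measure is exactly zero, while uniform play can be exploited for about a third of a point per game.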
Results
  • Top poker agents typically use between 100 and 1,000 tabular CFR iterations [5, 42, 13, 16, 15].
  • Table 1 shows results for ReBeL in HUNL.
  • The authors present results against Dong Kim, a top human HUNL expert who performed best among the four top humans who played against Libratus.
  • Kim played 7,500 hands.
  • ReBeL played in under 2 seconds per hand and never needed more than 5 seconds for any single decision.
Conclusion
  • The authors present ReBeL, an algorithm that generalizes the paradigm of self-play reinforcement learning and search to imperfect-information games.
  • The authors prove that ReBeL converges to an approximate Nash equilibrium in two-player zero-sum games and demonstrate that it produces superhuman performance in the benchmark game of heads-up no-limit Texas hold’em.
  • ReBeL has some limitations that present avenues for future research.
  • The input to its value and policy functions currently grows linearly with the number of infostates in a public state.
  • ReBeL’s theoretical guarantees are limited to two-player zero-sum games.
Summary
  • Introduction:

    Combining reinforcement learning with search at both training and test time (RL+Search) has led to a number of major successes in AI in recent years.
  • Existing RL+Search algorithms do not work in imperfect-information games because they make a number of assumptions that no longer hold in these settings.
  • An example of this is illustrated in Figure 1a, which shows a modified form of Rock-Paper-Scissors in which the winner receives two points when either player chooses Scissors [16].
  • Search as used in perfect-information games breaks down in these settings because players do not know which node of the game tree they are in.
  • Objectives:

    The authors' goal is to develop a simple, flexible, effective algorithm that leverages as little expert domain knowledge as possible.
  • Methods:

    The authors measure the exploitability of a policy π∗, defined as Σ_{i∈N} max_π v_i(π, π∗_{−i}) / |N|.
  • All CFR experiments use alternating-updates Linear CFR [14].
  • All FP experiments use alternating-updates Linear Optimistic FP, which is a novel variant the authors present in Appendix D.
  • The authors evaluate on the benchmark imperfect-information games of heads-up no-limit Texas hold’em poker (HUNL) and Liar’s Dice.
  • The rules for both games are provided in Appendix C.
  • The authors evaluate the techniques on turn endgame hold’em (TEH), a variant of no-limit Texas hold’em in which both players automatically check/call for the first two of the four betting rounds in the game
  • Results:

    Top poker agents typically use between 100 and 1,000 tabular CFR iterations [5, 42, 13, 16, 15].
  • Table 1 shows results for ReBeL in HUNL.
  • The authors present results against Dong Kim, a top human HUNL expert who performed best among the four top humans who played against Libratus.
  • Kim played 7,500 hands.
  • ReBeL played in under 2 seconds per hand and never needed more than 5 seconds for any single decision.
  • Conclusion:

    The authors present ReBeL, an algorithm that generalizes the paradigm of self-play reinforcement learning and search to imperfect-information games.
  • The authors prove that ReBeL converges to an approximate Nash equilibrium in two-player zero-sum games and demonstrate that it produces superhuman performance in the benchmark game of heads-up no-limit Texas hold’em.
  • ReBeL has some limitations that present avenues for future research.
  • The input to its value and policy functions currently grows linearly with the number of infostates in a public state.
  • ReBeL’s theoretical guarantees are limited to two-player zero-sum games.
Tables
  • Table 1: Head-to-head results of our agent against benchmark bots BabyTartanian8 and Slumbot, as well as top human expert Dong Kim, measured in thousandths of a big blind per game (mbb/g). We also show performance against LBR [41], where the LBR agent must call for the first two betting rounds and can either fold, call, bet 1× pot, or bet all-in on the last two rounds. The ± shows one standard deviation. For Libratus, we list the score against all top humans in aggregate; Libratus beat Dong Kim by 29 with an estimated ± of 78. (A sketch of the mbb/g unit conversion follows these captions.)
  • Table 2: Exploitability of different algorithms on 4 variants of Liar’s Dice: 1 die with 4, 5, or 6 faces, and 2 dice with 3 faces. The top two rows give baseline numbers when a tabular version of each algorithm is run on the entire game for 1,024 iterations. The bottom two rows show the performance of ReBeL operating on subgames of depth 2 with 1,024 search iterations. For the exploitability computation of the bottom two rows, we averaged the policies of 1,024 playthroughs, so the numbers are upper bounds on exploitability.
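Table 1 reports winnings in thousandths of a big blind per game (mbb/g). The conversion itself is simple arithmetic; the sketch below uses invented chip counts purely to illustrate the unit, and the function name is hypothetical.

```python
# Illustrative unit conversion for Table 1's scale of thousandths of a big blind
# per game (mbb/g). The chip totals below are invented for the example; only the
# conversion itself follows the caption's definition.
def mbb_per_game(total_chips_won: float, big_blind: float, hands: int) -> float:
    """Average winnings in thousandths of a big blind per hand played."""
    return 1000.0 * total_chips_won / (big_blind * hands)

# e.g. winning 16,500 chips over 7,500 hands with a 100-chip big blind:
print(mbb_per_game(16_500, big_blind=100, hands=7_500))  # 22.0 mbb/g
```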
Related work
  • At a high level, our framework resembles past RL+Search algorithms used in perfect-information games [59, 56, 1, 55, 51]. These algorithms train a value network through self play. During training, a search algorithm is used in which the values of leaf nodes are determined via the value function. Additionally, a policy network may be used to guide search. These forms of RL+Search have been critical to achieving superhuman performance in benchmark perfect-information games. For example, so far no AI agent has achieved superhuman performance in Go without using search at both training and test time. However, these RL+Search algorithms are not theoretically sound in imperfect-information games and have not been shown to be successful in such settings.
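As a schematic of the RL+Search loop described above, in which self-play games are played and a search procedure evaluates leaf nodes with a learned value function, the toy below trains a tabular value function for a tiny perfect-information take-away game using one-ply lookahead during self-play. The game, the tabular value function, and the one-ply search are deliberate simplifications, not the method of AlphaZero, ReBeL, or any other cited system.

```python
# Toy illustration of the RL+Search pattern: self-play in a tiny take-away game,
# where a one-ply search evaluates leaf positions with a learned value table and
# the table is trained on game outcomes.
# Rules: players alternately remove 1 or 2 stones; whoever takes the last stone wins.
import random

N = 10                         # starting pile size
V = {0: -1.0}                  # value of a pile for the player to move;
                               # an empty pile means the previous player just won.

def value(pile):
    return V.get(pile, 0.0)    # unseen states start at a neutral estimate

def search(pile):
    """One-ply lookahead: pick the move whose resulting position is worst
    for the opponent, i.e. maximize -value(child)."""
    moves = [m for m in (1, 2) if m <= pile]
    return max(moves, key=lambda m: -value(pile - m))

for _ in range(2000):
    pile, visited = N, []
    while pile > 0:
        visited.append(pile)
        # epsilon-greedy self-play so every state keeps being explored
        if random.random() < 0.1:
            move = random.choice([m for m in (1, 2) if m <= pile])
        else:
            move = search(pile)
        pile -= move
    # The player who moved last won; walking backwards, winner and loser
    # positions alternate, so targets alternate between +1 and -1.
    for i, s in enumerate(reversed(visited)):
        target = 1.0 if i % 2 == 0 else -1.0
        V[s] = value(s) + 0.1 * (target - value(s))

# Piles that are multiples of 3 are losing for the player to move;
# their learned values should come out clearly negative.
print({s: round(v, 2) for s, v in sorted(V.items()) if s > 0})
```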
Study subjects and analysis
top humans: 4
We compare ReBeL to BabyTartanian8 [10] and Slumbot, prior champions of the Computer Poker Competition, and the local best response (LBR) [41] algorithm. We also present results against Dong Kim, a top human HUNL expert who performed best among the four top humans who played against Libratus. Kim played 7,500 hands.

Reference
  • Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. In Advances in Neural Information Processing Systems, pages 5360–5370, 2017.
  • Robert J Aumann. Agreeing to disagree. The annals of statistics, pages 1236–1239, 1976.
  • Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Debiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
  • Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up limit hold’em poker is solved. Science, 347(6218):145–149, 2015.
  • George W Brown. Iterative solution of games by fictitious play. Activity analysis of production and allocation, 13(1):374–376, 1951.
  • Noam Brown, Sam Ganzfried, and Tuomas Sandholm. Hierarchical abstraction, distributed equilibrium computation, and post-processing, with application to a champion no-limit texas hold’em agent. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 7–15. International Foundation for Autonomous Agents and Multiagent Systems, 2015.
  • Noam Brown, Adam Lerer, Sam Gross, and Tuomas Sandholm. Deep counterfactual regret minimization. In International Conference on Machine Learning, pages 793–802, 2019.
  • Noam Brown and Tuomas Sandholm. Simultaneous abstraction and equilibrium finding in games. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
  • Noam Brown and Tuomas Sandholm. Baby tartanian8: Winning agent from the 2016 annual computer poker competition. In IJCAI, pages 4238–4239, 2016.
  • Noam Brown and Tuomas Sandholm. Strategy-based warm starting for regret minimization in games. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
Noam Brown and Tuomas Sandholm. Safe and nested subgame solving for imperfect-information games. In Advances in neural information processing systems, pages 689–699, 2017.
  • Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, page eaao1733, 2017.
  • Noam Brown and Tuomas Sandholm. Solving imperfect-information games via discounted regret minimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1829–1836, 2019.
  • Noam Brown and Tuomas Sandholm. Superhuman AI for multiplayer poker. Science, page eaay2400, 2019.
Noam Brown, Tuomas Sandholm, and Brandon Amos. Depth-limited solving for imperfect-information games. In Advances in Neural Information Processing Systems, pages 7663–7674, 2018.
  • Neil Burch, Michael Johanson, and Michael Bowling. Solving imperfect information games using decomposition. In Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
  • Neil Burch, Martin Schmid, Matej Moravcik, Dustin Morill, and Michael Bowling. Aivat: A new variance reduction technique for agent evaluation in imperfect information games. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Chao-Kai Chiang, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, and Shenghuo Zhu. Online optimization with gradual variations. In Conference on Learning Theory, pages 6–1, 2012.
  • Jilles Steeve Dibangoye, Christopher Amato, Olivier Buffet, and François Charpillet. Optimally solving dec-pomdps as continuous-state mdps. Journal of Artificial Intelligence Research, 55:443–497, 2016.
  • Jakob Foerster, Francis Song, Edward Hughes, Neil Burch, Iain Dunning, Shimon Whiteson, Matthew Botvinick, and Michael Bowling. Bayesian action decoder for deep multi-agent reinforcement learning. In International Conference on Machine Learning, pages 1942–1951, 2019.
  • Sam Ganzfried and Tuomas Sandholm. Potential-aware imperfect-recall abstraction with earth mover’s distance in imperfect-information games. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, pages 682–690, 2014.
  • Sam Ganzfried and Tuomas Sandholm. Endgame solving in large imperfect-information games. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 37–45. International Foundation for Autonomous Agents and Multiagent Systems, 2015.
  • Sylvain Gelly and David Silver. Combining online and offline knowledge in uct. In Proceedings of the 24th international conference on Machine learning, pages 273–280, 2007.
  • Andrew Gilpin and Tuomas Sandholm. Optimal rhode island hold’em poker. In Proceedings of the 20th national conference on Artificial intelligence-Volume 4, pages 1684–1685, 2005.
Andrew Gilpin and Tuomas Sandholm. A competitive texas hold’em poker player via automated abstraction and real-time equilibrium computation. In Proceedings of the National Conference on Artificial Intelligence, volume 21, page 1007. AAAI Press, 2006.
  • Eric A Hansen, Daniel S Bernstein, and Shlomo Zilberstein. Dynamic programming for partially observable stochastic games. In AAAI, volume 4, pages 709–715, 2004.
  • Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.
  • Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
  • Samid Hoda, Andrew Gilpin, Javier Pena, and Tuomas Sandholm. Smoothing techniques for computing nash equilibria of sequential games. Mathematics of Operations Research, 35(2):494–512, 2010.
  • Karel Horák and Branislav Bošansky. Solving partially observable stochastic games with public observations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 2029–2036, 2019.
  • Michael Johanson, Nolan Bard, Neil Burch, and Michael Bowling. Finding optimal abstract strategies in extensive-form games. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, pages 1371–1379, 2012.
  • Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1-2):99–134, 1998.
  • Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Vojtech Kovarík and Viliam Lisy. Problems with the efg formalism: a solution attempt using observations. arXiv preprint arXiv:1906.06291, 2019.
  • Vojtech Kovarík, Martin Schmid, Neil Burch, Michael Bowling, and Viliam Lisy. Rethinking formal models of partially observable multiagent decision making. arXiv preprint arXiv:1906.11110, 2019.
  • Christian Kroer, Gabriele Farina, and Tuomas Sandholm. Solving large sequential games with the excessive gap technique. In Advances in Neural Information Processing Systems, pages 864–874, 2018.
  • Christian Kroer, Kevin Waugh, Fatma Kılınç-Karzan, and Tuomas Sandholm. Faster algorithms for extensive-form game solving via improved smoothing functions. Mathematical Programming, pages 1–33, 2018.
  • Adam Lerer, Hengyuan Hu, Jakob Foerster, and Noam Brown. Improving policies via search in cooperative partially observable games. In AAAI Conference on Artificial Intelligence, 2020.
  • David S Leslie and Edmund J Collins. Generalised weakened fictitious play. Games and Economic Behavior, 56(2):285–298, 2006.
  • Viliam Lisy and Michael Bowling. Eqilibrium approximation quality of current no-limit poker bots. In Workshops at the Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • Matej Moravcík, Martin Schmid, Neil Burch, Viliam Lisy, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513, 2017.
  • Matej Moravcik, Martin Schmid, Karel Ha, Milan Hladik, and Stephen J Gaukrodger. Refining subgames in large imperfect information games. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • John Nash. Non-cooperative games. Annals of mathematics, pages 286–295, 1951.
  • Ashutosh Nayyar, Aditya Mahajan, and Demosthenis Teneketzis. Decentralized stochastic control with partial history sharing: A common information approach. IEEE Transactions on Automatic Control, 58(7):1644–1658, 2013.
  • Andrew J Newman, Casey L Richardson, Sean M Kain, Paul G Stankiewicz, Paul R Guseman, Blake A Schreurs, and Jeffrey A Dunne. Reconnaissance blind multi-chess: an experimentation platform for isr sensor fusion and resource management. In Signal Processing, Sensor/Information Fusion, and Target Recognition XXV, volume 9842, page 984209. International Society for Optics and Photonics, 2016.
Frans Adriaan Oliehoek. Sufficient plan-time statistics for decentralized pomdps. In Twenty-Third International Joint Conference on Artificial Intelligence, 2013.
  • Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in neural information processing systems, pages 8026–8037, 2019.
  • Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. 2013.
  • Arthur L Samuel. Some studies in machine learning using the game of checkers. IBM Journal of research and development, 3(3):210–229, 1959.
  • Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265, 2019.
  • Dominik Seitz, Vojtech Kovarík, Viliam Lisy, Jan Rudolf, Shuo Sun, and Karel Ha. Value functions for depth-limited solving in imperfect-information games beyond poker. arXiv preprint arXiv:1906.06412, 2019.
  • Jack Serrino, Max Kleiman-Weiner, David C Parkes, and Josh Tenenbaum. Finding friend and foe in multi-agent games. In Advances in Neural Information Processing Systems, pages 1249–1259, 2019.
  • Claude E Shannon. Programming a computer for playing chess. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 41(314):256–275, 1950.
  • David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018.
  • David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
  • Michal Šustr, Vojtech Kovarík, and Viliam Lisy. Monte carlo continual resolving for online strategy computation in imperfect information games. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pages 224–232. International Foundation for Autonomous Agents and Multiagent Systems, 2019.
  • Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E Schapire. Fast convergence of regularized learning in games. In Advances in Neural Information Processing Systems, pages 2989–2997, 2015.
  • Gerald Tesauro. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural computation, 6(2):215–219, 1994.
  • Ben Van der Genugten. A weakened form of fictitious play in two-person zero-sum games. International Game Theory Review, 2(04):307–328, 2000.
  • Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
  • Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. In Advances in neural information processing systems, pages 1729–1736, 2008.