Faster Game Solving via Predictive Blackwell Approachability: Connecting Regret Matching and Mirror Descent


Abstract:

Blackwell approachability is a framework for reasoning about repeated games with vector-valued payoffs. We introduce predictive Blackwell approachability, where an estimate of the next payoff vector is given, and the decision maker tries to achieve better performance based on the accuracy of that estimator. In order to derive algorithms…

Introduction
  • Extensive-form games (EFGs) are the standard class of games that can be used to model sequential interaction, outcome uncertainty, and imperfect information.
  • Operationalizing these models requires algorithms for computing game-theoretic equilibria.
  • A recent success of EFGs is the use of Nash equilibrium for several recent poker AI milestones, such as essentially solving the game of limit Texas hold’em [6], and beating top human poker pros in no-limit Texas hold’em with the Libratus AI [7].
Highlights
  • Extensive-form games (EFGs) are the standard class of games that can be used to model sequential interaction, outcome uncertainty, and imperfect information
  • We show that regret matching (RM) and RM+ are the algorithms that result from running FTRL and online mirror descent (OMD), respectively, to select the halfspace to force at all times in the underlying Blackwell approachability game
  • We introduced the notion of predictive Blackwell approachability
  • We showed that predictive FTRL and OMD can be applied to this unbounded setting
  • OMD applied to the same problem turned out to be equivalent to RM+, which is vastly faster than RM in practice
  • Combining predictive regret matching (PRM)+ with CFR, we introduced the PCFR+ algorithm for solving EFGs
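The contrast between RM and RM+ drawn above comes down to whether accumulated regrets are clipped at zero. A minimal sketch in Python (function and variable names are ours, not the paper's):

```python
import numpy as np

def rm_strategy(regrets):
    """Map accumulated regrets to a strategy on the simplex:
    normalize the positive part; play uniformly when no action
    has positive regret."""
    pos = np.maximum(regrets, 0.0)
    total = pos.sum()
    return pos / total if total > 0 else np.full(len(regrets), 1.0 / len(regrets))

def rm_update(regrets, utils, strategy):
    """Regret matching (RM): accumulate instantaneous regrets;
    the accumulator may become arbitrarily negative."""
    return regrets + (utils - strategy @ utils)

def rm_plus_update(regrets, utils, strategy):
    """RM+: identical to RM except the accumulator is clipped at
    zero after every update, so past negative regret is forgotten."""
    return np.maximum(regrets + (utils - strategy @ utils), 0.0)
```

The clipping is one intuition for the practical speed difference noted above: an action that was bad for a long stretch does not have to "pay off" a large negative regret balance before RM+ starts playing it again.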
Methods
  • The authors conduct experiments on solving two-player zero-sum games. As mentioned previously, for EFGs the CFR framework is used to decompose regrets into local regret-minimization problems at each simplex corresponding to a decision point in the game [42, 16], and the authors do the same.
  • The average strategy x̄_T is computed with quadratically weighted averaging, x̄_T = (∑_{t=1}^T t² x_t) / (∑_{t=1}^T t²), and the authors use alternating updates
  • The authors call this algorithm PCFR+.
  • The experiments shown in the main body are representative of those in the appendix.
  • For all non-predictive algorithms (CFR+, LCFR, and DCFR), the authors let m_t = 0
  • Both y-axes are in log scale
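A minimal sketch of the predictive step and the weighted averaging described above, assuming the previous regret vector serves as the prediction m_t (function and variable names are ours; setting m_t = 0 recovers the non-predictive RM+ update, as in the CFR+, LCFR, and DCFR baselines):

```python
import numpy as np

def prm_plus_strategy(R, m):
    """Predictive RM+ decision: normalize the positive part of the
    accumulated regret R plus a prediction m of the next regret
    vector. With m = 0 this is the ordinary RM+ strategy."""
    pos = np.maximum(R + m, 0.0)
    total = pos.sum()
    return pos / total if total > 0 else np.full(len(R), 1.0 / len(R))

def prm_plus_observe(R, utils, strategy):
    """After observing the utilities, accumulate the realized regret
    and clip at zero (as in RM+); return the realized regret so it
    can serve as the next prediction m."""
    r = utils - strategy @ utils
    return np.maximum(R + r, 0.0), r

def quadratic_average(strategies):
    """Quadratically weighted average of the iterates:
    x_bar = (sum_t t^2 x_t) / (sum_t t^2)."""
    X = np.asarray(strategies, dtype=float)
    w = np.arange(1, len(X) + 1, dtype=float) ** 2
    return (w[:, None] * X).sum(axis=0) / w.sum()
```

Quadratic weights put most of the averaging mass on late iterates, which matters because the last iterates of these algorithms tend to be the most accurate.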
Conclusion
  • Conclusions and future research: The authors introduced the notion of predictive Blackwell approachability.
  • The authors showed that predictive FTRL and OMD can be applied to this unbounded setting.
  • This extended reduction allowed them to show that FTRL applied to the decision of which halfspace to force in Blackwell approachability is equivalent to the regret matching algorithm.
  • The authors showed that the predictive variants of FTRL and OMD yield predictive algorithms for Blackwell approachability, as well as predictive variants of RM and RM+.
  • Can PRM+ guarantee T^{-1} convergence on matrix games, like optimistic FTRL and OMD, or do its less stable updates prevent that? Can one develop a predictive variant of DCFR, which is faster on poker domains? Can one combine DCFR and PCFR+, so that DCFR is faster initially but PCFR+ overtakes it later? If the cross-over point could be approximated, this might yield a best-of-both-worlds algorithm.
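The equivalence between FTRL on the halfspace choice and regret matching can be summarized in one line: the normal of the forced halfspace is the normalized positive part of the cumulative regret, and the induced decision is exactly the regret-matching strategy. A sketch (the notation is ours, not taken verbatim from the paper):

```latex
% Halfspace normal chosen at time t in the Blackwell approachability
% game, and the induced decision: the regret-matching strategy.
\theta^t = \frac{[R^{t-1}]^+}{\bigl\| [R^{t-1}]^+ \bigr\|_2},
\qquad
x^t[a] = \frac{\bigl[R^{t-1}[a]\bigr]^+}{\sum_{a'} \bigl[R^{t-1}[a']\bigr]^+}.
```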
Funding
  • This material is based on work supported by the National Science Foundation under grants IIS-1718457, IIS-1617590, IIS-1901403, and CCF-1733556, and the ARO under awards W911NF-17-1-0082 and W911NF-20-1-0081
  • Gabriele Farina is supported by a Facebook fellowship
References
  • Jacob Abernethy, Peter L Bartlett, and Elad Hazan. Blackwell approachability and no-regret learning are equivalent. In COLT, pages 27–46, 2011.
  • David Blackwell. Controlled random walks. In Proceedings of the International Congress of Mathematicians, volume 3, pages 336–338, 1954.
  • David Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6:1–8, 1956.
  • Branislav Bošanský, Christopher Kiekintveld, Viliam Lisý, and Michal Pěchouček. An exact double-oracle algorithm for zero-sum extensive-form games with imperfect information. Journal of Artificial Intelligence Research, pages 829–866, 2014.
  • Branislav Bošanský and Jiří Čermák. Sequence-form algorithm for computing Stackelberg equilibria in extensive-form games. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
  • Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up limit hold'em poker is solved. Science, 347(6218), January 2015.
  • Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, page eaao1733, December 2017.
  • Noam Brown and Tuomas Sandholm. Solving imperfect-information games via discounted regret minimization. In AAAI Conference on Artificial Intelligence (AAAI), 2019.
  • Noam Brown and Tuomas Sandholm. Superhuman AI for multiplayer poker. Science, 365(6456):885–890, 2019.
  • Noam Brown, Christian Kroer, and Tuomas Sandholm. Dynamic thresholding and pruning for regret minimization. In AAAI Conference on Artificial Intelligence (AAAI), 2017.
  • Neil Burch. Time and Space: Why Imperfect Information Games are Hard. PhD thesis, University of Alberta, 2018.
  • Neil Burch, Matej Moravčík, and Martin Schmid. Revisiting CFR+ and alternating updates. Journal of Artificial Intelligence Research, 64:429–443, 2019.
  • Chao-Kai Chiang, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, and Shenghuo Zhu. Online optimization with gradual variations. In Conference on Learning Theory, 2012.
  • Gabriele Farina, Christian Kroer, and Tuomas Sandholm. Online convex optimization for sequential decision processes and extensive-form games. arXiv preprint, 2018.
  • Gabriele Farina, Christian Kroer, Noam Brown, and Tuomas Sandholm. Stable-predictive optimistic counterfactual regret minimization. In International Conference on Machine Learning (ICML), 2019.
  • Gabriele Farina, Christian Kroer, and Tuomas Sandholm. Online convex optimization for sequential decision processes and extensive-form games. In AAAI Conference on Artificial Intelligence, 2019.
  • Gabriele Farina, Christian Kroer, and Tuomas Sandholm. Optimistic regret minimization for extensive-form games via dilated distance-generating functions. In Advances in Neural Information Processing Systems, pages 5222–5232, 2019.
  • Gabriele Farina, Christian Kroer, and Tuomas Sandholm. Regret circuits: Composability of regret minimizers. In International Conference on Machine Learning, pages 1863–1872, 2019.
  • Gabriele Farina, Chun Kai Ling, Fei Fang, and Tuomas Sandholm. Correlation in extensive-form games: Saddle-point formulation and benchmarks. In Conference on Neural Information Processing Systems (NeurIPS), 2019.
  • Gabriele Farina, Christian Kroer, and Tuomas Sandholm. Stochastic regret minimization in extensive-form games. arXiv preprint arXiv:2002.08493, 2020.
  • Dean P Foster. A proof of calibration via Blackwell's approachability theorem. Games and Economic Behavior, 29(1-2):73–78, 1999.
  • Yuan Gao, Christian Kroer, and Donald Goldfarb. Increasing iterate averaging for solving saddle-point problems. arXiv preprint arXiv:1903.10646, 2019.
  • Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68:1127–1150, 2000.
  • Samid Hoda, Andrew Gilpin, Javier Peña, and Tuomas Sandholm. Smoothing techniques for computing Nash equilibria of sequential games. Mathematics of Operations Research, 35(2), 2010.
  • Christian Kroer, Gabriele Farina, and Tuomas Sandholm. Robust Stackelberg equilibria in extensive-form games and extension to limited lookahead. In AAAI Conference on Artificial Intelligence (AAAI), 2018.
  • Christian Kroer, Gabriele Farina, and Tuomas Sandholm. Solving large sequential games with the excessive gap technique. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2018.
  • Christian Kroer, Kevin Waugh, Fatma Kılınç-Karzan, and Tuomas Sandholm. Faster algorithms for extensive-form game solving via improved smoothing functions. Mathematical Programming, 2020.
  • H. W. Kuhn. A simplified two-person poker. In H. W. Kuhn and A. W. Tucker, editors, Contributions to the Theory of Games, volume 1 of Annals of Mathematics Studies, 24, pages 97–103. Princeton University Press, Princeton, New Jersey, 1950.
  • Marc Lanctot, Kevin Waugh, Martin Zinkevich, and Michael Bowling. Monte Carlo sampling for regret minimization in extensive games. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2009.
  • Viliam Lisý, Marc Lanctot, and Michael Bowling. Online Monte Carlo counterfactual regret minimization for search in imperfect information games. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 27–36, 2015.
  • Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, May 2017.
  • Yurii Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, 2009.
  • Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In Conference on Learning Theory, pages 993–1019, 2013.
  • Sasha Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems, pages 3066–3074, 2013.
  • Sheldon M Ross. Goofspiel—the game of pure strategy. Journal of Applied Probability, 8(3):621–625, 1971.
  • Shai Shalev-Shwartz and Yoram Singer. A primal-dual perspective of online learning algorithms. Machine Learning, 69(2-3):115–142, 2007.
  • Finnegan Southey, Michael Bowling, Bryce Larson, Carmelo Piccione, Neil Burch, Darse Billings, and Chris Rayner. Bayes' bluff: Opponent modelling in poker. In Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence (UAI), July 2005.
  • Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E Schapire. Fast convergence of regularized learning in games. In Advances in Neural Information Processing Systems, pages 2989–2997, 2015.
  • Oskari Tammelin. Solving large imperfect information games using CFR+. arXiv preprint arXiv:1407.5042, 2014.
  • Bernhard von Stengel. Efficient computation of behavior strategies. Games and Economic Behavior, 14(2):220–246, 1996.
  • Kevin Waugh and Drew Bagnell. A unified view of large-scale zero-sum equilibrium computation. In Computer Poker and Imperfect Information Workshop at the AAAI Conference on Artificial Intelligence (AAAI), 2015.
  • Martin Zinkevich, Michael Bowling, Michael Johanson, and Carmelo Piccione. Regret minimization in games with incomplete information. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2007.