# Faster Game Solving via Predictive Blackwell Approachability: Connecting Regret Matching and Mirror Descent

Abstract:

Blackwell approachability is a framework for reasoning about repeated games with vector-valued payoffs. We introduce predictive Blackwell approachability, where an estimate of the next payoff vector is given, and the decision maker tries to achieve better performance based on the accuracy of that estimator. In order to derive algorithms...
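As background for the abstract: regret matching, the algorithm the paper reinterprets through the lens of Blackwell approachability, can be sketched in a few lines. This is a minimal self-play illustration on an arbitrary 2x2 matrix game, not the paper's implementation:

```python
import numpy as np

def regret_matching(payoff, iters=20000):
    """Self-play regret matching on a two-player zero-sum matrix game.

    The row player maximizes x^T A y and the column player minimizes it.
    Each player plays proportionally to its positive cumulative regrets
    (uniformly if all regrets are nonpositive); the *average* strategies
    converge to a Nash equilibrium.
    """
    A = np.asarray(payoff, dtype=float)
    n, m = A.shape
    Rx, Ry = np.zeros(n), np.zeros(m)      # cumulative regret vectors
    x_sum, y_sum = np.zeros(n), np.zeros(m)

    def to_strategy(R, k):
        Rp = np.maximum(R, 0.0)
        s = Rp.sum()
        return Rp / s if s > 0 else np.full(k, 1.0 / k)

    for _ in range(iters):
        x, y = to_strategy(Rx, n), to_strategy(Ry, m)
        ux = A @ y                # row player's utility of each pure action
        uy = -(A.T @ x)           # column player's utility of each pure action
        Rx += ux - x @ ux         # regret vs. the mixed strategy actually played
        Ry += uy - y @ uy
        x_sum += x
        y_sum += y
    return x_sum / iters, y_sum / iters

# Example game (not from the paper): the unique equilibrium of
# A = [[2, -1], [-1, 1]] is x* = y* = (0.4, 0.6).
x_bar, y_bar = regret_matching([[2, -1], [-1, 1]])
```

The paper's contribution is to show that this procedure is exactly FTRL applied to choosing which halfspace to force in the underlying Blackwell approachability game, which then yields predictive variants.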

Introduction

- Extensive-form games (EFGs) are the standard class of games that can be used to model sequential interaction, outcome uncertainty, and imperfect information.
- Operationalizing these models requires algorithms for computing game-theoretic equilibria.
- A notable success of EFGs is the use of Nash equilibrium in several recent poker AI milestones, such as essentially solving the game of limit Texas hold’em [6] and beating top human poker professionals in no-limit Texas hold’em with the Libratus AI [7].

Highlights

- We showed that regret matching (RM) and RM+ are the algorithms that result from running follow-the-regularized-leader (FTRL) and online mirror descent (OMD), respectively, to select the halfspace to force at each step of the underlying Blackwell approachability game
- We introduced the notion of predictive Blackwell approachability
- We showed that predictive FTRL and OMD can be applied to the unbounded setting of halfspace selection
- OMD applied to the same problem turns out to be equivalent to RM+, which is vastly faster than RM in practice
- Combining predictive regret matching+ (PRM+) with counterfactual regret minimization (CFR), we introduced the PCFR+ algorithm for solving EFGs
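The PRM+ update in the highlights can be sketched as follows. This is our reading of the rule (the strategy is computed from the thresholded accumulated regrets plus a prediction of the next instantaneous regret, while the state is updated as in RM+); the numeric values below are hypothetical:

```python
import numpy as np

def prm_plus_strategy(R, m):
    """Predictive RM+ strategy: threshold the accumulated regrets plus
    the predicted next instantaneous regret, then normalize
    (uniform if every entry is nonpositive).

    R : nonnegative cumulative regret vector (the RM+ state)
    m : prediction of the next instantaneous regret vector
    """
    z = np.maximum(R + m, 0.0)
    s = z.sum()
    return z / s if s > 0 else np.full(len(R), 1.0 / len(R))

def prm_plus_update(R, r):
    """RM+-style state update: accumulate the observed regret, clip at zero."""
    return np.maximum(R + r, 0.0)

# One step with the common prediction choice m^t = r^{t-1} (made-up numbers):
R = np.zeros(3)
r_prev = np.array([0.2, -0.1, 0.4])
x = prm_plus_strategy(R, r_prev)   # plays proportionally to [0.2, 0, 0.4]
R = prm_plus_update(R, r_prev)
```

With m = 0 this reduces to plain RM+, which is how the paper recovers the non-predictive algorithm as a special case.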

Methods

- The authors conduct experiments on solving two-player zero-sum games. As mentioned previously, for EFGs the CFR framework is used for decomposing regrets into local regret minimization problems at each simplex corresponding to a decision point in the game [42, 16], and the authors do the same.
- The average strategy x̄^T is computed by weighting iterate x^t by t² (x̄^T ∝ Σ_t t² x^t), and the authors use alternating updates
- The authors call this algorithm PCFR+.
- The experiments shown in the main body are representative of those in the appendix.
- For all non-predictive algorithms (CFR+, LCFR, and DCFR), the authors set the prediction m^t = 0.
- Both y-axes in the convergence plots are on a log scale
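The quadratically weighted iterate averaging mentioned above can be computed incrementally; a minimal sketch assuming the output strategy is x̄^T = (Σ_t t² x^t) / (Σ_t t²):

```python
import numpy as np

def quadratic_average(strategies):
    """Average iterates x^1, ..., x^T with weights t^2:
    xbar = (sum_t t^2 * x^t) / (sum_t t^2).
    Later iterates dominate the average, which empirically speeds up
    convergence of the output strategy."""
    num = np.zeros_like(np.asarray(strategies[0], dtype=float))
    denom = 0.0
    for t, x in enumerate(strategies, start=1):
        num += (t * t) * np.asarray(x, dtype=float)
        denom += t * t
    return num / denom

# With three iterates the weights are 1 : 4 : 9 (example values):
xbar = quadratic_average([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
```

In practice the weighted sum is maintained incrementally across iterations rather than stored as a list, but the resulting average is the same.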

Conclusion

**Conclusions and Future Research**

- The authors introduced the notion of predictive Blackwell approachability.
- The authors showed that predictive FTRL and OMD can be applied to this unbounded setting.
- This extended reduction allowed them to show that FTRL applied to the decision of which halfspace to force in Blackwell approachability is equivalent to the regret matching algorithm.
- The authors showed that the predictive variants of FTRL and OMD yield predictive algorithms for Blackwell approachability, as well as predictive variants of RM and RM+.
- Can PRM+ guarantee T^(-1) convergence on matrix games, as optimistic FTRL and OMD do, or do its less stable updates prevent that? Can one develop a predictive variant of DCFR, which is faster on poker domains? Can one combine DCFR and PCFR+, so that DCFR is faster initially but PCFR+ overtakes it later? If the cross-over point could be approximated, this might yield a best-of-both-worlds algorithm.


Funding

- This material is based on work supported by the National Science Foundation under grants IIS-1718457, IIS-1617590, IIS-1901403, and CCF-1733556, and the ARO under awards W911NF-17-1-0082 and W911NF-20-1-0081.
- Gabriele Farina is supported by a Facebook fellowship

Reference

- Jacob Abernethy, Peter L Bartlett, and Elad Hazan. Blackwell approachability and no-regret learning are equivalent. In COLT, pages 27–46, 2011.
- David Blackwell. Controlled random walks. In Proceedings of the international congress of mathematicians, volume 3, pages 336–338, 1954.
- David Blackwell. An analog of the minmax theorem for vector payoffs. Pacific Journal of Mathematics, 6:1–8, 1956.
- Branislav Bošanský, Christopher Kiekintveld, Viliam Lisý, and Michal Pěchouček. An exact double-oracle algorithm for zero-sum extensive-form games with imperfect information. Journal of Artificial Intelligence Research, pages 829–866, 2014.
- Branislav Bošanský and Jiří Čermák. Sequence-form algorithm for computing Stackelberg equilibria in extensive-form games. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
- Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up limit hold’em poker is solved. Science, 347(6218), January 2015.
- Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, page eaao1733, Dec. 2017.
- Noam Brown and Tuomas Sandholm. Solving imperfect-information games via discounted regret minimization. In AAAI Conference on Artificial Intelligence (AAAI), 2019.
- Noam Brown and Tuomas Sandholm. Superhuman AI for multiplayer poker. Science, 365 (6456):885–890, 2019.
- Noam Brown, Christian Kroer, and Tuomas Sandholm. Dynamic thresholding and pruning for regret minimization. In AAAI Conference on Artificial Intelligence (AAAI), 2017.
- Neil Burch. Time and space: Why imperfect information games are hard. 2018.
- Neil Burch, Matej Moravcik, and Martin Schmid. Revisiting CFR+ and alternating updates. Journal of Artificial Intelligence Research, 64:429–443, 2019.
- Chao-Kai Chiang, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, and Shenghuo Zhu. Online optimization with gradual variations. In Conference on Learning Theory, pages 6–1, 2012.
- Gabriele Farina, Christian Kroer, and Tuomas Sandholm. Online convex optimization for sequential decision processes and extensive-form games. In arXiv, 2018.
- Gabriele Farina, Christian Kroer, Noam Brown, and Tuomas Sandholm. Stable-predictive optimistic counterfactual regret minimization. In International Conference on Machine Learning (ICML), 2019.
- Gabriele Farina, Christian Kroer, and Tuomas Sandholm. Online convex optimization for sequential decision processes and extensive-form games. In AAAI Conference on Artificial Intelligence, 2019.
- Gabriele Farina, Christian Kroer, and Tuomas Sandholm. Optimistic regret minimization for extensive-form games via dilated distance-generating functions. In Advances in Neural Information Processing Systems, pages 5222–5232, 2019.
- Gabriele Farina, Christian Kroer, and Tuomas Sandholm. Regret circuits: Composability of regret minimizers. In International Conference on Machine Learning, pages 1863–1872, 2019.
- Gabriele Farina, Chun Kai Ling, Fei Fang, and Tuomas Sandholm. Correlation in extensive-form games: Saddle-point formulation and benchmarks. In Conference on Neural Information Processing Systems (NeurIPS), 2019.
- Gabriele Farina, Christian Kroer, and Tuomas Sandholm. Stochastic regret minimization in extensive-form games. arXiv preprint arXiv:2002.08493, 2020.
- Dean P Foster. A proof of calibration via Blackwell’s approachability theorem. Games and Economic Behavior, 29(1-2):73–78, 1999.
- Yuan Gao, Christian Kroer, and Donald Goldfarb. Increasing iterate averaging for solving saddle-point problems. arXiv preprint arXiv:1903.10646, 2019.
- Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68:1127–1150, 2000.
- Samid Hoda, Andrew Gilpin, Javier Peña, and Tuomas Sandholm. Smoothing techniques for computing Nash equilibria of sequential games. Mathematics of Operations Research, 35(2), 2010.
- Christian Kroer, Gabriele Farina, and Tuomas Sandholm. Robust stackelberg equilibria in extensive-form games and extension to limited lookahead. In AAAI Conference on Artificial Intelligence (AAAI), 2018.
- Christian Kroer, Gabriele Farina, and Tuomas Sandholm. Solving large sequential games with the excessive gap technique. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2018.
- Christian Kroer, Kevin Waugh, Fatma Kılınç-Karzan, and Tuomas Sandholm. Faster algorithms for extensive-form game solving via improved smoothing functions. Mathematical Programming, 2020.
- H. W. Kuhn. A simplified two-person poker. In H. W. Kuhn and A. W. Tucker, editors, Contributions to the Theory of Games, volume 1 of Annals of Mathematics Studies, 24, pages 97–103. Princeton University Press, Princeton, New Jersey, 1950.
- Marc Lanctot, Kevin Waugh, Martin Zinkevich, and Michael Bowling. Monte Carlo sampling for regret minimization in extensive games. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2009.
- Viliam Lisy, Marc Lanctot, and Michael Bowling. Online Monte Carlo counterfactual regret minimization for search in imperfect information games. In Proceedings of the 2015 international conference on autonomous agents and multiagent systems, pages 27–36, 2015.
- Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, May 2017.
- Yurii Nesterov. Primal-dual subgradient methods for convex problems. Mathematical programming, 120(1):221–259, 2009.
- Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In Conference on Learning Theory, pages 993–1019, 2013.
- Sasha Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems, pages 3066–3074, 2013.
- Sheldon M Ross. Goofspiel—the game of pure strategy. Journal of Applied Probability, 8(3): 621–625, 1971.
- Shai Shalev-Shwartz and Yoram Singer. A primal-dual perspective of online learning algorithms. Machine Learning, 69(2-3):115–142, 2007.
- Finnegan Southey, Michael Bowling, Bryce Larson, Carmelo Piccione, Neil Burch, Darse Billings, and Chris Rayner. Bayes’ bluff: Opponent modelling in poker. In Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence (UAI), July 2005.
- Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E Schapire. Fast convergence of regularized learning in games. In Advances in Neural Information Processing Systems, pages 2989–2997, 2015.
- Oskari Tammelin. Solving large imperfect information games using CFR+. arXiv preprint arXiv:1407.5042, 2014.
- Bernhard von Stengel. Efficient computation of behavior strategies. Games and Economic Behavior, 14(2):220–246, 1996.
- Kevin Waugh and Drew Bagnell. A unified view of large-scale zero-sum equilibrium computation. In Computer Poker and Imperfect Information Workshop at the AAAI Conference on Artificial Intelligence (AAAI), 2015.
- Martin Zinkevich, Michael Bowling, Michael Johanson, and Carmelo Piccione. Regret minimization in games with incomplete information. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2007.
