# Stochastic Regret Minimization in Extensive-Form Games

ICML, pp. 3018-3028, 2020.

Abstract:

Monte-Carlo counterfactual regret minimization (MCCFR) is the state-of-the-art algorithm for solving sequential games that are too large for full tree traversals. It works by using gradient estimates that can be computed via sampling. However, stochastic methods for sequential games have not been investigated extensively beyond MCCFR.

Introduction

- Extensive-form games (EFGs) are a broad class of games that can model sequential and simultaneous moves, outcome uncertainty, and imperfect information.
- The game tree is only accessed for computing gradients, which can be done via a single tree traversal that does not require an explicit tree representation; sometimes game structure can be exploited to speed this up further (Johanson et al., 2011).
- These gradients are used to update the strategy iterates.
- Eventually, even these gradient-based methods, which require traversing the entire game tree, become too expensive.
- This was seen in two recent superhuman poker AIs: Libratus (Brown & Sandholm, 2017) and Pluribus (Brown & Sandholm, 2019b).
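The gradient-driven update loop described above can be sketched in miniature. This is our own illustration, not the paper's code: regret matching (Hart & Mas-Colell, 2000) on a tiny zero-sum matrix game stands in for the sequence-form strategy space of an EFG, and a matrix-vector product plays the role of the gradient-computing tree traversal.

```python
import numpy as np

# Illustrative sketch only (not the authors' code): regret matching on a
# tiny zero-sum matrix game. The matrix-vector products below play the
# role of the gradient-computing tree traversal in a full EFG.
A = np.array([[3.0, -1.0],
              [-1.0, 1.0]])  # payoffs to Player 1; equilibrium mix is (1/3, 2/3)

def regret_matching(regrets):
    """Turn cumulative regrets into a strategy (Hart & Mas-Colell, 2000)."""
    pos = np.maximum(regrets, 0.0)
    if pos.sum() > 0:
        return pos / pos.sum()
    return np.full(regrets.size, 1.0 / regrets.size)

T = 20000
r1 = np.zeros(2); r2 = np.zeros(2)      # cumulative regrets
avg1 = np.zeros(2); avg2 = np.zeros(2)  # running strategy sums

for _ in range(T):
    x, y = regret_matching(r1), regret_matching(r2)
    g1 = A @ y            # gradient for Player 1: value of each pure action vs. y
    g2 = -A.T @ x         # gradient for Player 2 (zero-sum)
    r1 += g1 - x @ g1     # regret of each action relative to the current mix
    r2 += g2 - y @ g2
    avg1 += x; avg2 += y

avg1 /= T; avg2 /= T      # average strategies approach a Nash equilibrium
# Exploitability of the average strategies; shrinks as O(1/sqrt(T)).
saddle_gap = (A @ avg2).max() - (avg1 @ A).min()
```

The point of the sketch is that the game only enters through the two gradient computations; everything else is a generic online update, which is what makes swapping in sampled gradients possible.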

Highlights

- Extensive-form games (EFGs) are a broad class of games that can model sequential and simultaneous moves, outcome uncertainty, and imperfect information
- A zero-sum EFG can be solved in polynomial time using a linear program whose size is linear in the size of the game tree.
- We introduced a new framework for constructing stochastic regret-minimization methods for solving zero-sum games
- This framework completely decouples the choice of regret minimizer and gradient estimator, allowing any regret minimizer to be coupled with any gradient estimator
- Our framework yields a streamlined and dramatically simpler proof of MCCFR. It also gives a significantly stronger bound on the convergence rate of the MCCFR algorithm: with probability 1 − p the regret grows as O(√(T log(1/p))) instead of O(√(T/p)) as in the original analysis, an exponentially tighter dependence on 1/p.
- Due to its modular nature, our framework opens the door to many possible future research questions around stochastic methods for solving Extensive-form games
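The decoupling can be illustrated with the simplest possible gradient estimator. This is our own toy example, not the paper's construction: in a matrix game, sampling a single opponent action j ~ y and returning the column A[:, j] is an unbiased estimate of the exact gradient A·y, and a regret minimizer can consume such estimates in place of exact gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy illustration of an unbiased gradient estimator (our sketch, not the
# paper's construction). In a matrix game, the exact gradient for Player 1
# is A @ y; sampling one opponent action j ~ y and using the column A[:, j]
# estimates it without "traversing" the rest of the game.
A = np.array([[3.0, -1.0],
              [-1.0, 1.0]])
y = np.array([0.25, 0.75])               # opponent's current strategy

exact = A @ y                            # full traversal: all actions weighted

n = 200_000
cols = rng.choice(len(y), size=n, p=y)   # n independent samples of j ~ y
estimate = A[:, cols].mean(axis=1)       # Monte-Carlo average of sampled columns

# E[A[:, j]] = sum_j y_j * A[:, j] = A @ y, so `estimate` converges to `exact`.
```

Because the minimizer never sees anything but a gradient vector, any estimator with this unbiasedness property can be paired with any regret minimizer, which is the modularity the framework formalizes.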

Methods

- The authors perform numerical simulations to investigate the practical performance of several stochastic regret-minimization algorithms.
- For each algorithm and game pair, the authors show only the best-performing of the four tested stepsizes in their plots.
- Goofspiel: the variant of Goofspiel (Ross, 1971) used in the experiments is a two-player card game employing three identical decks of 4 cards each.
- This game has 54,421 nodes and 21,329 sequences per player.
- A second game used in the experiments has 732,607 nodes, 73,130 sequences for Player 1, and 253,940 sequences for Player 2.

Results

- OMD seems to be more sensitive to the stepsize; still, it performs significantly better on Battleship.
- In Goofspiel, MCCFR performs significantly better than both FTRL and OMD.
- In Search-5, MCCFR performs significantly better than FTRL and OMD, though FTRL seems to catch up in later iterations.

Conclusion

- The authors introduced a new framework for constructing stochastic regret-minimization methods for solving zero-sum games.
- The authors' framework yields a streamlined and dramatically simpler proof of MCCFR.
- Due to its modular nature, the framework opens the door to many possible future research questions around stochastic methods for solving EFGs. Among the most promising are methods for controlling the stepsize in, for instance, FTRL or OMD, as well as instantiating the framework with other regret minimizers.

References

- Abernethy, J. D. and Rakhlin, A. Beating the adaptive bandit with high probability. 2009 Information Theory and Applications Workshop, 2009.
- Archibald, C. and Shoham, Y. Modeling billiards games. In International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), Budapest, Hungary, 2009.
- Azuma, K. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, 19(3):357–367, 1967.
- Bartlett, P. L., Dani, V., Hayes, T., Kakade, S., Rakhlin, A., and Tewari, A. High-probability regret bounds for bandit online linear optimization. In Conference on Learning Theory (COLT), 2008.
- Blackwell, D. and Freedman, D. On the amount of variance needed to escape from a strip. The Annals of Probability, pp. 772–787, 1973.
- Bošansky, B. and Cermák, J. Sequence-form algorithm for computing Stackelberg equilibria in extensive-form games. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
- Bošansky, B., Kiekintveld, C., Lisý, V., and Pechoucek, M. An exact double-oracle algorithm for zero-sum extensive-form games with imperfect information. Journal of Artificial Intelligence Research, pp. 829–866, 2014.
- Brown, N. and Sandholm, T. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, pp. eaao1733, Dec. 2017.
- Brown, N. and Sandholm, T. Solving imperfect-information games via discounted regret minimization. In AAAI Conference on Artificial Intelligence (AAAI), 2019a.
- Brown, N. and Sandholm, T. Superhuman AI for multiplayer poker. Science, 365(6456):885–890, 2019b.
- Chen, K. and Bowling, M. Tractable objectives for robust policy optimization. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2012.
- Chen, X., Han, Z., Zhang, H., Xue, G., Xiao, Y., and Bennis, M. Wireless resource scheduling in virtualized radio access networks using stochastic learning. IEEE Transactions on Mobile Computing, (1):1–1, 2018.
- DeBruhl, B., Kroer, C., Datta, A., Sandholm, T., and Tague, P. Power napping with loud neighbors: optimal energy-constrained jamming and anti-jamming. In Proceedings of the 2014 ACM conference on Security and privacy in wireless & mobile networks, pp. 117–128. ACM, 2014.
- Farina, G., Kroer, C., and Sandholm, T. Online convex optimization for sequential decision processes and extensive-form games. In AAAI Conference on Artificial Intelligence, 2019a.
- Farina, G., Kroer, C., and Sandholm, T. Regret circuits: Composability of regret minimizers. In International Conference on Machine Learning, pp. 1863–1872, 2019b.
- Farina, G., Ling, C. K., Fang, F., and Sandholm, T. Correlation in extensive-form games: Saddle-point formulation and benchmarks. In Conference on Neural Information Processing Systems (NeurIPS), 2019c.
- Freedman, D. A. On tail probabilities for martingales. The Annals of Probability, 3(1):100–118, 02 1975.
- Gibson, R., Lanctot, M., Burch, N., Szafron, D., and Bowling, M. Generalized sampling and variance in counterfactual regret minimization. In AAAI Conference on Artificial Intelligence (AAAI), 2012.
- Hart, S. and Mas-Colell, A. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68: 1127–1150, 2000.
- Hoda, S., Gilpin, A., Peña, J., and Sandholm, T. Smoothing techniques for computing Nash equilibria of sequential games. Mathematics of Operations Research, 35(2), 2010.
- Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
- Johanson, M., Waugh, K., Bowling, M., and Zinkevich, M. Accelerating best response calculation in large extensive games. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2011.
- Koller, D., Megiddo, N., and von Stengel, B. Efficient computation of equilibria for extensive two-person games. Games and Economic Behavior, 14(2), 1996.
- Kroer, C., Waugh, K., Kılınç-Karzan, F., and Sandholm, T. Faster first-order methods for extensive-form game solving. In Proceedings of the ACM Conference on Economics and Computation (EC), 2015.
- Kroer, C., Farina, G., and Sandholm, T. Robust Stackelberg equilibria in extensive-form games and extension to limited lookahead. In AAAI Conference on Artificial Intelligence (AAAI), 2018.
- Kroer, C., Waugh, K., Kılınç-Karzan, F., and Sandholm, T. Faster algorithms for extensive-form game solving via improved smoothing functions. Mathematical Programming, 2020.
- Lanctot, M., Waugh, K., Zinkevich, M., and Bowling, M. Monte Carlo sampling for regret minimization in extensive games. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2009.
- Lisý, V., Davis, T., and Bowling, M. Counterfactual regret minimization in sequential security games. In AAAI Conference on Artificial Intelligence (AAAI), 2016.
- McDiarmid, C. Concentration, pp. 195–248. Springer Berlin Heidelberg, Berlin, Heidelberg, 1998. ISBN 978-3-662-12788-9. doi: 10.1007/978-3-662-12788-9_6.
- Moravcík, M., Schmid, M., Burch, N., Lisý, V., Morrill, D., Bard, N., Davis, T., Waugh, K., Johanson, M., and Bowling, M. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, May 2017.
- Munoz de Cote, E., Stranders, R., Basilico, N., Gatti, N., and Jennings, N. Introducing alarms in adversarial patrolling games. In Proceedings of the 2013 international conference on Autonomous agents and multi-agent systems, pp. 1275–1276. International Foundation for Autonomous Agents and Multiagent Systems, 2013.
- Orabona, F. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2019.
- Romanovskii, I. Reduction of a game with complete memory to a matrix game. Soviet Mathematics, 3, 1962.
- Ross, S. M. Goofspiel—the game of pure strategy. Journal of Applied Probability, 8(3):621–625, 1971.
- Sandholm, T. The state of solving large incomplete-information games, and application to poker. AI Magazine, 2010. Special issue on Algorithmic Game Theory.
- Schmid, M., Burch, N., Lanctot, M., Moravcik, M., Kadlec, R., and Bowling, M. Variance reduction in Monte Carlo counterfactual regret minimization (VR-MCCFR) for extensive-form games using baselines. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 2157–2164, 2019.
- Shalev-Shwartz, S. and Singer, Y. A primal-dual perspective of online learning algorithms. Machine Learning, 69(2-3): 115–142, 2007.
- Southey, F., Bowling, M., Larson, B., Piccione, C., Burch, N., Billings, D., and Rayner, C. Bayes’ bluff: Opponent modelling in poker. In Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence (UAI), July 2005.
- Tammelin, O., Burch, N., Johanson, M., and Bowling, M. Solving heads-up limit Texas hold’em. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), 2015.
- von Stengel, B. Efficient computation of behavior strategies. Games and Economic Behavior, 14(2):220–246, 1996.
- Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning (ICML), pp. 928–936, Washington, DC, USA, 2003.
- Zinkevich, M., Bowling, M., Johanson, M., and Piccione, C. Regret minimization in games with incomplete information. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2007.
For completeness, we show a proof of Proposition 1. As mentioned, it is an application of the Azuma-Hoeffding inequality for martingale difference sequences, which we now state (see, e.g., Theorem 3.14 of McDiarmid (1998) for a proof).
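The statement itself is cut off in this excerpt. For reference, a standard form of the Azuma-Hoeffding inequality (McDiarmid (1998) may state it with different constants) reads: let X_1, …, X_T be a martingale difference sequence with |X_t| ≤ c_t almost surely; then for every ε > 0,

$$
\Pr\left[\sum_{t=1}^{T} X_t \ge \varepsilon\right] \le \exp\left(-\frac{\varepsilon^2}{2\sum_{t=1}^{T} c_t^2}\right).
$$

Taking ε proportional to √(T log(1/p)) makes the right-hand side equal to p, which is exactly the shape of the high-probability regret bound claimed in the Highlights.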
