Stochastic Regret Minimization in Extensive-Form Games

In Proceedings of the International Conference on Machine Learning (ICML), pp. 3018–3028, 2020.


Abstract:

Monte-Carlo counterfactual regret minimization (MCCFR) is the state-of-the-art algorithm for solving sequential games that are too large for full tree traversals. It works by using gradient estimates that can be computed via sampling. However, stochastic methods for sequential games have not been investigated extensively beyond MCCFR. […]

Introduction
  • Extensive-form games (EFGs) are a broad class of games that can model sequential and simultaneous moves, outcome uncertainty, and imperfect information.
  • In gradient-based solution methods, the game tree is only accessed for computing gradients, which can be done via a single tree traversal that does not require an explicit tree representation; sometimes game structure can be exploited to speed this up further (Johanson et al., 2011)
  • These gradients are used to update the strategy iterates (a minimal sketch of this loop follows the list).
  • Eventually, even these gradient-based methods, which require traversing the entire game tree, become too expensive
  • This was seen in two recent superhuman poker AIs: Libratus (Brown & Sandholm, 2017) and Pluribus (Brown & Sandholm, 2019b).
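As a concrete picture of the loop these bullets describe, here is a minimal sketch in Python. The recommend()/observe() interface, the gradient oracle, and all names are illustrative assumptions, not the paper's code; gradient() may perform a full tree traversal or return a sampled estimate.

```python
def solve(minimizer_x, minimizer_y, gradient, T):
    """Self-play between two regret minimizers (illustrative sketch).

    Assumed (hypothetical) interface: recommend() returns the next strategy
    as a list of floats; observe(g) feeds back a loss gradient; and
    gradient(player, x, y) touches the game tree once, exactly or via
    sampling, to produce that gradient.
    """
    avg_x, avg_y = None, None
    for t in range(1, T + 1):
        x = minimizer_x.recommend()
        y = minimizer_y.recommend()
        minimizer_x.observe(gradient("x", x, y))  # one traversal (or estimate)
        minimizer_y.observe(gradient("y", x, y))
        # Uniform averaging of the iterates; in zero-sum games it is the
        # average strategy profile that converges to an approximate equilibrium.
        avg_x = x if avg_x is None else [(t - 1) / t * a + b / t for a, b in zip(avg_x, x)]
        avg_y = y if avg_y is None else [(t - 1) / t * a + b / t for a, b in zip(avg_y, y)]
    return avg_x, avg_y
```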
Highlights
  • Extensive-form games (EFGs) are a broad class of games that can model sequential and simultaneous moves, outcome uncertainty, and imperfect information
  • A zero-sum EFG can be solved in polynomial time using a linear program whose size is linear in the size of the game tree
  • We introduced a new framework for constructing stochastic regret-minimization methods for solving zero-sum games
  • This framework completely decouples the choice of regret minimizer from the choice of gradient estimator, allowing any regret minimizer to be coupled with any gradient estimator (an interface sketch follows this list)
  • Our framework yields a streamlined and dramatically simpler proof of MCCFR. It immediately gives a significantly stronger bound on the convergence rate of the MCCFR algorithm, whereby with probability 1 − p the regret grows as O(√(T log(1/p))) instead of O(√T / p) as in the original analysis, an exponentially tighter dependence on 1/p
  • Due to its modular nature, our framework opens the door to many possible future research questions around stochastic methods for solving EFGs
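A minimal sketch of what this decoupling could look like as code, with hypothetical class and method names. In this framing, an MCCFR-style method is recovered by pairing a CFR-style regret minimizer with a sampling-based estimator, while FTRL or OMD can be paired with the same estimators.

```python
from abc import ABC, abstractmethod

class RegretMinimizer(ABC):
    """Any regret minimizer: regret matching / CFR, FTRL, OMD, ..."""
    @abstractmethod
    def recommend(self):
        """Return the next strategy iterate."""
    @abstractmethod
    def observe(self, gradient):
        """Incorporate a gradient, exact or estimated."""

class GradientEstimator(ABC):
    """Any gradient estimator: exact traversal, outcome/external sampling, ..."""
    @abstractmethod
    def estimate(self, player, x, y):
        """Return a (possibly stochastic) estimate of the loss gradient."""

def step(minimizer: RegretMinimizer, estimator: GradientEstimator, player, x, y):
    # The single coupling point: the minimizer never needs to know whether
    # the gradient it observes is exact or a sampled estimate.
    minimizer.observe(estimator.estimate(player, x, y))
```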
Methods
  • The authors perform numerical simulations to investigate the practical performance of several stochastic regret-minimization algorithms.
  • For each algorithm and game pair, the authors show only the best-performing of the four stepsizes they tried in the paper's plots.
  • Goofspiel: the variant of Goofspiel (Ross, 1971) used in the experiments is a two-player card game employing three identical decks of 4 cards each.
  • This game has 54,421 nodes and 21,329 sequences per player.
  • Another game in the benchmark set has 732,607 nodes, 73,130 sequences for Player 1, and 253,940 sequences for Player 2
Results
  • Although OMD seems to be more sensitive to stepsize, it performs significantly better on Battleship (simplified FTRL and OMD updates are sketched after this list).
  • In Goofspiel, MCCFR performs significantly better than both FTRL and OMD.
  • In Search-5, MCCFR performs significantly better than FTRL and OMD, though FTRL seems to be catching up in later iterations
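The stepsize sensitivity above is easiest to see in the update rules themselves. The paper runs FTRL and OMD over sequence-form strategy spaces; as a simplified illustration (an assumption of this sketch, not the paper's implementation), here are the entropy-regularized versions restricted to a probability simplex, where the stepsize eta controls how aggressively the iterate reacts to gradients.

```python
import numpy as np

def ftrl_iterate(cum_grad, eta):
    """Entropy-regularized FTRL on the simplex: the iterate is a softmax of
    the (negated, scaled) cumulative loss gradient."""
    z = -eta * cum_grad
    z -= z.max()                      # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def omd_iterate(prev, grad, eta):
    """Entropy-regularized OMD (multiplicative weights): the previous iterate
    is reweighted locally by the latest gradient."""
    p = prev * np.exp(-eta * grad)
    return p / p.sum()
```

With an entropy regularizer, FTRL recomputes its iterate from the full gradient history while OMD moves the previous iterate locally; this is one intuition, consistent with the observation above, for why OMD's behavior can depend more delicately on the stepsize.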
Conclusion
  • The authors introduced a new framework for constructing stochastic regret-minimization methods for solving zero-sum games.
  • The authors' framework yields a streamlined and dramatically simpler proof of MCCFR.
  • Due to its modular nature, the framework opens the door to many possible future research questions around stochastic methods for solving EFGs. Among the most promising are methods for controlling the stepsize in, for instance, FTRL or OMD, as well as instantiating the framework with other regret minimizers.
References
  • Abernethy, J. D. and Rakhlin, A. Beating the adaptive bandit with high probability. In 2009 Information Theory and Applications Workshop, 2009.
  • Archibald, C. and Shoham, Y. Modeling billiards games. In International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), Budapest, Hungary, 2009.
  • Azuma, K. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, 19(3):357–367, 1967.
  • Bartlett, P. L., Dani, V., Hayes, T., Kakade, S., Rakhlin, A., and Tewari, A. High-probability regret bounds for bandit online linear optimization. In Conference on Learning Theory (COLT), 2008.
  • Blackwell, D. and Freedman, D. On the amount of variance needed to escape from a strip. The Annals of Probability, pp. 772–787, 1973.
  • Bošanský, B. and Čermák, J. Sequence-form algorithm for computing Stackelberg equilibria in extensive-form games. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
  • Bošanský, B., Kiekintveld, C., Lisý, V., and Pěchouček, M. An exact double-oracle algorithm for zero-sum extensive-form games with imperfect information. Journal of Artificial Intelligence Research, pp. 829–866, 2014.
  • Brown, N. and Sandholm, T. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, pp. eaao1733, Dec. 2017.
  • Brown, N. and Sandholm, T. Solving imperfect-information games via discounted regret minimization. In AAAI Conference on Artificial Intelligence (AAAI), 2019a.
  • Brown, N. and Sandholm, T. Superhuman AI for multiplayer poker. Science, 365(6456):885–890, 2019b.
  • Chen, K. and Bowling, M. Tractable objectives for robust policy optimization. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2012.
  • Chen, X., Han, Z., Zhang, H., Xue, G., Xiao, Y., and Bennis, M. Wireless resource scheduling in virtualized radio access networks using stochastic learning. IEEE Transactions on Mobile Computing, (1):1–1, 2018.
  • DeBruhl, B., Kroer, C., Datta, A., Sandholm, T., and Tague, P. Power napping with loud neighbors: optimal energy-constrained jamming and anti-jamming. In Proceedings of the 2014 ACM Conference on Security and Privacy in Wireless & Mobile Networks, pp. 117–128. ACM, 2014.
  • Farina, G., Kroer, C., and Sandholm, T. Online convex optimization for sequential decision processes and extensive-form games. In AAAI Conference on Artificial Intelligence (AAAI), 2019a.
  • Farina, G., Kroer, C., and Sandholm, T. Regret circuits: Composability of regret minimizers. In International Conference on Machine Learning (ICML), pp. 1863–1872, 2019b.
  • Farina, G., Ling, C. K., Fang, F., and Sandholm, T. Correlation in extensive-form games: Saddle-point formulation and benchmarks. In Conference on Neural Information Processing Systems (NeurIPS), 2019c.
  • Freedman, D. A. On tail probabilities for martingales. The Annals of Probability, 3(1):100–118, 1975.
  • Gibson, R., Lanctot, M., Burch, N., Szafron, D., and Bowling, M. Generalized sampling and variance in counterfactual regret minimization. In AAAI Conference on Artificial Intelligence (AAAI), 2012.
  • Hart, S. and Mas-Colell, A. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68:1127–1150, 2000.
  • Hoda, S., Gilpin, A., Peña, J., and Sandholm, T. Smoothing techniques for computing Nash equilibria of sequential games. Mathematics of Operations Research, 35(2), 2010.
  • Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
  • Johanson, M., Waugh, K., Bowling, M., and Zinkevich, M. Accelerating best response calculation in large extensive games. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2011.
  • Koller, D., Megiddo, N., and von Stengel, B. Efficient computation of equilibria for extensive two-person games. Games and Economic Behavior, 14(2), 1996.
  • Kroer, C., Waugh, K., Kılınç-Karzan, F., and Sandholm, T. Faster first-order methods for extensive-form game solving. In Proceedings of the ACM Conference on Economics and Computation (EC), 2015.
  • Kroer, C., Farina, G., and Sandholm, T. Robust Stackelberg equilibria in extensive-form games and extension to limited lookahead. In AAAI Conference on Artificial Intelligence (AAAI), 2018.
  • Kroer, C., Waugh, K., Kılınç-Karzan, F., and Sandholm, T. Faster algorithms for extensive-form game solving via improved smoothing functions. Mathematical Programming, 2020.
  • Lanctot, M., Waugh, K., Zinkevich, M., and Bowling, M. Monte Carlo sampling for regret minimization in extensive games. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2009.
  • Lisý, V., Davis, T., and Bowling, M. Counterfactual regret minimization in sequential security games. In AAAI Conference on Artificial Intelligence (AAAI), 2016.
  • McDiarmid, C. Concentration, pp. 195–248. Springer Berlin Heidelberg, Berlin, Heidelberg, 1998. ISBN 978-3-662-12788-9. doi: 10.1007/978-3-662-12788-9_6.
  • Moravčík, M., Schmid, M., Burch, N., Lisý, V., Morrill, D., Bard, N., Davis, T., Waugh, K., Johanson, M., and Bowling, M. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, May 2017.
  • Munoz de Cote, E., Stranders, R., Basilico, N., Gatti, N., and Jennings, N. Introducing alarms in adversarial patrolling games. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), pp. 1275–1276, 2013.
  • Orabona, F. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2019.
  • Romanovskii, I. Reduction of a game with complete memory to a matrix game. Soviet Mathematics, 3, 1962.
  • Ross, S. M. Goofspiel—the game of pure strategy. Journal of Applied Probability, 8(3):621–625, 1971.
  • Sandholm, T. The state of solving large incomplete-information games, and application to poker. AI Magazine, 2010. Special issue on Algorithmic Game Theory.
  • Schmid, M., Burch, N., Lanctot, M., Moravčík, M., Kadlec, R., and Bowling, M. Variance reduction in Monte Carlo counterfactual regret minimization (VR-MCCFR) for extensive-form games using baselines. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 2157–2164, 2019.
  • Shalev-Shwartz, S. and Singer, Y. A primal-dual perspective of online learning algorithms. Machine Learning, 69(2–3):115–142, 2007.
  • Southey, F., Bowling, M., Larson, B., Piccione, C., Burch, N., Billings, D., and Rayner, C. Bayes' bluff: Opponent modelling in poker. In Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence (UAI), July 2005.
  • Tammelin, O., Burch, N., Johanson, M., and Bowling, M. Solving heads-up limit Texas hold'em. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), 2015.
  • von Stengel, B. Efficient computation of behavior strategies. Games and Economic Behavior, 14(2):220–246, 1996.
  • Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning (ICML), pp. 928–936, Washington, DC, USA, 2003.
  • Zinkevich, M., Bowling, M., Johanson, M., and Piccione, C. Regret minimization in games with incomplete information. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2007.
Appendix
  • For completeness, the authors give a proof of Proposition 1. As mentioned, it is an application of the Azuma–Hoeffding inequality for martingale difference sequences, stated below (see, e.g., Theorem 3.14 of McDiarmid (1998) for a proof).
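For reference, the inequality in its standard form, as stated for instance in McDiarmid (1998):

```latex
% Azuma–Hoeffding inequality: for a martingale difference sequence
% X_1, \dots, X_T with |X_t| \le c_t almost surely,
\[
    \Pr\!\left[\,\sum_{t=1}^{T} X_t \ge \varepsilon\right]
    \;\le\;
    \exp\!\left(-\frac{\varepsilon^2}{2 \sum_{t=1}^{T} c_t^2}\right).
\]
% Choosing \varepsilon = \sqrt{2 \sum_t c_t^2 \log(1/p)} makes the right-hand
% side equal to p; with c_t = O(1) this yields the O(\sqrt{T \log(1/p)})
% high-probability regret bound quoted in the Highlights.
```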