Finding and Certifying (Near-)Optimal Strategies in Black-Box Extensive-Form Games


Abstract:

Often---for example in war games, strategy video games, and financial simulations---the game is given to us only as a black-box simulator in which we can play it. In these settings, since the game may have unknown nature action distributions (from which we can only obtain samples) and/or be too large to expand fully, it can be difficult...

Introduction
  • Computational equilibrium finding has led to many recent breakthroughs in AI in games such as poker (Bowling et al 2015; Brown and Sandholm 2017, 2019b) where the game is fully known.
  • In many applications, the game is not fully known; instead, it is given only via a simulator that permits an algorithm to play through the game repeatedly (e.g., Wellman 2006; Lanctot et al 2017).
  • This, however, assumes the whole game to be known exactly
Highlights
  • Computational equilibrium finding has led to many recent breakthroughs in AI in games such as poker (Bowling et al 2015; Brown and Sandholm 2017, 2019b) where the game is fully known
  • We develop an algorithm for extensive-form game solving that enjoys many of the same properties as outcome-sampling Monte Carlo CFR (MCCFR) but works without the problematic assumption of having an a priori uniformly lower-bounded “sampling vector” that is required by MCCFR
  • Our goal in this paper is to develop equilibrium-finding algorithms that give anytime, high-probability, instance-specific exploitability guarantees that can be computed without expanding the rest of the game tree, and are better than the generic guarantees given by the worst-case runtime bounds of algorithms like MCCFR
  • Assuming that the confidence sequence is correct at time t, the pessimistic equilibrium computed by Algorithm 6.1 is an εt-equilibrium of Gt. This allows us to know when we have found an ε-equilibrium, without expanding the remainder of the game tree, even in the case when chance’s strategy is not directly observable
  • We developed an MCCFR-like equilibrium-finding algorithm that converges at rate O(√(log(t)/t)), and does not require a lower-bounded sampling vector
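To make the quantity being certified concrete: in a two-player zero-sum game, the equilibrium gap (exploitability) of a strategy profile is the total amount the two players could gain by best responding, and a profile with gap at most ε is an ε-equilibrium. The sketch below is only an illustration of this standard definition on a small matrix game, not the authors' pseudogame-based method; the helper name equilibrium_gap is ours.

```python
# Minimal sketch of the equilibrium gap (exploitability) in a zero-sum matrix
# game with payoff matrix A, where the row player plays x and maximizes x^T A y:
#   gap(x, y) = max_i (A y)_i - min_j (x^T A)_j
# A profile with gap at most eps is an eps-equilibrium. This is the standard
# definition, not the authors' certificate-construction algorithm.
import numpy as np

def equilibrium_gap(A: np.ndarray, x: np.ndarray, y: np.ndarray) -> float:
    """Exploitability of the profile (x, y) in the zero-sum matrix game A."""
    best_response_value_row = np.max(A @ y)   # best the row player could do vs. y
    best_response_value_col = np.min(x @ A)   # best the column player could do vs. x
    return best_response_value_row - best_response_value_col

# Example: matching pennies. The uniform profile is an exact equilibrium,
# while a slightly skewed row strategy has a strictly positive gap.
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])
print(equilibrium_gap(A, np.array([0.5, 0.5]), np.array([0.5, 0.5])))  # 0.0
print(equilibrium_gap(A, np.array([0.6, 0.4]), np.array([0.5, 0.5])))  # ≈ 0.2
```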
Methods
  • The authors conducted experiments on two common benchmarks:

    (1) k-rank Goofspiel. At each time t = 1, . . . , k, both players simultaneously place a bid for a prize. The prizes have values 1, . . . , k, and are randomly shuffled. The valid bids are 1, . . . , k. The higher bid wins the prize; in case of a tie, the prize is split. The winner of each round is made public, but the bids are not. The authors' experiments use k = 4. (A minimal simulator sketch is given below.)
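A minimal sketch of a black-box simulator for the Goofspiel benchmark described above. It assumes, beyond what the summary states, that each player may use each bid in {1, ..., k} at most once and that the current prize is revealed before bidding (both standard in Goofspiel); the function and policy names are illustrative, not the authors' code.

```python
import random

def play_goofspiel(k, policy1, policy2, rng=random):
    """Play one game of k-rank Goofspiel through a simple black-box interface."""
    prizes = list(range(1, k + 1))
    rng.shuffle(prizes)                              # hidden nature action: prize order
    hands = [set(range(1, k + 1)), set(range(1, k + 1))]
    scores = [0.0, 0.0]
    winners = []                                     # only round winners are public
    for prize in prizes:
        obs = {"prize": prize, "winners": tuple(winners)}
        bids = [policy1({**obs, "hand": frozenset(hands[0])}),
                policy2({**obs, "hand": frozenset(hands[1])})]
        hands[0].remove(bids[0])                     # each bid usable at most once
        hands[1].remove(bids[1])
        if bids[0] > bids[1]:
            scores[0] += prize
            winners.append(0)
        elif bids[1] > bids[0]:
            scores[1] += prize
            winners.append(1)
        else:                                        # tie: the prize is split
            scores[0] += prize / 2
            scores[1] += prize / 2
            winners.append(-1)
    return scores

# Example: two uniformly random bidding policies in the k = 4 game from the experiments.
uniform = lambda obs: random.choice(sorted(obs["hand"]))
print(play_goofspiel(4, uniform, uniform))
```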
Conclusion
  • The authors developed algorithms that construct high-probability certificates in games with only black-box access.
  • The authors' method can be used with either an exact game solver (e.g., LP solver) as a subroutine or a regret minimizer such as MCCFR.
  • Table 1 shows which algorithm the authors recommend based on the use case.
  • The authors developed an MCCFR-like equilibrium-finding algorithm that converges at rate O(√(log(t)/t)), and does not require a lower-bounded sampling vector.
  • The authors' experiments show that the algorithms produce nontrivial certificates with very few samples.
Summary
  • Introduction:

    Computational equilibrium finding has led to many recent breakthroughs in AI in games such as poker (Bowling et al 2015; Brown and Sandholm 2017, 2019b) where the game is fully known.
  • In many applications, the game is not fully known; instead, it is given only via a simulator that permits an algorithm to play through the game repeatedly (e.g., Wellman 2006; Lanctot et al 2017).
  • This, however, assumes the whole game to be known exactly
  • Objectives:

    After t playthroughs, to efficiently maintain a strategy profile σt and bounds εi,t on the equilibrium gap of each player’s strategy that are correct with probability 1 − 1/poly(t).
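To illustrate the "1 − 1/poly(t)" form of these guarantees, the sketch below computes a standard Hoeffding confidence interval with failure probability 1/t² for an unknown [0, 1]-bounded mean, such as an unobserved chance probability estimated from samples. This is a generic concentration bound meant only as an illustration, not the authors' specific confidence-sequence construction.

```python
import math
import random

def hoeffding_interval(samples, t):
    """Interval containing the true mean of [0, 1]-valued samples w.p. at least 1 - 1/t^2."""
    n = len(samples)
    mean = sum(samples) / n
    delta = 1.0 / (t * t)                            # 1/poly(t) failure probability
    radius = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return max(0.0, mean - radius), min(1.0, mean + radius)

# Example: estimating an unobserved chance probability (true value 0.3 here)
# from the outcomes observed over t = 200 playthroughs.
samples = [1 if random.random() < 0.3 else 0 for _ in range(200)]
print(hoeffding_interval(samples, t=200))
```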
  • Methods:

    The authors conducted experiments on two common benchmarks:

    (1) k-rank Goofspiel. At each time t = 1, . . . , k, both players simultaneously place a bid for a prize. The prizes have values 1, . . . , k, and are randomly shuffled. The valid bids are 1, . . . , k. The higher bid wins the prize; in case of a tie, the prize is split. The winner of each round is made public, but the bids are not. The authors' experiments use k = 4.
  • Conclusion:

    The authors developed algorithms that construct high-probability certificates in games with only black-box access.
  • The authors' method can be used with either an exact game solver (e.g., LP solver) as a subroutine or a regret minimizer such as MCCFR.
  • Table 1 shows which algorithm the authors recommend based on the use case.
  • The authors developed an MCCFR-like equilibrium-finding algorithm that converges at rate O(√(log(t)/t)), and does not require a lower-bounded sampling vector.
  • The authors' experiments show that the algorithms produce nontrivial certificates with very few samples.
Tables
  • Table 1: Algorithms we suggest by use case in two-player zero-sum games. Sampling-limited means that the black-box game simulator is relatively slow or expensive compared to solving the pseudogames. Compute-limited means that the simulator is fast or cheap compared to solving the pseudogames. In general-sum games, only Algorithm 7.2 is usable.
Reference
  • Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Machine learning 47(2-3): 235–256.
  • Berner, C.; Brockman, G.; Chan, B.; Cheung, V.; Debiak, P.; Dennison, C.; Farhi, D.; Fischer, Q.; Hashme, S.; Hesse, C.; et al. 2019. Dota 2 with Large Scale Deep Reinforcement Learning. arXiv preprint arXiv:1912.06680.
  • Bowling, M.; Burch, N.; Johanson, M.; and Tammelin, O. 2015. Heads-up Limit Hold’em Poker is Solved. Science 347(6218).
  • Brown, N.; and Sandholm, T. 2017. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science eaao1733.
  • Brown, N.; and Sandholm, T. 2019a. Solving imperfectinformation games via discounted regret minimization. In AAAI Conference on Artificial Intelligence (AAAI).
  • Brown, N.; and Sandholm, T. 2019b. Superhuman AI for multiplayer poker. Science 365(6456): 885–890.
  • Farina, G.; Kroer, C.; and Sandholm, T. 2020. Stochastic regret minimization in extensive-form games. arXiv preprint arXiv:2002.08493.
  • Gurobi Optimization, LLC. 2019. Gurobi Optimizer Reference Manual.
  • Hart, P.; Nilsson, N.; and Raphael, B. 1968. A Formal Basis for the Heuristic Determination of Minimum Cost Paths. IEEE Transactions on Systems Science and Cybernetics 4(2): 100–107.
  • Hart, S.; and Mas-Colell, A. 2000. A Simple Adaptive Procedure Leading to Correlated Equilibrium. Econometrica 68: 1127–1150.
  • Hoda, S.; Gilpin, A.; Pena, J.; and Sandholm, T. 2010. Smoothing Techniques for Computing Nash Equilibria of Sequential Games. Mathematics of Operations Research 35(2).
  • Koller, D.; Megiddo, N.; and von Stengel, B. 1994. Fast algorithms for finding randomized strategies in game trees. In Proceedings of the 26th ACM Symposium on Theory of Computing (STOC).
  • Kroer, C.; Farina, G.; and Sandholm, T. 2018. Solving Large Sequential Games with the Excessive Gap Technique. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS).
  • Lanctot, M.; Waugh, K.; Zinkevich, M.; and Bowling, M. 2009. Monte Carlo Sampling for Regret Minimization in Extensive Games. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS).
  • Lanctot, M.; Zambaldi, V.; Gruslys, A.; Lazaridou, A.; Tuyls, K.; Perolat, J.; Silver, D.; and Graepel, T. 2017. A unified game-theoretic approach to multiagent reinforcement learning. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 4190–4203.
  • Southey, F.; Bowling, M.; Larson, B.; Piccione, C.; Burch, N.; Billings, D.; and Rayner, C. 2005. Bayes’ Bluff: Opponent Modelling in Poker. In Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence (UAI).
  • Vinyals, O.; Babuschkin, I.; Czarnecki, W. M.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D. H.; Powell, R.; Ewalds, T.; Georgiev, P.; et al. 2019. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782): 350–354.
  • Wellman, M. 2006. Methods for Empirical Game-Theoretic Analysis (Extended Abstract). In Proceedings of the National Conference on Artificial Intelligence (AAAI), 1552– 1555.
  • Zhang, B. H.; and Sandholm, T. 2020. Small Nash Equilibrium Certificates in Very Large Games. arXiv preprint arXiv:2006.16387.
  • Zinkevich, M. 2003. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. In International Conference on Machine Learning (ICML), 928–936.
  • Zinkevich, M.; Bowling, M.; Johanson, M.; and Piccione, C. 2007. Regret Minimization in Games with Incomplete Information. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS).