# Finding and Certifying (Near-)Optimal Strategies in Black-Box Extensive-Form Games


Abstract:

Often---for example in war games, strategy video games, and financial simulations---the game is given to us only as a black-box simulator in which we can play it. In these settings, since the game may have unknown nature action distributions (from which we can only obtain samples) and/or be too large to expand fully, it can be difficult…


Introduction

- Computational equilibrium finding has led to many recent breakthroughs in AI in games such as poker (Bowling et al 2015; Brown and Sandholm 2017, 2019b) where the game is fully known.
- In many applications, the game is not fully known; instead, it is given only via a simulator that permits an algorithm to play through the game repeatedly (e.g., Wellman 2006; Lanctot et al 2017).
- Standard equilibrium-finding techniques, however, assume that the whole game is known exactly.

Highlights

- We develop an algorithm for extensive-form game solving that enjoys many of the same properties as outcome-sampling Monte Carlo CFR (MCCFR) but works without the problematic assumption of an a priori uniformly lower-bounded "sampling vector" that MCCFR requires.
- Our goal in this paper is to develop equilibrium-finding algorithms that give anytime, high-probability, instance-specific exploitability guarantees that can be computed without expanding the rest of the game tree, and are better than the generic guarantees given by the worst-case runtime bounds of algorithms like MCCFR
- Assuming that the confidence sequence is correct at time t, the pessimistic equilibrium computed by Algorithm 6.1 is an ε_t-equilibrium of G_t. This allows us to know when we have found an ε-equilibrium, without expanding the remainder of the game tree, even in the case when chance's strategy is not directly observable.
- We developed an MCCFR-like equilibrium-finding algorithm that converges at rate O(√(log(t)/t)) and does not require a lower-bounded sampling vector (a sketch of the underlying regret-matching primitive follows this list).
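
The core primitive behind CFR-style methods is regret matching (Hart and Mas-Colell 2000). As a hedged illustration (not the authors' black-box algorithm), the sketch below runs regret matching in self-play on a two-player zero-sum matrix game and reports the equilibrium gap (exploitability) of the average strategies, the same quantity an ε-equilibrium certificate bounds:

```python
import numpy as np

def regret_matching_policy(regrets):
    """Play each action proportionally to its positive cumulative regret."""
    pos = np.maximum(regrets, 0.0)
    total = pos.sum()
    return pos / total if total > 0 else np.full_like(regrets, 1.0 / len(regrets))

def selfplay(A, iters=20000):
    """Self-play regret matching on a zero-sum matrix game.
    Row player maximizes x @ A @ y; column player minimizes it.
    The *average* strategies converge to a Nash equilibrium."""
    m, n = A.shape
    reg_x, reg_y = np.zeros(m), np.zeros(n)
    sum_x, sum_y = np.zeros(m), np.zeros(n)
    for _ in range(iters):
        x = regret_matching_policy(reg_x)
        y = regret_matching_policy(reg_y)
        u_x = A @ y          # row player's payoff for each pure action
        u_y = -(x @ A)       # column player's payoff for each pure action
        reg_x += u_x - x @ u_x   # regret vs. current mixed strategy
        reg_y += u_y - y @ u_y
        sum_x += x
        sum_y += y
    return sum_x / iters, sum_y / iters

def equilibrium_gap(A, x, y):
    """Exploitability: total gain available to best-responding deviators.
    A certified epsilon-equilibrium has gap <= epsilon."""
    return (A @ y).max() - (x @ A).min()

# Matching pennies: the unique equilibrium is uniform play with gap 0.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
x, y = selfplay(A)
print(x, y, equilibrium_gap(A, x, y))
```

In extensive form, CFR applies this update at every information set using counterfactual utilities, and MCCFR estimates those utilities from sampled playthroughs rather than full tree traversals.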

Methods

- The authors conducted experiments on two common benchmarks:
- (1) k-rank Goofspiel. At each time t = 1, …, k, both players simultaneously place a bid for a prize. The prizes have values 1, …, k and are randomly shuffled. The valid bids are 1, …, k, and each bid can be used only once. The higher bid wins the prize; in case of a tie, the prize is split. The winner of each round is made public, but the bids are not. The authors' experiments use k = 4 (a minimal simulator sketch follows this list).
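
Because the algorithms assume only black-box access, a benchmark like Goofspiel is exposed to them purely as a playable simulator. Below is a minimal sketch of such a simulator for the k-rank Goofspiel rules described above; the class name and interface are illustrative assumptions, not the authors' code:

```python
import random

class Goofspiel:
    """Minimal k-rank Goofspiel simulator: both players simultaneously bid
    for a randomly shuffled prize; the higher bid wins, ties split the prize.
    Only the round winner is made public, not the bids themselves."""

    def __init__(self, k=4, seed=None):
        self.k = k
        self.rng = random.Random(seed)

    def reset(self):
        self.prizes = list(range(1, self.k + 1))
        self.rng.shuffle(self.prizes)          # chance: hidden prize order
        self.hands = [set(range(1, self.k + 1)) for _ in range(2)]
        self.scores = [0.0, 0.0]
        self.round = 0

    def step(self, bid0, bid1):
        """Play one simultaneous round; each bid is usable only once."""
        assert bid0 in self.hands[0] and bid1 in self.hands[1]
        self.hands[0].remove(bid0)
        self.hands[1].remove(bid1)
        prize = self.prizes[self.round]
        if bid0 > bid1:
            self.scores[0] += prize            # player 0 wins the prize
        elif bid1 > bid0:
            self.scores[1] += prize            # player 1 wins the prize
        else:
            self.scores[0] += prize / 2        # tie: split the prize
            self.scores[1] += prize / 2
        self.round += 1
        # Public observation: who won the round (bids stay private).
        return 0 if bid0 > bid1 else (1 if bid1 > bid0 else None)

# Example: one random playthrough of 4-rank Goofspiel.
g = Goofspiel(k=4, seed=0)
g.reset()
for _ in range(g.k):
    b0 = g.rng.choice(sorted(g.hands[0]))
    b1 = g.rng.choice(sorted(g.hands[1]))
    g.step(b0, b1)
print(g.scores)
```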

Conclusion

- The authors developed algorithms that construct high-probability certificates in games with only black-box access.
- The authors' method can be used with either an exact game solver (e.g., an LP solver) or a regret minimizer such as MCCFR as a subroutine (an LP sketch for the zero-sum case follows this list).
- Table 1 shows which algorithm the authors recommend based on the use case.
- The authors developed an MCCFR-like equilibrium-finding algorithm that converges at rate O(√(log(t)/t)) and does not require a lower-bounded sampling vector.
- The authors' experiments show that the algorithms produce nontrivial certificates with very few samples.
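
For the "exact game solver" option in two-player zero-sum games, the textbook subroutine is the minimax linear program; Koller, Megiddo, and von Stengel (1994) extend it to extensive form via the sequence form. Here is a minimal sketch on a matrix game (the classical formulation, not the authors' pseudogame solver):

```python
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(A):
    """Solve a zero-sum matrix game by LP.
    Row player maximizes x @ A @ y; returns (x, game value)."""
    m, n = A.shape
    # Variables z = [x_1, ..., x_m, v]; maximize v subject to (x @ A)_j >= v.
    c = np.zeros(m + 1)
    c[-1] = -1.0                                   # linprog minimizes, so min -v
    A_ub = np.hstack([-A.T, np.ones((n, 1))])      # v - (x @ A)_j <= 0 for all j
    b_ub = np.zeros(n)
    A_eq = np.ones((1, m + 1)); A_eq[0, -1] = 0.0  # sum_i x_i = 1
    b_eq = np.ones(1)
    bounds = [(0, None)] * m + [(None, None)]      # x >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]

# Rock-paper-scissors: optimal play is uniform with value 0.
A = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)
x, v = solve_zero_sum(A)
print(x, v)
```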

Summary

## Objectives:

After t playthroughs, to efficiently maintain a strategy profile σ_t and bounds ε_{i,t} on the equilibrium gap of each player's strategy that are correct with probability 1 − 1/poly(t).
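
The 1 − 1/poly(t) correctness level is the kind delivered by standard concentration inequalities. As an illustrative sketch (an assumption about the flavor of bound, not the authors' exact confidence sequence), a Hoeffding radius for the empirical probability of a chance outcome, with failure probability 1/t²:

```python
import math

def hoeffding_radius(n_samples, t):
    """Half-width of a confidence interval for a [0, 1]-bounded mean that
    fails with probability at most 1/t^2 (so 1 - 1/poly(t) correctness).
    Hoeffding: P(|mean_hat - mean| >= r) <= 2 * exp(-2 * n * r^2)."""
    if n_samples == 0:
        return 1.0  # no samples yet: only the trivial bound holds
    return math.sqrt(math.log(2 * t * t) / (2 * n_samples))

# Example: after observing a chance action 40 times within t = 1000
# playthroughs, its empirical frequency is accurate to within this radius
# with probability at least 1 - 1/t^2.
print(hoeffding_radius(40, 1000))
```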


- Table 1: Algorithms we suggest by use case in two-player zero-sum games. "Sampling-limited" means that the black-box game simulator is relatively slow or expensive compared to solving the pseudogames; "compute-limited" means that the simulator is fast or cheap compared to solving the pseudogames. In general-sum games, only Algorithm 7.2 is usable.

References

- Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Machine learning 47(2-3): 235–256.
- Berner, C.; Brockman, G.; Chan, B.; Cheung, V.; Debiak, P.; Dennison, C.; Farhi, D.; Fischer, Q.; Hashme, S.; Hesse, C.; et al. 2019. Dota 2 with Large Scale Deep Reinforcement Learning. arXiv preprint arXiv:1912.06680.
- Bowling, M.; Burch, N.; Johanson, M.; and Tammelin, O. 2015. Heads-up Limit Hold’em Poker is Solved. Science 347(6218).
- Brown, N.; and Sandholm, T. 2017. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science eaao1733.
- Brown, N.; and Sandholm, T. 2019a. Solving imperfectinformation games via discounted regret minimization. In AAAI Conference on Artificial Intelligence (AAAI).
- Brown, N.; and Sandholm, T. 2019b. Superhuman AI for multiplayer poker. Science 365(6456): 885–890.
- Farina, G.; Kroer, C.; and Sandholm, T. 2020. Stochastic regret minimization in extensive-form games. arXiv preprint arXiv:2002.08493.
- Gurobi Optimization, LLC. 2019. Gurobi Optimizer Reference Manual.
- Hart, P.; Nilsson, N.; and Raphael, B. 1968. A Formal Basis for the Heuristic Determination of Minimum Cost Paths. IEEE Transactions on Systems Science and Cybernetics 4(2): 100–107.
- Hart, S.; and Mas-Colell, A. 2000. A Simple Adaptive Procedure Leading to Correlated Equilibrium. Econometrica 68: 1127–1150.
- Hoda, S.; Gilpin, A.; Pena, J.; and Sandholm, T. 2010. Smoothing Techniques for Computing Nash Equilibria of Sequential Games. Mathematics of Operations Research 35(2).
- Koller, D.; Megiddo, N.; and von Stengel, B. 1994. Fast algorithms for finding randomized strategies in game trees. In Proceedings of the 26th ACM Symposium on Theory of Computing (STOC).
- Kroer, C.; Farina, G.; and Sandholm, T. 2018. Solving Large Sequential Games with the Excessive Gap Technique. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS).
- Lanctot, M.; Waugh, K.; Zinkevich, M.; and Bowling, M. 2009. Monte Carlo Sampling for Regret Minimization in Extensive Games. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS).
- Lanctot, M.; Zambaldi, V.; Gruslys, A.; Lazaridou, A.; Tuyls, K.; Perolat, J.; Silver, D.; and Graepel, T. 2017. A unified game-theoretic approach to multiagent reinforcement learning. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 4190–4203.
- Southey, F.; Bowling, M.; Larson, B.; Piccione, C.; Burch, N.; Billings, D.; and Rayner, C. 2005. Bayes’ Bluff: Opponent Modelling in Poker. In Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence (UAI).
- Vinyals, O.; Babuschkin, I.; Czarnecki, W. M.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D. H.; Powell, R.; Ewalds, T.; Georgiev, P.; et al. 2019. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782): 350–354.
- Wellman, M. 2006. Methods for Empirical Game-Theoretic Analysis (Extended Abstract). In Proceedings of the National Conference on Artificial Intelligence (AAAI), 1552– 1555.
- Zhang, B. H.; and Sandholm, T. 2020. Small Nash Equilibrium Certificates in Very Large Games. arXiv preprint arXiv:2006.16387.
- Zinkevich, M. 2003. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. In International Conference on Machine Learning (ICML), 928–936.
- Zinkevich, M.; Bowling, M.; Johanson, M.; and Piccione, C. 2007. Regret Minimization in Games with Incomplete Information. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS).
