Lazy-CFR: fast and near-optimal regret minimization for extensive games with imperfect information

ICLR, 2020.

Abstract:

Counterfactual regret minimization (CFR) methods are effective for solving two-player zero-sum extensive games with imperfect information, achieving state-of-the-art results. However, the vanilla CFR has to traverse the whole game tree in each round, which is time-consuming in large-scale games. In this paper, we present Lazy-CFR, a CFR algorithm...

Highlights
  • Extensive games provide a mathematical framework for modeling sequential decision-making problems with imperfect information, which are common in economic decisions, negotiations and security
  • We focus on solving two-player zero-sum extensive games with imperfect information (TEGI)
  • By comparing with the regret lower bound, we show that the regret upper bounds of Lazy-Counterfactual regret minimization and the vanilla Counterfactual regret minimization are near-optimal
  • We empirically evaluate Lazy-Counterfactual regret minimization, the vanilla Counterfactual regret minimization, MC-Counterfactual regret minimization (Lanctot et al, 2009), Counterfactual regret minimization+ (Bowling et al, 2017) and MC-Counterfactual regret minimization+ on the standard benchmarks, Leduc Hold’em (Southey et al, 2005) and heads-up flop hold’em poker (Brown et al, 2019)
  • As Counterfactual regret minimization employs regret matching or AdaHedge as a sub-procedure, we summarize them in Definition 2 (online linear optimization (OLO), regret matching (RM) and AdaHedge); a minimal sketch of regret matching is given after this list
  • We propose a new framework to develop efficient variants of Counterfactual regret minimization, together with an analysis showing that our algorithm is provably faster than the vanilla Counterfactual regret minimization
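
The highlights refer to regret matching only by name, so the following is a minimal sketch of how RM solves a single online linear optimization instance, i.e. the sub-procedure run at one infoset. The class and method names are illustrative rather than taken from the paper's code, and AdaHedge would simply replace this solver with an adaptive exponential-weights update.

```python
import numpy as np

class RegretMatching:
    """Regret matching (RM) for a single online linear optimization (OLO)
    instance, i.e. one infoset in CFR. Illustrative sketch only."""

    def __init__(self, num_actions):
        self.cum_regret = np.zeros(num_actions)

    def strategy(self):
        # Play proportionally to the positive part of the cumulative regret;
        # fall back to the uniform strategy when no action has positive regret.
        pos = np.maximum(self.cum_regret, 0.0)
        total = pos.sum()
        if total <= 0.0:
            return np.full_like(pos, 1.0 / len(pos))
        return pos / total

    def observe(self, reward):
        # reward: vector of (counterfactual) action values for this round.
        sigma = self.strategy()
        expected = float(sigma @ reward)
        # Accumulate each action's instantaneous regret against the strategy.
        self.cum_regret += reward - expected
```

In CFR, one such solver is maintained per infoset, with `reward` set to the counterfactual action values weighted by the opponent's reach probability.
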
Summary
  • Extensive games provide a mathematical framework for modeling sequential decision-making problems with imperfect information, which are common in economic decisions, negotiations and security.
  • We define $\pi^\sigma(I) = \sum_{h \in I} \pi^\sigma(h)$ as the probability of arriving at infoset $I$, and let $\pi_i^\sigma(I)$ denote the corresponding contribution of player $i$.
  • We use lazy update to solve TEGIs. According to Eq. (3), the regret minimization procedure can be divided into $O(|\mathcal{I}|)$ OLOs, one for each infoset.
  • We present one rule for Lazy-CFR with which we can: 1) achieve a regret comparable to that of the vanilla CFR; 2) avoid updating the whole game tree; and 3) compute $r_j(I)$ efficiently.
  • Let $\tau_t(I)$ denote the last time step at which we updated the strategy on infoset $I$ before time $t$.
  • Tentatively assuming that we can compute $m_t$ and $r_j$ efficiently (see Sec. 3.2.1 for details), the convergence rate of Lazy-CFR depends on the total number of strategy updates and on its regret; a schematic sketch of this bookkeeping appears after this summary.
  • Let $S_{1,t}$ denote the set of histories such that if $h \in S_{1,t}$, the strategy on $\mathrm{sub}_t(h)$ is modified at round $t$, and let $S_{2,t} = \{h : h \notin S_{1,t}, \mathrm{pa}(h) \in S_{1,t}\}$.
  • To get Lazy-CFR+ and Lazy-LCFR, we only need to replace RM by the corresponding OLO solvers and use their methods of computing the time-averaged strategy, as in (Bowling et al, 2017) and (Brown & Sandholm, 2019a) respectively.
  • We extend the regret analysis of the vanilla CFR in (Burch, 2018) to the members of the CFR family with lazy update.
  • The regret of CFR with lazy update can be bounded as $R_i^T(\sigma) \le O(\sqrt{T}\,\eta(\sigma))$.
  • In the analysis of the lower bound, we consider the standard adversarial setting in online learning, in which an adversary selects $\sigma_{-i}^t$ and a reward function $u_i^t : Z \to [-1, 1]$, where $Z$ is the set of terminal nodes in the infoset tree of player $i$.
  • For an algorithm $\mathcal{A}$ that generates $\sigma_i^t$ given the past history, let $R_{i,\mathcal{A}}^T$ denote the regret of $\mathcal{A}$ in the first $T$ rounds; the lower bound is then stated as the limit of $\min_{\mathcal{A}} \max_{\sigma_{-i}^t, u_i^t} R_{i,\mathcal{A}}^T$ as $T \to \infty$.
  • Monte-Carlo based CFR (MC-CFR) (Lanctot et al, 2009; Burch N, 2012) uses Monte-Carlo sampling to avoid updating the strategy on infosets that are reached with small probability.
  • Pruning-based variants (Brown & Sandholm, 2017a; 2015) skip branches of the game tree if they do not affect the regret, but their performance can deteriorate to that of the vanilla CFR in the worst case.
  • Our analysis is essentially an extension of the regret analysis of the vanilla CFR in (Burch, 2018) to other variants of CFR with lazy update.
  • Reducing the space complexity is worth a systematic study.
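
To make the role of $\tau_t(I)$ and the buffered regrets concrete, here is a schematic Python sketch of the lazy-update bookkeeping, referenced from the summary above. The names (`LazyUpdater`, `accumulate`, `should_update`) are illustrative, the update rule itself is abstracted into a callback, and this is a sketch of the idea rather than the paper's implementation.

```python
import numpy as np

def rm_strategy(cum_regret):
    # Regret-matching strategy: normalized positive part of the cumulative regret.
    pos = np.maximum(cum_regret, 0.0)
    total = pos.sum()
    return pos / total if total > 0.0 else np.full_like(pos, 1.0 / len(pos))

class LazyUpdater:
    """Per-infoset bookkeeping for lazy updates: tau[I] records the last
    round at which sigma(I) was recomputed, and counterfactual values are
    buffered until the update rule decides to touch I's subtree.
    Illustrative sketch only."""

    def __init__(self, infosets, num_actions):
        self.cum_regret = {I: np.zeros(num_actions) for I in infosets}
        self.acc_values = {I: np.zeros(num_actions) for I in infosets}
        self.tau = {I: 0 for I in infosets}  # tau_t(I)

    def strategy(self, I):
        return rm_strategy(self.cum_regret[I])

    def accumulate(self, I, cf_values):
        # Cheap per-round step: buffer counterfactual action values
        # instead of recomputing sigma(I).
        self.acc_values[I] += cf_values

    def maybe_update(self, I, t, should_update):
        # should_update stands in for the paper's rule (e.g. a threshold on
        # the accumulated reach probability m_t); it is not specified here.
        if should_update(I, t, self.tau[I]):
            sigma = self.strategy(I)
            # sigma(I) has been fixed since round tau[I], so applying the
            # buffered sum reproduces the round-by-round regret updates.
            self.cum_regret[I] += self.acc_values[I] - sigma @ self.acc_values[I]
            self.acc_values[I][:] = 0.0
            self.tau[I] = t
```

The point of the construction is that the cheap `accumulate` call replaces a full traversal of the infoset's subtree at every round, while the cost of recomputing the strategy is paid only when the update rule fires.
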
Related work
  • Monte-Carlo and pruning-based methods: There are several variants of CFR which attempt to avoid traversing the whole game tree at each round. Monte-Carlo based CFR (MC-CFR) (Lanctot et al, 2009; Burch N, 2012) uses Monte-Carlo sampling to avoid updating the strategy on infosets that are reached with small probability. Pruning-based variants (Brown & Sandholm, 2017a; 2015) skip branches of the game tree if they do not affect the regret, but their performance can deteriorate to that of the vanilla CFR in the worst case. Dynamic thresholding (Brown et al, 2017) directly prunes branches with small reach probabilities. In this work, we do not compare with the pruning-based method of (Brown & Sandholm, 2019b), since the pruning technique is orthogonal to lazy update.
Funding
  • This work was supported by the National Key Research and Development Program of China (No 2017YFA0700904), NSFC Projects (Nos. 61620106010, U19B2034, U1811461), Beijing NSF Project (No L172037), Beijing Academy of Artificial Intelligence (BAAI), Tsinghua-Huawei Joint Research Program, a grant from Tsinghua Institute for Guo Qiang, Tiangong Institute for Intelligent Computing, the JP Morgan Faculty Research Program and the NVIDIA NVAIL Program with GPU/DGX Acceleration