Posterior sampling for multi-agent reinforcement learning: solving extensive games with imperfect information

Yichi Zhou
Jialian Li

ICLR, 2020.

Abstract:

Posterior sampling for reinforcement learning (PSRL) is a useful framework for making decisions in an unknown environment. PSRL maintains a posterior distribution over the environment and then plans on an environment sampled from that posterior. Though PSRL works well on single-agent reinforcement learning problems, how to apply it to multi-agent problems such as two-player zero-sum extensive games with imperfect information (TEGIs) is less well understood; this work conjoins PSRL with counterfactual regret minimization (CFR) to efficiently compute approximate Nash equilibria for TEGIs with an unknown environment.

Introduction
  • Reinforcement Learning (RL) (Sutton & Barto, 2018) provides a framework for decision-making problems in an unknown environment, such as robotics control.
  • One typical target in designing RL algorithms is to reduce the number of interactions needed to find good strategies.
  • Posterior sampling for RL (PSRL) (Strens, 2000) provides a useful framework for deciding how to interact with the environment.
  • In a single-agent RL (SARL) problem, PSRL takes the strategy with the maximum expected reward on the sampled environment as the interaction strategy (Osband et al., 2013); a minimal sketch of this loop appears after this list.
  • PSRL is a Bayesian-style algorithm; empirical evaluation (Chapelle & Li, 2011) and theoretical analysis on multi-armed bandit problems (Agrawal & Goyal, 2017) suggest that it performs well on problems with fixed parameters.
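The following is a minimal sketch of the generic PSRL loop for tabular single-agent RL, in the spirit of Strens (2000) and Osband et al. (2013); the environment interface (`env.reset`, `env.step`), the Bernoulli-reward assumption, and the Dirichlet/Beta conjugate priors are illustrative assumptions, not the paper's setup.

```python
# A minimal sketch of the generic PSRL loop for tabular single-agent RL.
# The env interface and the conjugate priors are assumptions for illustration.
import numpy as np

def psrl(env, n_states, n_actions, horizon, n_episodes, rng=np.random.default_rng(0)):
    # Dirichlet posterior over transitions, Beta posterior over Bernoulli rewards
    # (conjugate priors keep the posterior update in closed form).
    trans_counts = np.ones((n_states, n_actions, n_states))   # Dirichlet(1, ..., 1) prior
    rew_counts = np.ones((n_states, n_actions, 2))            # Beta(1, 1) prior: [successes, failures]

    for _ in range(n_episodes):
        # 1) Sample an MDP from the current posterior.
        P = np.array([[rng.dirichlet(trans_counts[s, a]) for a in range(n_actions)]
                      for s in range(n_states)])
        R = rng.beta(rew_counts[..., 0], rew_counts[..., 1])

        # 2) Plan on the sampled MDP with finite-horizon value iteration.
        policy = np.zeros((horizon, n_states), dtype=int)
        V = np.zeros(n_states)
        for h in reversed(range(horizon)):
            Q = R + P @ V                       # shape (n_states, n_actions)
            policy[h] = Q.argmax(axis=1)
            V = Q.max(axis=1)

        # 3) Act with the sampled MDP's optimal policy and update the posterior.
        s = env.reset()
        for h in range(horizon):
            a = policy[h, s]
            s_next, r, done = env.step(a)       # assumed interface; r assumed in {0, 1}
            trans_counts[s, a, s_next] += 1
            rew_counts[s, a, 0] += r
            rew_counts[s, a, 1] += 1 - r
            s = s_next
            if done:
                break
    return trans_counts, rew_counts
```

The Dirichlet/Beta choice is just one convenient way to keep step 3 a simple counting update; any posterior that can be sampled from would fit the same loop.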
Highlights
  • Reinforcement Learning (RL) (Sutton & Barto, 2018) provides a framework for decision-making problems in an unknown environment, such as robotics control
  • We consider the problem of posterior sampling for two-player zero-sum extensive games with imperfect information (TEGIs), which is a class of multi-agent reinforcement learning problems
  • Considering one sample from the prior, frequentist methods such as UCBVI (Azar et al., 2013) give a high-probability regret bound for single-agent RL of a similar order to PSRL
  • Another open direction stems from the fact that our method relies heavily on the structure of two-player zero-sum extensive games with imperfect information and on the Nash equilibrium solution concept
  • Further work is needed to extend posterior sampling to more complicated multi-agent systems, such as stochastic games (Littman, 1994) and extensive games with more than two players
  • Generalization of PSRL is another important but challenging direction for future work; a systematic investigation is needed to bridge the gap between provable tabular RL algorithms and PSRL methods with generalization
Methods
  • The authors formally present their method, which conjoins the merits of PSRL and CFR and can efficiently compute an approximate NE for TEGI tasks.
  • The authors give an overview of the algorithm; a simplified sketch of its control flow appears after this list.
  • They sample a d_t from the posterior distribution P_t and apply CFR to d_t to obtain a policy tuple.
  • They sample another d_t to compute the interaction strategies.
  • The authors' algorithm converges to the NE at a rate of $O(\sqrt{\log(T)/T})$.
  • The time complexity of computing the interaction strategies is linear in |H|.
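Below is a simplified sketch of that control flow, reduced to a two-player zero-sum matrix game whose payoff entries are unknown Bernoulli means with Beta posteriors. In normal form, CFR reduces to regret matching, and the paper's construction of the interaction strategies from a second posterior sample is omitted; all names and the Bernoulli/Beta modeling are illustrative assumptions, not the paper's algorithm.

```python
# Miniature analogue of the loop above: posterior sampling + regret matching on a
# zero-sum matrix game with Bernoulli payoffs (row player maximizes, column minimizes).
import numpy as np

rng = np.random.default_rng(0)

def regret_matching(regrets):
    """Map cumulative regrets to a strategy (uniform when no positive regret)."""
    pos = np.maximum(regrets, 0.0)
    return pos / pos.sum() if pos.sum() > 0 else np.full(len(regrets), 1.0 / len(regrets))

def psrl_regret_matching(true_payoff, n_rounds=20000):
    n1, n2 = true_payoff.shape
    alpha = np.ones((n1, n2))                # Beta(1, 1) posterior per payoff entry
    beta = np.ones((n1, n2))
    reg1, reg2 = np.zeros(n1), np.zeros(n2)  # cumulative regrets
    avg1, avg2 = np.zeros(n1), np.zeros(n2)  # accumulated (average) strategies

    for _ in range(n_rounds):
        # 1) Sample an environment from the posterior and run one regret update on it.
        A = rng.beta(alpha, beta)
        s1, s2 = regret_matching(reg1), regret_matching(reg2)
        reg1 += A @ s2 - s1 @ A @ s2         # row player's regret against each pure action
        reg2 += s1 @ A @ s2 - s1 @ A         # column player's regret (minimizing A)
        avg1 += s1
        avg2 += s2

        # 2) Interact: here the players simply play their current regret-matching
        #    strategies; the paper instead builds dedicated interaction strategies
        #    from a second, independent posterior sample.
        a1 = rng.choice(n1, p=s1)
        a2 = rng.choice(n2, p=s2)

        # 3) Observe the (Bernoulli) payoff of the played entry and update the posterior.
        r = rng.binomial(1, true_payoff[a1, a2])
        alpha[a1, a2] += r
        beta[a1, a2] += 1 - r

    return avg1 / avg1.sum(), avg2 / avg2.sum()

# Example: matching pennies with payoffs rescaled to {0, 1}; the unique NE is uniform play.
print(psrl_regret_matching(np.array([[1.0, 0.0], [0.0, 1.0]])))
```

In the full extensive-form setting, the separately constructed interaction strategies are what guarantee enough exploration of the unknown chance player; this sketch only shows the sample-update-interact skeleton.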
Conclusion
  • In this work, the authors consider the problem of posterior sampling for TEGIs, which is a class of multi-agent reinforcement learning problems.
  • Though the method may perform better on a specific TEGI than the bound in Theorem 1 suggests, the algorithm is likely not optimal in terms of problem-dependent performance.
  • Osband et al. (2016) apply the principle of PSRL to DQN by using bootstrapping.
  • Another possible direction is to adapt more practical Bayesian inference algorithms to RL tasks.
Related work
  • Other methods for TEGIs in an unknown environment: There also exist some works on TEGIs in an unknown environment. Fictitious play (FP) (Brown, 1951) is another popular algorithm for approximating an NE. In FP, each agent plays the best response to the average strategy of its opponent (a minimal sketch of this update appears after this paragraph). Heinrich et al. (2015) extend FP to TEGIs. Though FP may be easier than CFR to combine with other machine learning techniques, its convergence rate is usually worse than that of CFR variants when the chance player is known. Monte Carlo CFR with outcome sampling (MCCFR-OS) (Lanctot et al., 2009) can also be applied to TEGIs to approximate an NE in a model-free style: it uses Monte Carlo estimates of the environment to run CFR and converges to the NE, but since it updates without a model of the environment, it is much less efficient than model-based methods. There is also work that applies SARL methods to TEGIs; for example, Srinivasan et al. (2018) adapt actor-critic methods to games in a model-free style.
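As a concrete illustration of the fictitious-play update mentioned above, here is a minimal sketch for a known two-player zero-sum matrix game, where each player best-responds to the opponent's empirical average strategy; the payoff matrix and function names are illustrative, not from the paper.

```python
# Minimal fictitious play (Brown, 1951) on a known zero-sum matrix game:
# each player best-responds to the opponent's empirical average strategy.
import numpy as np

def fictitious_play(A, n_rounds=10000):
    n1, n2 = A.shape
    counts1 = np.ones(n1)        # action counts, seeded with one play of each action
    counts2 = np.ones(n2)
    for _ in range(n_rounds):
        avg1 = counts1 / counts1.sum()
        avg2 = counts2 / counts2.sum()
        # Row player maximizes A, column player minimizes it.
        counts1[np.argmax(A @ avg2)] += 1
        counts2[np.argmin(avg1 @ A)] += 1
    return counts1 / counts1.sum(), counts2 / counts2.sum()

# Example: matching pennies; the empirical strategies approach (0.5, 0.5).
print(fictitious_play(np.array([[1.0, -1.0], [-1.0, 1.0]])))
```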
Funding
  • This work was supported by the National Key Research and Development Program of China (No. 2017YFA0700904), NSFC Projects (Nos. 61620106010, U19B2034, U1811461), Beijing NSF Project (No. L172037), Beijing Academy of Artificial Intelligence (BAAI), the Tsinghua-Huawei Joint Research Program, a grant from the Tsinghua Institute for Guo Qiang, the Tiangong Institute for Intelligent Computing, the JP Morgan Faculty Research Program, and the NVIDIA NVAIL Program with GPU/DGX Acceleration.
Reference
  • Shipra Agrawal and Navin Goyal. Near-optimal regret bounds for Thompson sampling. Journal of the ACM (JACM), 64(5):30, 2017.
  • Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J. Kappen. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91(3):325–349, 2013.
  • George W. Brown. Iterative solution of games by fictitious play. In Activity Analysis of Production and Allocation (T. C. Koopmans, Ed.), pp. 374–376, 1951.
  • Neil Burch. Time and space: Why imperfect information games are hard. 2018.
  • Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pp. 2249–2257, 2011.
  • Richard Ericson and Ariel Pakes. Markov-perfect industry dynamics: A framework for empirical work. The Review of Economic Studies, 62(1):53–82, 1995.
  • Johannes Heinrich, Marc Lanctot, and David Silver. Fictitious self-play in extensive-form games. In International Conference on Machine Learning, pp. 805–813, 2015.
  • Junling Hu and Michael P. Wellman. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4(Nov):1039–1069, 2003.
  • Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
  • Marc Lanctot, Kevin Waugh, Martin Zinkevich, and Michael Bowling. Monte Carlo sampling for regret minimization in extensive games. In Advances in Neural Information Processing Systems, pp. 1078–1086, 2009.
  • Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994, pp. 157–163, 1994.
  • Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? arXiv preprint arXiv:1607.00215, 2016.
  • Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pp. 3003–3011, 2013.
  • Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pp. 4026–4034, 2016.
Appendix (proof sketch fragments)
  • By directly applying the result of Burch (2018), the CFR regret can be upper bounded.
  • First consider the second term, $|u_{i,t}(z_{i,t}) - u_i(z_{i,t} \mid r^*)|$; a similar bound holds for the first term. Conditioning on $r^*(z_{i,t})$, the Chernoff-Hoeffding bound (Hoeffding, 1994) applies for $\delta \in (0, 1)$.
  • Then, conditioning on $c^*(h_{C_j,t}, a)$, the concentration bound for the $L_1$ norm (i.e., the deviation inequality of Weissman et al. (2003)) gives a bound that holds for $\delta \in (0, 1)$.
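For reference, the standard forms of the two concentration inequalities invoked in these fragments are sketched below; the bounded range $[0,1]$ and the notation ($n$ samples, alphabet size $A$) are our assumptions, since the fragments omit the exact statements.

```latex
% Chernoff-Hoeffding bound: for i.i.d. X_1, \dots, X_n \in [0, 1] with mean \mu,
% with probability at least 1 - \delta,
\left| \frac{1}{n} \sum_{k=1}^{n} X_k - \mu \right| \le \sqrt{\frac{\log(2/\delta)}{2n}} .

% L1 deviation inequality (Weissman et al., 2003): for the empirical distribution
% \hat{p}_n of n i.i.d. samples from a distribution p over an alphabet of size A,
\Pr\bigl( \lVert \hat{p}_n - p \rVert_1 \ge \varepsilon \bigr) \le (2^{A} - 2)\, e^{-n\varepsilon^2 / 2} .
```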