# Provable Self-Play Algorithms for Competitive Reinforcement Learning

ICML, pp. 551-560, 2020.

Abstract:

Self-play, where the algorithm learns by playing against itself without requiring any direct supervision, has become the new weapon in modern Reinforcement Learning (RL) for achieving superhuman performance in practice. However, the majority of existing theory in reinforcement learning only applies to the setting where the agent plays ...


Introduction

- This paper studies competitive reinforcement learning, that is, reinforcement learning with two or more agents taking actions simultaneously, but each maximizing their own reward.
- A key highlight in these approaches is the successful use of self-play for achieving super-human performance in the absence of human knowledge or expert opponents.
- These self-play algorithms are able to learn a good policy for all players from scratch through repeatedly playing the current policies against each other and performing policy updates using these self-played game trajectories.
- The empirical success of self-play has challenged the conventional wisdom that expert opponents are necessary for achieving good performance, and calls for a better theoretical understanding.

Highlights

- To the best of our knowledge, our work presents the first line of provably sample-efficient self-play algorithms for competitive reinforcement learning.
- This paper studies competitive reinforcement learning, that is, reinforcement learning with two or more agents taking actions simultaneously, but each maximizing their own reward
- The goal of this paper is to design low-regret algorithms for solving episodic two-player Markov games in the general setting (Kearns and Singh, 2002), that is, the algorithm is allowed to play the game for a fixed number of episodes using arbitrary policies, and its performance is measured in terms of the regret.
- We studied the sample complexity of finding the equilibrium policy in the setting of competitive reinforcement learning, i.e. zero-sum Markov games with two players.
- Towards investigating the optimal runtime and sample complexity in two-player games, we provided accompanying results showing that (1) the computational efficiency of our algorithm can be improved by explore-then-exploit type algorithms, at the cost of slightly worse regret; (2) the state and action space dependence in the regret can be reduced in the special case of one-step games via alternative mirror descent type algorithms.
- We believe this paper opens up many interesting directions for future work. For example, can we design computationally efficient algorithms that achieve O(√T) regret? What is the optimal dependence of the regret on (S, A, B) in multi-step games? The present results only work in tabular games, and it would be of interest to investigate if similar results can hold in the presence of function approximation.
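The regret referenced throughout these highlights is commonly formalized as a cumulative duality gap over the deployed policy pairs. The following is a standard textbook-style definition in generic notation, not necessarily the paper's exact choice of symbols:

```latex
\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K}
\Bigl[\, \sup_{\mu}\, V^{\mu,\,\nu_k}(s_1) \;-\; \inf_{\nu}\, V^{\mu_k,\,\nu}(s_1) \,\Bigr],
```

where $(\mu_k, \nu_k)$ are the policies played in episode $k$ and $V^{\mu,\nu}(s_1)$ is the max-player's expected return from the initial state $s_1$. Each summand is nonnegative and equals zero exactly when $(\mu_k, \nu_k)$ is a Nash equilibrium, which is why a sublinear regret bound also yields a near-equilibrium policy via online-to-batch conversion.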

Results

- Algorithm that achieves O(√T) regret in Markov games. The authors present their algorithm and main theorems: the self-play algorithm is described in Section 3.1, and its theoretical guarantee for general Markov games is presented in Section 3.2.
- To solve zero-sum Markov games, the main idea is to extend the celebrated UCB (Upper Confidence Bound) principle, an algorithmic principle that achieves provably efficient exploration in bandits (Auer et al., 2002) and single-agent RL (Azar et al., 2017; Jin et al., 2018), to the two-player setting.
- It seems natural here to maintain two sets of Q estimates, one upper bounding the true value and one lower bounding the true value, so that each player can play optimistically with respect to her own goal.
- The authors summarize this idea into the following proposal.
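The two-sided optimism idea can be sketched as a toy tabular update. This is a hypothetical minimal sketch, not the paper's VI-ULCB pseudocode: `update_bounds`, its argument names, and the concrete step size are illustrative assumptions; the essential point is that the upper bound gets a positive UCB-style bonus and the lower bound a negative one, so each player explores optimistically for her own objective.

```python
import math

def update_bounds(q_up, q_lo, count, s, a, b, reward, v_up_next, v_lo_next, c=1.0):
    """One toy optimistic/pessimistic Q-update for a zero-sum Markov game.

    q_up upper-bounds the true Q value (max-player's optimism), q_lo
    lower-bounds it (min-player's optimism); the exploration bonus shrinks
    as the joint state-action triple (s, a, b) is visited more often.
    Hypothetical sketch, not the paper's exact algorithm.
    """
    key = (s, a, b)
    n = count[key] = count.get(key, 0) + 1
    bonus = c * math.sqrt(1.0 / n)   # UCB-style exploration bonus
    lr = 1.0 / n                     # simple averaging step size
    q_up[key] = (1 - lr) * q_up.get(key, 1.0) + lr * (reward + v_up_next + bonus)
    q_lo[key] = (1 - lr) * q_lo.get(key, 0.0) + lr * (reward + v_lo_next - bonus)
    return q_up[key], q_lo[key]

# One update at a hypothetical state with joint action (a, b) = (0, 1):
q_up, q_lo, count = {}, {}, {}
up, lo = update_bounds(q_up, q_lo, count, "s1", 0, 1,
                       reward=0.5, v_up_next=0.0, v_lo_next=0.0)
```

The two bounds sandwich the true value, and as counts grow the bonuses shrink, so the gap `up - lo` contracts on frequently visited triples.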

Conclusion

- The authors studied the sample complexity of finding the equilibrium policy in the setting of competitive reinforcement learning, i.e. zero-sum Markov games with two players.
- The authors designed a self-play algorithm for zero-sum games and showed that it can efficiently find the Nash equilibrium policy in the exploration setting through establishing a regret bound.
- Towards investigating the optimal runtime and sample complexity in two-player games, the authors provided accompanying results showing that (1) the computational efficiency of the algorithm can be improved by explore-then-exploit type algorithms, at the cost of slightly worse regret; (2) the state and action space dependence in the regret can be reduced in the special case of one-step games via alternative mirror descent type algorithms.
- The authors believe this paper opens up many interesting directions for future work. For example, can one design computationally efficient algorithms that achieve O(√T) regret? What is the optimal dependence of the regret on (S, A, B) in multi-step games? The present results only work in tabular games, and it would be of interest to investigate if similar results can hold in the presence of function approximation.

Summary

## Introduction:

This paper studies competitive reinforcement learning, that is, reinforcement learning with two or more agents taking actions simultaneously, but each maximizing their own reward.
- A key highlight in these approaches is the successful use of self-play for achieving super-human performance in the absence of human knowledge or expert opponents.
- These self-play algorithms are able to learn a good policy for all players from scratch through repeatedly playing the current policies against each other and performing policy updates using these self-played game trajectories.
- The empirical success of self-play has challenged the conventional wisdom that expert opponents are necessary for achieving good performance, and calls for a better theoretical understanding.
## Objectives:

The goal of this paper is to design low-regret algorithms for solving episodic two-player Markov games in the general setting (Kearns and Singh, 2002), that is, the algorithm is allowed to play the game for a fixed number of episodes using arbitrary policies, and its performance is measured in terms of the regret.

## Results:

Algorithm that achieves O(√T) regret in Markov games. The authors present their algorithm and main theorems: the self-play algorithm is described in Section 3.1, and its theoretical guarantee for general Markov games is presented in Section 3.2.
- To solve zero-sum Markov games, the main idea is to extend the celebrated UCB (Upper Confidence Bound) principle, an algorithmic principle that achieves provably efficient exploration in bandits (Auer et al., 2002) and single-agent RL (Azar et al., 2017; Jin et al., 2018), to the two-player setting.
- It seems natural here to maintain two sets of Q estimates, one upper bounding the true value and one lower bounding the true value, so that each player can play optimistically with respect to her own goal.
- The authors summarize this idea into the following proposal.
## Conclusion:

The authors studied the sample complexity of finding the equilibrium policy in the setting of competitive reinforcement learning, i.e. zero-sum Markov games with two players.
- The authors designed a self-play algorithm for zero-sum games and showed that it can efficiently find the Nash equilibrium policy in the exploration setting through establishing a regret bound.
- Towards investigating the optimal runtime and sample complexity in two-player games, the authors provided accompanying results showing that (1) the computational efficiency of the algorithm can be improved by explore-then-exploit type algorithms, at the cost of slightly worse regret; (2) the state and action space dependence in the regret can be reduced in the special case of one-step games via alternative mirror descent type algorithms.
- The authors believe this paper opens up many interesting directions for future work. For example, can one design computationally efficient algorithms that achieve O(√T) regret? What is the optimal dependence of the regret on (S, A, B) in multi-step games? The present results only work in tabular games, and it would be of interest to investigate if similar results can hold in the presence of function approximation.

- Table 1: Regret and PAC guarantees of the algorithms in this paper for zero-sum Markov games.

Related work

- There is a fast-growing body of work on multi-agent reinforcement learning (MARL). Many of them achieve striking empirical performance, or attack MARL in the cooperative setting, where agents are optimizing for a shared or similar reward. We refer the readers to several recent surveys for these results (see e.g. Busoniu et al, 2010; Nguyen et al, 2018; OroojlooyJadid and Hajinezhad, 2019; Zhang et al, 2019). In the rest of this section we focus on theoretical results related to competitive RL.

Markov games. Markov games (or stochastic games) were proposed as a mathematical model for competitive RL back in the early 1950s (Shapley, 1953). There is a long line of classical work since then on solving this problem (see e.g. Littman, 1994, 2001; Hu and Wellman, 2003; Hansen et al., 2013). These works design algorithms, possibly with runtime guarantees, to find optimal policies in Markov games when both the transition matrix and reward are known, or in the asymptotic setting where the number of samples goes to infinity. These results do not directly apply to the non-asymptotic setting where the transition and reward are unknown and only a limited amount of data is available for estimating them.
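As a minimal illustration of this classical known-model setting: the one-state (matrix game) case can be solved approximately by fictitious play, a textbook method in which each player best-responds to the opponent's empirical action frequencies and which provably converges in zero-sum games (Robinson's theorem). This is a classical illustration only, not an algorithm from the paper:

```python
def fictitious_play(payoff, iters=2000):
    """Approximate the value of a zero-sum matrix game (row player maximizes,
    column player minimizes) by fictitious play: each player best-responds
    to the opponent's empirical action frequencies.  Classical known-model
    method, shown only to illustrate the matrix-game subproblem.
    """
    m, n = len(payoff), len(payoff[0])
    row_counts, col_counts = [0] * m, [0] * n
    row_counts[0] += 1   # arbitrary initial plays
    col_counts[0] += 1
    for _ in range(iters):
        # Row player best-responds to the column player's empirical mixture.
        row_br = max(range(m), key=lambda i: sum(payoff[i][j] * col_counts[j] for j in range(n)))
        # Column player best-responds to the row player's empirical mixture.
        col_br = min(range(n), key=lambda j: sum(payoff[i][j] * row_counts[i] for i in range(m)))
        row_counts[row_br] += 1
        col_counts[col_br] += 1
    t = sum(row_counts)
    # Value estimate: row player's payoff under the two empirical mixtures.
    return sum(payoff[i][j] * row_counts[i] * col_counts[j]
               for i in range(m) for j in range(n)) / (t * t)

# Matching pennies: the Nash value is 0, attained by uniform mixed strategies.
v = fictitious_play([[1, -1], [-1, 1]])
```

Solving a full Markov game with a known model runs this kind of matrix-game solve inside value iteration at every state (Shapley, 1953), which is exactly what breaks down when the transition and reward must be estimated from limited data.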

Reference

- Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
- Mohammad Gheshlaghi Azar, Ian Osband, and Remi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 263–272. JMLR.org, 2017.
- Ronen I Brafman and Moshe Tennenholtz. R-max-a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
- Wang Chi Cheung, David Simchi-Levi, and Ruihao Zhu. Reinforcement learning under drift. arXiv preprint arXiv:1906.02922, 2019.
- Jacob W Crandall and Michael A Goodrich. Learning to compete, compromise, and cooperate in repeated general-sum games. In Proceedings of the 22nd international conference on Machine learning, pages 161–168, 2005.
- Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713– 5723, 2017.
- Constantinos Daskalakis. On the complexity of approximating a nash equilibrium. ACM Transactions on Algorithms (TALG), 9(3):23, 2013.
- Jerzy Filar and Koos Vrieze. Competitive Markov decision processes. Springer Science & Business Media, 2012.
- Thomas Dueholm Hansen, Peter Bro Miltersen, and Uri Zwick. Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. Journal of the ACM (JACM), 60 (1):1–16, 2013.
- Junling Hu and Michael P Wellman. Nash q-learning for general-sum stochastic games. Journal of machine learning research, 4(Nov):1039–1069, 2003.
- Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
- Zeyu Jia, Lin F Yang, and Mengdi Wang. Feature-based q-learning for two-player stochastic games. arXiv preprint arXiv:1906.00423, 2019.
- Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4868–4878, 2018.
- Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, and Tiancheng Yu. Learning adversarial markov decision processes with bandit feedback and unknown transition. arXiv preprint arXiv:1912.01192, 2019.
- Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-free exploration for reinforcement learning. arXiv preprint arXiv:2002.02794, 2020.
- Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine learning, 49(2-3):209–232, 2002.
- Daphne Koller. Fast algorithms for finding randomized strategies in game trees. In Proceedings of the 26th Annual ACM Symposium on Theory of Computing, pages 750–759, 1994.
- Tor Lattimore and Csaba Szepesvari. Bandit algorithms. 2018.
- Carlton E Lemke and Joseph T Howson, Jr. Equilibrium points of bimatrix games. Journal of the Society for industrial and Applied Mathematics, 12(2):413–423, 1964.
- Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994, pages 157–163. Elsevier, 1994.
- Michael L Littman. Friend-or-foe q-learning in general-sum games. In ICML, volume 1, pages 322–328, 2001.
- Thodoris Lykouris, Max Simchowitz, Aleksandrs Slivkins, and Wen Sun. Corruption robust exploration in episodic reinforcement learning. arXiv preprint arXiv:1911.08689, 2019.
- John Nash. Non-cooperative games. Annals of mathematics, pages 286–295, 1951.
- Thanh Thi Nguyen, Ngoc Duy Nguyen, and Saeid Nahavandi. Deep reinforcement learning for multi-agent systems: A review of challenges, solutions and applications. arXiv preprint arXiv:1812.11794, 2018.
- OpenAI. Openai five. https://blog.openai.com/openai-five/, 2018.
- Afshin OroojlooyJadid and Davood Hajinezhad. A review of cooperative multi-agent deep reinforcement learning. arXiv preprint arXiv:1908.03963, 2019.
- Ian Osband and Benjamin Van Roy. On lower bounds for regret in reinforcement learning. arXiv preprint arXiv:1608.02732, 2016.
- Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635, 2014.
- Sasha Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems, pages 3066–3074, 2013.
- Aviv Rosenberg and Yishay Mansour. Online convex optimization in adversarial markov decision processes. arXiv preprint arXiv:1905.07773, 2019.
- Yevgeny Seldin and Aleksandrs Slivkins. One practical algorithm for both stochastic and adversarial bandits. In ICML, pages 1287–1295, 2014.
- Lloyd S Shapley. Stochastic games. Proceedings of the national academy of sciences, 39(10):1095–1100, 1953.
- Aaron Sidford, Mengdi Wang, Lin F Yang, and Yinyu Ye. Solving discounted stochastic two-player games with near-optimal time and sample complexity. arXiv preprint arXiv:1908.11071, 2019.
- David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. nature, 550(7676):354–359, 2017.
- Alexander L Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L Littman. PAC model-free reinforcement learning. In International Conference on Machine Learning, pages 881–888, 2006.
- Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michael Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
- John von Neumann. Zur theorie der gesellschaftsspiele. Mathematische annalen, 100(1):295–320, 1928.
- Chen-Yu Wei, Yi-Te Hong, and Chi-Jen Lu. Online reinforcement learning in stochastic games. In Advances in Neural Information Processing Systems, pages 4987–4997, 2017.
- Jia Yuan Yu and Shie Mannor. Arbitrarily modulated markov decision processes. In Proceedings of the 48h IEEE Conference on Decision and Control, pages 2946–2953, 2009.
- Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. arXiv preprint arXiv:1901.00210, 2019.
- Kaiqing Zhang, Zhuoran Yang, and Tamer Basar. Multi-agent reinforcement learning: A selective overview of theories and algorithms. arXiv preprint arXiv:1911.10635, 2019.
- Alexander Zimin and Gergely Neu. Online learning in episodic markovian decision processes by relative entropy policy search. In Advances in neural information processing systems, pages 1583–1591, 2013.
- This finishes the proof. The proof is based on a standard online-to-batch conversion (see e.g. Section 3.1 of Jin et al., 2018). Let (μk, νk) denote the policies deployed by the VI-ULCB algorithm in episode k. We sample (μ, ν) uniformly as μ ∼ Unif({μ1, ..., μK}) and ν ∼ Unif({ν1, ..., νK}).
- The theorem is almost an immediate consequence of the general result on mirror descent (Rakhlin and Sridharan, 2013). However, for completeness, we provide a self-contained proof here. The main ingredient in our proof is to show that a “natural” loss estimator satisfies desirable properties—such as unbiasedness and bounded variance—for the standard analysis of mirror descent type algorithms to go through.
- (3) Unbiased estimate of lk(a). For any fixed action a ∈ A, we have
- (4) Bounded variance: one can check that
- (3) Unbiased estimate of Q1·,νk(s1, ·). For any fixed action a, when ak = a happens, s2k is drawn from the MDP transition P1(·|s1, a). Therefore, letting Fk−1 be the σ-algebra encoding all the information observed by the end of episode k − 1, we have that
- (4) Bounded variance: one can check that
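The unbiasedness property in these proof notes is the standard one for importance-weighted loss estimators in mirror-descent/EXP3-style analyses: play a ~ policy, observe only the played action's loss, and reweight by the play probability. The sketch below is generic, with hypothetical names (`iw_loss_estimate`, the two-action example), not the paper's exact estimator; it empirically checks that E[lhat(a)] matches the true loss l(a) for every action.

```python
import random

def iw_loss_estimate(policy, played, loss):
    """Importance-weighted loss estimator: lhat(a) = loss * 1{a == played} / policy[played].

    Unbiased over the random draw of `played` from `policy`:
    E[lhat(a)] = policy[a] * (l(a) / policy[a]) = l(a) for every action a.
    Generic sketch of the standard construction.
    """
    return {a: (loss / policy[played] if a == played else 0.0) for a in policy}

# Empirical check of unbiasedness on a hypothetical two-action problem.
rng = random.Random(0)
policy = {"left": 0.5, "right": 0.5}
true_loss = {"left": 0.2, "right": 0.8}
N = 20000
sums = {a: 0.0 for a in policy}
for _ in range(N):
    played = rng.choices(list(policy), weights=list(policy.values()))[0]
    est = iw_loss_estimate(policy, played, true_loss[played])
    for a in policy:
        sums[a] += est[a]
avg = {a: sums[a] / N for a in policy}   # should be close to true_loss
```

The companion "bounded variance" step in the notes corresponds to the fact that each estimate is at most loss / policy[a], so its second moment is controlled whenever the play probabilities are bounded below, which is what lets the standard mirror descent analysis go through.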
