# Near-Optimal Reinforcement Learning with Self-Play

NeurIPS 2020.

Abstract:

This paper considers the problem of designing optimal algorithms for reinforcement learning in two-player zero-sum games. We focus on self-play algorithms which learn the optimal policy by playing against themselves without any direct supervision. In a tabular episodic Markov game with $S$ states, $A$ max-player actions and $B$ min-player actions, …


Introduction

- A wide range of modern artificial intelligence challenges can be cast as a multi-agent reinforcement learning problem, in which more than one agent performs sequential decision making in an interactive environment.
- Theorem 3 asserts that if the optimistic Nash Q-learning algorithm is run for more than $O(H^5SAB\iota/\epsilon^2)$ episodes, the certified policies $(\hat\mu, \hat\nu)$ extracted using Algorithm 2 form an $\epsilon$-approximate Nash equilibrium (Definition 2).
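The $\epsilon$-approximate Nash criterion (Definition 2) invoked above is, in the standard formulation, a bound on the duality gap at the initial state. A sketch in the usual notation; the best-response symbols $V^{\dagger,\nu}$ and $V^{\mu,\dagger}$ follow common convention and are not quoted from the paper:

```latex
% Best-response values against a fixed opponent:
%   V^{\dagger,\nu} := \max_{\mu} V^{\mu,\nu}, \qquad
%   V^{\mu,\dagger} := \min_{\nu} V^{\mu,\nu}.
% (\hat\mu, \hat\nu) is an \epsilon-approximate Nash equilibrium
% when the duality gap at the initial state s_1 is small:
V_1^{\dagger,\hat\nu}(s_1) - V_1^{\hat\mu,\dagger}(s_1) \le \epsilon .
```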

Highlights

- A wide range of modern artificial intelligence challenges can be cast as a multi-agent reinforcement learning problem, in which more than one agent performs sequential decision making in an interactive environment
- The biggest AlphaGo Zero model was trained on tens of millions of games and took more than a month to train [31]. While such a number of samples may be acceptable in simulatable environments such as Go, it is not in sample-expensive real-world settings such as robotics and autonomous driving
- It is important for us to understand the sample complexity of RL: how can we design algorithms that find a near-optimal policy with a small number of samples, and what is the fundamental limit, i.e., the minimum number of samples required for any algorithm to find a good policy?
- We propose an optimistic variant of Nash Q-learning [11], and prove that it achieves sample complexity $O(H^5SAB/\epsilon^2)$ for finding an $\epsilon$-approximate Nash equilibrium in two-player Markov games (Section 3)
- In Sections 3 and 4, we present sharp guarantees for learning an approximate Nash equilibrium with near-optimal sample complexity
- Apart from finding Nash equilibria, we prove the computational hardness of finding the best response to a fixed opponent, as well as of achieving sublinear regret against adversarial opponents, in Markov games
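The flavor of the optimistic Nash Q-learning update mentioned above can be sketched in a few lines. This is illustrative, not the authors' code: the bonus constant `c`, the log factor `iota`, and the fictitious-play routine (a simple stand-in for the matrix-game Nash oracle, valid for zero-sum games) are assumptions of this sketch.

```python
import math

def fictitious_play_value(Q, iters=2000):
    """Approximate the minimax value of a zero-sum matrix game Q[a][b]
    (row player maximizes) by fictitious play, which converges in
    zero-sum games regardless of tie-breaking (Robinson, 1951)."""
    A, B = len(Q), len(Q[0])
    row_counts, col_counts = [0] * A, [0] * B
    row_counts[0] = col_counts[0] = 1  # initial beliefs
    for _ in range(iters):
        # Row player best-responds to the column player's empirical mixture.
        a = max(range(A), key=lambda i: sum(Q[i][j] * col_counts[j] for j in range(B)))
        # Column player best-responds (minimizes) to the row player's mixture.
        b = min(range(B), key=lambda j: sum(Q[i][j] * row_counts[i] for i in range(A)))
        row_counts[a] += 1
        col_counts[b] += 1
    n, m = sum(row_counts), sum(col_counts)
    # Expected payoff under the product of the two empirical mixtures.
    return sum(Q[i][j] * row_counts[i] * col_counts[j]
               for i in range(A) for j in range(B)) / (n * m)

def nash_q_update(Q_sab, r, V_next, t, H, iota=1.0, c=1.0):
    """One optimistic Q-learning step for a (state, action-pair) entry:
    learning rate alpha_t = (H+1)/(H+t) and an exploration bonus
    beta_t on the order of sqrt(H^3 * iota / t)."""
    alpha = (H + 1) / (H + t)
    beta = c * math.sqrt(H ** 3 * iota / t)
    return (1 - alpha) * Q_sab + alpha * (r + V_next + beta)
```

After each update, the value at the current state and step would be refreshed as the Nash value of the updated Q matrix, e.g. `V[h][s] = fictitious_play_value(Q[h][s])`; the full algorithm additionally tracks lower-bound estimates to certify the extracted policies.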

Results

- Theorem 4 asserts that if the optimistic Nash V-learning algorithm is run for more than $O(H^6S(A+B)\iota/\epsilon^2)$ episodes, the certified policies $(\hat\mu, \hat\nu)$ extracted from Algorithm 4 form an $\epsilon$-approximate Nash equilibrium (Definition 2).
- Nash V-learning is the first algorithm whose sample complexity matches the information-theoretic lower bound $\Omega(H^3S(A+B)/\epsilon^2)$ up to poly$(H)$ factors and logarithmic terms.
- The authors further show that this implies a computational hardness result for achieving sublinear regret in Markov games when playing against adversarial opponents, which rules out a popular approach to designing algorithms for finding Nash equilibria.
- The authors first remark that if the opponent is restricted to playing only Markov policies, learning the best response is as easy as learning an optimal policy in a standard single-agent Markov decision process, for which efficient algorithms are known to exist.
- The intuitive reason for this computational hardness is that, while the underlying system has Markov transitions, the opponent can play policies that encode long-term, non-Markovian correlations, such as parity with noise, which makes finding the best response very challenging.
- Similar to Theorem 6, Corollary 8 combined with Conjecture 7 demonstrates the fundamental difficulty of designing a polynomial-time no-regret algorithm against adversarial opponents in Markov games.
- It is not hard to see that if the min-player also runs a no-regret algorithm and obtains a regret bound symmetric to (7), then summing the two regret bounds shows that the mixture policies $(\hat\mu, \hat\nu)$, which assign uniform mixing weights to the policies $\{\mu^k\}_{k=1}^K$ and $\{\nu^k\}_{k=1}^K$ respectively, form an approximate Nash equilibrium.
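The regret-summation argument in the last bullet can be made explicit. A sketch, assuming both players obtain the symmetric regret bound $R(K)$ over $K$ episodes and using that the value is linear in the uniform mixing weights of $\hat\mu$ and $\hat\nu$:

```latex
% Max-player regret against the min-player's realized policies \nu^1,\dots,\nu^K:
\max_{\mu} \sum_{k=1}^{K} \Big( V_1^{\mu,\nu^k}(s_1) - V_1^{\mu^k,\nu^k}(s_1) \Big) \le R(K),
% and symmetrically for the min-player:
\max_{\nu} \sum_{k=1}^{K} \Big( V_1^{\mu^k,\nu^k}(s_1) - V_1^{\mu^k,\nu}(s_1) \Big) \le R(K).
% Summing the two bounds cancels the on-path terms V_1^{\mu^k,\nu^k};
% dividing by K and using linearity of the value in the mixtures:
V_1^{\dagger,\hat\nu}(s_1) - V_1^{\hat\mu,\dagger}(s_1) \le \frac{2R(K)}{K}.
```

Choosing $K$ large enough that $2R(K)/K \le \epsilon$ then yields an $\epsilon$-approximate Nash equilibrium.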

Conclusion

- The authors designed the first line of near-optimal self-play algorithms for finding an approximate Nash equilibrium in two-player Markov games.
- Apart from finding Nash equilibria, the authors prove the computational hardness of finding the best response to a fixed opponent, as well as of achieving sublinear regret against adversarial opponents, in Markov games.


- Table 1: Sample complexity (the required number of episodes) for algorithms to find $\epsilon$-approximate Nash equilibrium policies in zero-sum Markov games
- Table 2: Transition kernel of the hard instance
- Table 3: Reward of the hard instance

Related work

- Markov games. Markov games (or stochastic games) were proposed in the early 1950s [28]. They are widely used to model multi-agent RL. Learning the Nash equilibria of Markov games has been studied in classical work [18, 19, 11, 10], where the transition matrix and reward are assumed to be known, or in the asymptotic setting where the amount of data goes to infinity. These results do not directly apply to the non-asymptotic setting, where the transition and reward are unknown and only a limited amount of data is available for estimating them.

| Algorithm | Sample Complexity | Runtime |
|---|---|---|
| VI-ULCB [2] | $O(H^4S^2AB/\epsilon^2)$ | PPAD-complete |
| VI-explore [2] | $O(H^5S^2AB/\epsilon^2)$ | Polynomial |
| OMVI-SM [36] | $O(H^4S^3A^3B^3/\epsilon^2)$ | Polynomial |
| Optimistic Nash Q-learning | $O(H^5SAB/\epsilon^2)$ | Polynomial |
| Optimistic Nash V-learning | $O(H^6S(A+B)/\epsilon^2)$ | Polynomial |
| Lower Bound [14, 2] | $\Omega(H^3S(A+B)/\epsilon^2)$ | – |

A recent line of work tackles self-play algorithms for Markov games in the non-asymptotic setting under strong reachability assumptions. Specifically, Wei et al. [35] assume that no matter what strategy one agent sticks to, the other agent can always reach all states by playing a certain policy, while Jia et al. [13] and Sidford et al. [29] assume access to simulators (or generative models) that enable the agent to directly sample transition and reward information for any state-action pair. These settings ensure that all states can be reached directly, so no sophisticated exploration is required.

Reference

- Mohammad Gheshlaghi Azar, Ian Osband, and Remi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 263–272. JMLR.org, 2017.
- Yu Bai and Chi Jin. Provable self-play algorithms for competitive reinforcement learning. arXiv preprint arXiv:2002.04017, 2020.
- Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, and Igor Mordatch. Emergent tool use from multi-agent autocurricula. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SkxpxJBKwS.
- Ronen I Brafman and Moshe Tennenholtz. R-max: a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
- Manuele Brambilla, Eliseo Ferrante, Mauro Birattari, and Marco Dorigo. Swarm robotics: a review from the swarm engineering perspective. Swarm Intelligence, 7(1):1–41, 2013.
- Noam Brown and Tuomas Sandholm. Superhuman AI for multiplayer poker. Science, 365(6456):885–890, 2019.
- Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.
- Christoph Dann, Lihong Li, Wei Wei, and Emma Brunskill. Policy certificates: Towards accountable reinforcement learning. In International Conference on Machine Learning, pages 1507–1516, 2019.
- Jerzy Filar and Koos Vrieze. Competitive Markov decision processes. Springer Science & Business Media, 2012.
- Thomas Dueholm Hansen, Peter Bro Miltersen, and Uri Zwick. Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. Journal of the ACM (JACM), 60(1):1–16, 2013.
- Junling Hu and Michael P Wellman. Nash q-learning for general-sum stochastic games. Journal of machine learning research, 4(Nov):1039–1069, 2003.
- Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
- Zeyu Jia, Lin F Yang, and Mengdi Wang. Feature-based q-learning for two-player stochastic games. arXiv preprint arXiv:1906.00423, 2019.
- Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4868–4878, 2018.
- Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, and Tiancheng Yu. Learning adversarial markov decision processes with bandit feedback and unknown transition. arXiv preprint arXiv:1912.01192, 2019.
- Michael Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM (JACM), 45(6):983–1006, 1998.
- Tor Lattimore and Csaba Szepesvari. Bandit algorithms. 2018.
- Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994, pages 157–163, 1994.
- Michael L Littman. Friend-or-foe q-learning in general-sum games. In ICML, volume 1, pages 322–328, 2001.
- Elchanan Mossel and Sebastien Roch. Learning nonsingular phylogenies and hidden markov models. In Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, pages 366–375, 2005.
- Gergely Neu. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In Advances in Neural Information Processing Systems, pages 3168–3176, 2015.
- OpenAI. Openai five. https://blog.openai.com/openai-five/, 2018.
- Ian Osband and Benjamin Van Roy. On lower bounds for regret in reinforcement learning. arXiv preprint arXiv:1608.02732, 2016.
- Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635, 2014.
- Goran Radanovic, Rati Devidze, David Parkes, and Adish Singla. Learning to collaborate in markov decision processes. In International Conference on Machine Learning, pages 5261–5270, 2019.
- Aviv Rosenberg and Yishay Mansour. Online convex optimization in adversarial markov decision processes. arXiv preprint arXiv:1905.07773, 2019.
- Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295, 2016.
- Lloyd S Shapley. Stochastic games. Proceedings of the National Academy of Sciences, 39(10):1095–1100, 1953.
- Aaron Sidford, Mengdi Wang, Lin F Yang, and Yinyu Ye. Solving discounted stochastic two-player games with near-optimal time and sample complexity. arXiv preprint arXiv:1908.11071, 2019.
- David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
- David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017.
- Alexander L Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L Littman. PAC model-free reinforcement learning. In International Conference on Machine Learning, pages 881–888, 2006.
- Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michael Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
- Christopher John Cornish Hellaby Watkins. Learning from delayed rewards. 1989.
- Chen-Yu Wei, Yi-Te Hong, and Chi-Jen Lu. Online reinforcement learning in stochastic games. In Advances in Neural Information Processing Systems, pages 4987–4997, 2017.
- Qiaomin Xie, Yudong Chen, Zhaoran Wang, and Zhuoran Yang. Learning zero-sum simultaneous-move markov games using function approximation and correlated equilibrium. arXiv preprint arXiv:2002.07066, 2020.
- Yasin Abbasi Yadkori, Peter L Bartlett, Varun Kanade, Yevgeny Seldin, and Csaba Szepesvari. Online learning in markov decision processes with adversarially chosen transition probability distributions. In Advances in neural information processing systems, pages 2508–2516, 2013.
- Alexander Zimin and Gergely Neu. Online learning in episodic markovian decision processes by relative entropy policy search. In Advances in Neural Information Processing Systems, pages 1583–1591, 2013.
