## AI Summary

AI-extracted summary of this paper:

# Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning

ICML (2017)


## Abstract

Many real-world problems, such as network packet routing and urban traffic control, are naturally modeled as multi-agent reinforcement learning (RL) problems. However, existing multi-agent RL methods typically scale poorly in the problem size. Therefore, a key challenge is to translate the success of deep learning on single-agent RL to th...

## Introduction

- Reinforcement learning (RL), which enables an agent to learn control policies on-line given only sequences of observations and rewards, has emerged as a dominant paradigm for training autonomous systems.
- Many real-world problems, such as network packet delivery (Ye et al, 2015), rubbish removal (Makar et al, 2001), and urban traffic control (Kuyer et al, 2008; Van der Pol & Oliehoek, 2016), are naturally modeled as cooperative multi-agent systems.
- A centralised approach that treats all agents as a single meta-agent is problematic because the size of this meta-agent’s action space grows exponentially in the number of agents.
- It is also not applicable when each agent receives different observations that may not disambiguate the state, in which case decentralised policies must be learned.
- The parameters θ are learned by sampling batches of b transitions from the replay memory and minimising the squared TD-error:

  $$\mathcal{L}(\theta) = \sum_{i=1}^{b}\left(y_i^{DQN} - Q(s_i, u_i; \theta)\right)^2, \qquad y_i^{DQN} = r_i + \gamma \max_{u'} Q(s'_i, u'; \theta^-),$$

  where θ⁻ are the parameters of a periodically updated target network.
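The sampling-and-update step in the last bullet can be sketched as follows. This is a minimal, self-contained illustration using dictionary-backed Q-tables and a hypothetical `td_update` helper, not the paper's actual deep-network implementation:

```python
import random

def td_update(Q, target_Q, replay_memory, b, gamma=0.99, lr=0.1):
    """One update step: sample a batch of b transitions from the replay
    memory and minimise the squared TD-error by a gradient step.
    Q and target_Q map (state, action) pairs to values (dicts here).
    Each transition is (s, a, r, s_next, available_actions)."""
    batch = random.sample(replay_memory, min(b, len(replay_memory)))
    loss = 0.0
    for s, a, r, s_next, actions in batch:
        # Bootstrap target from the (periodically frozen) target network.
        y = r + gamma * max(target_Q.get((s_next, a2), 0.0) for a2 in actions)
        td_error = y - Q.get((s, a), 0.0)
        loss += td_error ** 2
        # Gradient step on the squared TD-error w.r.t. Q(s, a).
        Q[(s, a)] = Q.get((s, a), 0.0) + lr * td_error
    return loss / len(batch)
```

In the paper's DQN setting, `Q` and `target_Q` would be neural networks sharing one architecture, and the gradient step would update the network parameters θ rather than table entries.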

## Highlights

- Reinforcement learning (RL), which enables an agent to learn control policies on-line given only sequences of observations and rewards, has emerged as a dominant paradigm for training autonomous systems
- To avoid the difficulty of combining independent Q-learning (IQL) with experience replay, previous work on deep multi-agent RL has limited the use of experience replay to short, recent buffers (Leibo et al, 2017) or disabled replay altogether (Foerster et al, 2016)
- A feed-forward model overfits to single observations more easily, so experience replay is more essential for it
- This paper proposed two methods for stabilising experience replay in deep multi-agent reinforcement learning: 1) using a multi-agent variant of importance sampling to naturally decay obsolete data and 2) conditioning each agent’s value function on a fingerprint that disambiguates the age of the data sampled from the replay memory
- Results on a challenging decentralised variant of StarCraft unit micromanagement confirmed that these methods enable the successful combination of experience replay with multiple agents
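The fingerprint idea in the highlights above can be sketched as follows: each agent's observation is augmented with quantities that disambiguate when in the training process a transition was collected. The paper uses the training-iteration number and the exploration rate ε; the helper name below is hypothetical:

```python
def fingerprint_observation(obs, train_iteration, epsilon, max_iterations):
    """Append a fingerprint to an observation vector so the value
    function can condition on the age of data sampled from replay:
    the normalised training iteration and the current exploration rate."""
    return list(obs) + [train_iteration / max_iterations, epsilon]
```

The fingerprint is stored with each transition and replayed alongside the observation, letting the network disambiguate values learned under the other agents' older, since-changed policies.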

## Methods

- Methods like Hyper-Q-learning (Tesauro, 2003), discussed in Section 3.2, and AWESOME (Conitzer & Sandholm, 2007) try to tackle nonstationarity by tracking and conditioning each agent’s learning process on their teammates’ current policy, while Da Silva et al (2006) propose detecting and tracking different classes of traces on which to condition policy learning.
- Kok & Vlassis (2006) show that coordination can be learnt by estimating a global Q-function in the classical distributed setting, supplemented with a coordination graph.
- Leibo et al (2017) analyse the emergence of cooperation and defection when using multiagent RL in mixed-cooperation environments such as the wolfpack problem.
- He et al (2016) address multi-agent learning by explicitly marginalising the opponents’ strategy using a mixture of experts in the DQN.
- To avoid the difficulty of combining IQL with experience replay, previous work on deep multi-agent RL has limited the use of experience replay to short, recent buffers (Leibo et al, 2017) or disabled replay altogether (Foerster et al, 2016).
- The derivation of the non-stationary parts of the Bellman equation in the partially observable multi-agent setting is considerably more complex, as the agents’ action-observation histories are correlated in a complex fashion that depends on the agents’ policies as well as the transition and observation functions.

## Results

- The authors present the results of the StarCraft experiments, summarised in Figure 2.
- When exploratory actions do occur, agents visit areas of the state space that have not had their Q-values updated for many iterations, and bootstrap off of values which have become stale or distorted by updates to the Q-function elsewhere.
- This effect can harm or destabilise the policy.
- A feed-forward model overfits to single observations more easily, so experience replay is more essential for it

## Conclusion

- This paper proposed two methods for stabilising experience replay in deep multi-agent reinforcement learning: 1) using a multi-agent variant of importance sampling to naturally decay obsolete data and 2) conditioning each agent’s value function on a fingerprint that disambiguates the age of the data sampled from the replay memory.
- The authors would like to apply these methods to a broader range of nonstationary training problems, such as classification on changing data, and extend them to multiagent actor-critic methods
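The first method from the conclusion, multi-agent importance sampling, can be sketched as follows: each replayed transition is reweighted by the ratio of the other agents' current policy probabilities to the probabilities recorded when the transition was stored, so data generated under obsolete policies naturally decays in influence. This is a simplified sketch under those assumptions (the function name is hypothetical; the paper derives the correction for the partially observable setting):

```python
def importance_weight(current_policies, recorded_probs, other_actions):
    """Off-environment importance weight for one replayed transition.

    current_policies: one dict per other agent, mapping action ->
        probability under that agent's *current* policy.
    recorded_probs: probability each other agent assigned to its chosen
        action when the transition was stored in the replay memory.
    other_actions: the actions the other agents actually took.
    """
    w = 1.0
    for pi, p_old, a in zip(current_policies, recorded_probs, other_actions):
        w *= pi[a] / p_old  # shrinks as the other agents' policies drift
    return w
```

This weight would multiply the squared TD-error of the corresponding transition in the replay loss, so stale experience contributes less to the update.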

## Related Work

- Multi-agent RL has a rich history (Busoniu et al, 2008; Yang & Gu, 2004) but has mostly focused on tabular settings and simple environments. The most commonly used method is independent Q-learning (Tan, 1993; Shoham & Leyton-Brown, 2009; Zawadzki et al, 2014), which we discuss further in Section 3.2.

## Funding

- This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement #637713).
- This work was also supported by the Oxford-Google DeepMind Graduate Scholarship, the Microsoft Research PhD Scholarship Program, EPSRC AIMS CDT grant EP/L015987/1, ERC grant ERC-2012-AdG 321162-HELIOS, EPSRC grant Seebibyte EP/M013774/1 and EPSRC/MURI grant EP/N019474/1.

## References

- Busoniu, Lucian, Babuska, Robert, and De Schutter, Bart. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems Man and Cybernetics Part C Applications and Reviews, 38(2):156, 2008.
- Chung, Junyoung, Gulcehre, Caglar, Cho, KyungHyun, and Bengio, Yoshua. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
- Ciosek, Kamil and Whiteson, Shimon. OFFER: Off-environment reinforcement learning. 2017.
- Collobert, R., Kavukcuoglu, K., and Farabet, C. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.
- Conitzer, Vincent and Sandholm, Tuomas. Awesome: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. Machine Learning, 67(1-2):23–43, 2007.
- Da Silva, Bruno C, Basso, Eduardo W, Bazzan, Ana LC, and Engel, Paulo M. Dealing with non-stationary environments using context detection. In Proceedings of the 23rd international conference on Machine learning, pp. 217–224. ACM, 2006.
- Foerster, Jakob, Assael, Yannis M, de Freitas, Nando, and Whiteson, Shimon. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2137–2145, 2016.
- Hausknecht, Matthew and Stone, Peter. Deep recurrent Q-learning for partially observable MDPs. arXiv preprint arXiv:1507.06527, 2015.
- Hausknecht, Matthew, Mupparaju, Prannoy, Subramanian, Sandeep, Kalyanakrishnan, S, and Stone, P. Half field offense: an environment for multiagent learning and ad hoc teamwork. In AAMAS Adaptive Learning Agents (ALA) Workshop, 2016.
- He, He, Boyd-Graber, Jordan, Kwok, Kevin, and Daume III, Hal. Opponent modeling in deep reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, pp. 1804–1813, 2016.
- Hochreiter, Sepp and Schmidhuber, Jurgen. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Jorge, Emilio, Kageback, Mikael, and Gustavsson, Emil. Learning to play guess who? and inventing a grounded language as a consequence. arXiv preprint arXiv:1611.03218, 2016.
- Kok, Jelle R and Vlassis, Nikos. Collaborative multiagent reinforcement learning by payoff propagation. Journal of Machine Learning Research, 7(Sep):1789–1828, 2006.
- Kuyer, Lior, Whiteson, Shimon, Bakker, Bram, and Vlassis, Nikos. Multiagent reinforcement learning for urban traffic control using coordination graphs. In ECML 2008: Proceedings of the Nineteenth European Conference on Machine Learning, pp. 656–671, September 2008.
- Lauer, Martin and Riedmiller, Martin. An algorithm for distributed reinforcement learning in cooperative multiagent systems. In In Proceedings of the Seventeenth International Conference on Machine Learning. Citeseer, 2000.
- Leibo, Joel Z, Zambaldi, Vinicius, Lanctot, Marc, Marecki, Janusz, and Graepel, Thore. Multi-agent reinforcement learning in sequential social dilemmas. arXiv preprint arXiv:1702.03037, 2017.
- Makar, Rajbala, Mahadevan, Sridhar, and Ghavamzadeh, Mohammad. Hierarchical multi-agent reinforcement learning. In Proceedings of the fifth international conference on Autonomous agents, pp. 246–253. ACM, 2001.
- Mataric, Maja J. Using communication to reduce locality in distributed multiagent learning. Journal of experimental & theoretical artificial intelligence, 10(3):357–369, 1998.
- Matignon, Laetitia, Laurent, Guillaume J, and Le Fort-Piat, Nadine. Independent reinforcement learners in cooperative markov games: a survey regarding coordination problems. The Knowledge Engineering Review, 27(01): 1–31, 2012.
- Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529– 533, 2015.
- Robert, C. P. and Casella, G. Monte Carlo Statistical Methods. Springer, New York, 2004.
- Schaul, Tom, Quan, John, Antonoglou, Ioannis, and Silver, David. Prioritized experience replay. CoRR, abs/1511.05952, 2015.
- Shoham, Y. and Leyton-Brown, K. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, New York, 2009.
- Sukhbaatar, Sainbayar, Fergus, Rob, et al. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, pp. 2244–2252, 2016.
- Sutton, Richard S and Barto, Andrew G. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
- Wang, Ziyu, Bapst, Victor, Heess, Nicolas, Mnih, Volodymyr, Munos, Remi, Kavukcuoglu, Koray, and de Freitas, Nando. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.
- Watkins, Christopher John Cornish Hellaby. Learning from delayed rewards. PhD thesis, University of Cambridge England, 1989.
- Yang, Erfu and Gu, Dongbing. Multiagent reinforcement learning for multi-robot systems: A survey. Technical report, 2004.
- Ye, Dayong, Zhang, Minjie, and Yang, Yun. A multiagent framework for packet routing in wireless sensor networks. sensors, 15(5):10026–10047, 2015.
- Zawadzki, E., Lipson, A., and Leyton-Brown, K. Empirically evaluating multiagent learning algorithms. arXiv preprint 1401.8074, 2014.
- Synnaeve, Gabriel, Nardelli, Nantas, Auvolat, Alex, Chintala, Soumith, Lacroix, Timothee, Lin, Zeming, Richoux, Florian, and Usunier, Nicolas. Torchcraft: a library for machine learning research on real-time strategy games. arXiv preprint arXiv:1611.00625, 2016.
- Tampuu, Ardi, Matiisen, Tambet, Kodelja, Dorian, Kuzovkin, Ilya, Korjus, Kristjan, Aru, Juhan, Aru, Jaan, and Vicente, Raul. Multiagent cooperation and competition with deep reinforcement learning. arXiv preprint arXiv:1511.08779, 2015.
- Tan, Ming. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the tenth international conference on machine learning, pp. 330– 337, 1993.
- Tesauro, Gerald. Extending q-learning to general adaptive multi-agent systems. In NIPS, volume 4, 2003.
- Usunier, Nicolas, Synnaeve, Gabriel, Lin, Zeming, and Chintala, Soumith. Episodic exploration for deep deterministic policies: An application to starcraft micromanagement tasks. arXiv preprint arXiv:1609.02993, 2016.
- Van der Pol, Elise and Oliehoek, Frans A. Coordinated deep reinforcement learners for traffic light control. In NIPS’16 Workshop on Learning, Inference and Control of Multi-Agent Systems, 2016.
