Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning.

ICML (2017)


Abstract

Many real-world problems, such as network packet routing and urban traffic control, are naturally modeled as multi-agent reinforcement learning (RL) problems. However, existing multi-agent RL methods typically scale poorly in the problem size. Therefore, a key challenge is to translate the success of deep learning on single-agent RL to th...

Introduction
  • Reinforcement learning (RL), which enables an agent to learn control policies on-line given only sequences of observations and rewards, has emerged as a dominant paradigm for training autonomous systems.
  • Many real-world problems, such as network packet delivery (Ye et al, 2015), rubbish removal (Makar et al, 2001), and urban traffic control (Kuyer et al, 2008; Van der Pol & Oliehoek, 2016), are naturally modeled as cooperative multi-agent systems.
  • A centralised approach that treats the system as a single meta-agent is problematic: the size of the meta-agent's action space grows exponentially in the number of agents.
  • A centralised approach is also not applicable when each agent receives different observations that may not disambiguate the state, in which case decentralised policies must be learned.
  • The parameters θ are learned by sampling batches of b transitions from the replay memory and minimising the squared TD-error $\mathcal{L}(\theta) = \sum_{i=1}^{b} \big[ (y_i^{\mathrm{DQN}} - Q(s, u; \theta))^2 \big]$, where $y_i^{\mathrm{DQN}} = r_i + \gamma \max_{u'} Q(s', u'; \theta^-)$ and $\theta^-$ are the parameters of a periodically updated target network (a minimal sketch of this update appears after this list).
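To make this update concrete, below is a minimal, hypothetical PyTorch sketch of the standard DQN-style replay step described above. The names `q_net`, `target_net`, `replay`, and `td_update` are illustrative assumptions, not the authors' implementation (the paper's experiments are built on Torch7).

```python
import random

import torch
import torch.nn.functional as F

# Hypothetical illustration: sample a batch of b transitions from a replay
# memory and minimise the squared TD-error. `q_net` and `target_net` are
# assumed to be torch.nn.Module Q-networks mapping a batch of states to
# per-action values; `replay` is a list of (s, u, r, s_next, done) tensor tuples.

def td_update(q_net, target_net, optimizer, replay, b=32, gamma=0.99):
    batch = random.sample(replay, b)                      # sample b transitions
    s, u, r, s_next, done = map(torch.stack, zip(*batch))

    # Q(s, u; theta) for the actions actually taken.
    q_taken = q_net(s).gather(1, u.long().unsqueeze(1)).squeeze(1)

    # Bootstrapped target y = r + gamma * max_u' Q(s', u'; theta^-),
    # computed with the periodically copied target network.
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values

    loss = F.mse_loss(q_taken, y)                         # squared TD-error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```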
Highlights
  • Reinforcement learning (RL), which enables an agent to learn control policies on-line given only sequences of observations and rewards, has emerged as a dominant paradigm for training autonomous systems
  • To avoid the difficulty of combining independent Q-learning (IQL) with experience replay, previous work on deep multi-agent RL has limited the use of experience replay to short, recent buffers (Leibo et al, 2017) or disabled replay altogether (Foerster et al, 2016)
  • It is easier to overfit to single observations, and experience replay is more essential for a feed-forward model
  • This paper proposed two methods for stabilising experience replay in deep multi-agent reinforcement learning: 1) using a multi-agent variant of importance sampling to naturally decay obsolete data and 2) conditioning each agent’s value function on a fingerprint that disambiguates the age of the data sampled from the replay memory (a minimal sketch of the importance-sampling weighting follows this list)
  • Results on a challenging decentralised variant of StarCraft unit micromanagement confirmed that these methods enable the successful combination of experience replay with multiple agents
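The first of these methods can be illustrated with a short, hypothetical sketch. It assumes each replayed transition stores the product of the other agents' action probabilities at collection time, and it reweights the squared TD-error by how likely those joint actions are under the agents' current policies; the names `weighted_td_loss`, `p_others_now`, and `p_others_then` are illustrative, not taken from the paper.

```python
import torch

# Hypothetical sketch of the multi-agent importance-sampling correction:
# transitions generated under obsolete teammate policies receive low weight,
# so their influence naturally decays as the other agents' policies change.

def weighted_td_loss(td_error, p_others_now, p_others_then, eps=1e-8):
    """
    td_error:       per-transition TD-errors, shape (b,)
    p_others_now:   prod over teammates of pi^{-a}(u^{-a} | s) under the current policies
    p_others_then:  the same product recorded when the transition was collected
    """
    weights = p_others_now / (p_others_then + eps)
    # Normalising within the batch (a common variance-reduction choice, not
    # necessarily the paper's exact scheme) keeps the update scale stable.
    weights = weights / weights.sum()
    return (weights * td_error ** 2).sum()
```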
Methods
  • Methods like hyper Q-learning (Tesauro, 2003), discussed in Section 3.2, and AWESOME (Conitzer & Sandholm, 2007) try to tackle nonstationarity by tracking and conditioning each agent’s learning process on their teammates’ current policy, while Da Silva et al (2006) propose detecting and tracking different classes of traces on which to condition policy learning. Kok & Vlassis (2006) show that coordination can be learnt by estimating a global Q-function in the classical distributed setting supplemented with a coordination graph.
  • Leibo et al (2017) analyse the emergence of cooperation and defection when using multiagent RL in mixed-cooperation environments such as the wolfpack problem
  • He et al (2016) address multi-agent learning by explicitly marginalising the opponents’ strategy using a mixture of experts in the DQN.
  • To avoid the difficulty of combining IQL with experience replay, previous work on deep multi-agent RL has limited the use of experience replay to short, recent buffers (Leibo et al, 2017) or disabled replay altogether (Foerster et al, 2016).
  • The derivation of the non-stationary parts of the Bellman equation in the partially observable multi-agent setting is considerably more complex, as the agents’ action-observation histories are correlated in a complex fashion that depends on the agents’ policies as well as the transition and observation functions.
Results
  • The authors present the results of the StarCraft experiments, summarised in Figure 2.
  • When exploratory actions do occur, agents visit areas of the state space that have not had their Q-values updated for many iterations, and bootstrap off of values which have become stale or distorted by updates to the Q-function elsewhere.
  • This effect can harm or destabilise the policy.
  • It is easier to overfit to single observations, and experience replay is more essential for a feed-forward model
Conclusion
  • This paper proposed two methods for stabilising experience replay in deep multi-agent reinforcement learning: 1) using a multi-agent variant of importance sampling to naturally decay obsolete data and 2) conditioning each agent’s value function on a fingerprint that disambiguates the age of the data sampled from the replay memory (a minimal sketch of such a fingerprint appears after this list).
  • The authors would like to apply these methods to a broader range of nonstationary training problems, such as classification on changing data, and extend them to multiagent actor-critic methods
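As a companion to the conclusion above, here is a minimal, hypothetical sketch of the fingerprint idea: appending to each agent's observation a small vector, such as the normalised training iteration and the exploration rate ε, that disambiguates when the replayed data was generated. The function name and normalisation choice are illustrative assumptions, not the authors' code.

```python
import numpy as np

# Hypothetical sketch: augment an agent's observation with a low-dimensional
# fingerprint of training time, so the value function can condition on the
# "age" of experience sampled from the replay memory.

def add_fingerprint(obs, train_iteration, epsilon, max_iterations):
    # Normalise the iteration count so the fingerprint stays in a small range.
    fingerprint = np.array(
        [train_iteration / float(max_iterations), epsilon], dtype=np.float32
    )
    return np.concatenate([np.asarray(obs, dtype=np.float32), fingerprint])
```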
Related Work
Funding
  • This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement #637713).
  • This work was also supported by the Oxford-Google DeepMind Graduate Scholarship, the Microsoft Research PhD Scholarship Program, EPSRC AIMS CDT grant EP/L015987/1, ERC grant ERC-2012-AdG 321162-HELIOS, EPSRC grant Seebibyte EP/M013774/1 and EPSRC/MURI grant EP/N019474/1.
References
  • Busoniu, Lucian, Babuska, Robert, and De Schutter, Bart. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 38(2):156, 2008.
  • Chung, Junyoung, Gulcehre, Caglar, Cho, KyungHyun, and Bengio, Yoshua. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
  • Ciosek, Kamil and Whiteson, Shimon. OFFER: Off-environment reinforcement learning. 2017.
  • Collobert, R., Kavukcuoglu, K., and Farabet, C. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.
  • Conitzer, Vincent and Sandholm, Tuomas. AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. Machine Learning, 67(1-2):23–43, 2007.
  • Da Silva, Bruno C, Basso, Eduardo W, Bazzan, Ana LC, and Engel, Paulo M. Dealing with non-stationary environments using context detection. In Proceedings of the 23rd International Conference on Machine Learning, pp. 217–224. ACM, 2006.
  • Foerster, Jakob, Assael, Yannis M, de Freitas, Nando, and Whiteson, Shimon. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2137–2145, 2016.
  • Hausknecht, Matthew and Stone, Peter. Deep recurrent Q-learning for partially observable MDPs. arXiv preprint arXiv:1507.06527, 2015.
  • Hausknecht, Matthew, Mupparaju, Prannoy, Subramanian, Sandeep, Kalyanakrishnan, S, and Stone, P. Half field offense: an environment for multiagent learning and ad hoc teamwork. In AAMAS Adaptive Learning Agents (ALA) Workshop, 2016.
  • He, He, Boyd-Graber, Jordan, Kwok, Kevin, and Daumé III, Hal. Opponent modeling in deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pp. 1804–1813, 2016.
  • Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • Jorge, Emilio, Kageback, Mikael, and Gustavsson, Emil. Learning to play Guess Who? and inventing a grounded language as a consequence. arXiv preprint arXiv:1611.03218, 2016.
  • Kok, Jelle R and Vlassis, Nikos. Collaborative multiagent reinforcement learning by payoff propagation. Journal of Machine Learning Research, 7(Sep):1789–1828, 2006.
  • Kuyer, Lior, Whiteson, Shimon, Bakker, Bram, and Vlassis, Nikos. Multiagent reinforcement learning for urban traffic control using coordination graphs. In ECML 2008: Proceedings of the Nineteenth European Conference on Machine Learning, pp. 656–671, September 2008.
  • Lauer, Martin and Riedmiller, Martin. An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In Proceedings of the Seventeenth International Conference on Machine Learning. Citeseer, 2000.
  • Leibo, Joel Z, Zambaldi, Vinicius, Lanctot, Marc, Marecki, Janusz, and Graepel, Thore. Multi-agent reinforcement learning in sequential social dilemmas. arXiv preprint arXiv:1702.03037, 2017.
  • Makar, Rajbala, Mahadevan, Sridhar, and Ghavamzadeh, Mohammad. Hierarchical multi-agent reinforcement learning. In Proceedings of the Fifth International Conference on Autonomous Agents, pp. 246–253. ACM, 2001.
  • Mataric, Maja J. Using communication to reduce locality in distributed multiagent learning. Journal of Experimental & Theoretical Artificial Intelligence, 10(3):357–369, 1998.
  • Matignon, Laetitia, Laurent, Guillaume J, and Le Fort-Piat, Nadine. Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems. The Knowledge Engineering Review, 27(1):1–31, 2012.
  • Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Robert, C.P. and Casella, G. Monte Carlo Statistical Methods. Springer, New York, 2004.
  • Schaul, Tom, Quan, John, Antonoglou, Ioannis, and Silver, David. Prioritized experience replay. CoRR, abs/1511.05952, 2015.
  • Shoham, Y. and Leyton-Brown, K. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, New York, 2009.
  • Sukhbaatar, Sainbayar, Fergus, Rob, et al. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, pp. 2244–2252, 2016.
  • Sutton, Richard S and Barto, Andrew G. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.
  • Synnaeve, Gabriel, Nardelli, Nantas, Auvolat, Alex, Chintala, Soumith, Lacroix, Timothee, Lin, Zeming, Richoux, Florian, and Usunier, Nicolas. TorchCraft: a library for machine learning research on real-time strategy games. arXiv preprint arXiv:1611.00625, 2016.
  • Tampuu, Ardi, Matiisen, Tambet, Kodelja, Dorian, Kuzovkin, Ilya, Korjus, Kristjan, Aru, Juhan, Aru, Jaan, and Vicente, Raul. Multiagent cooperation and competition with deep reinforcement learning. arXiv preprint arXiv:1511.08779, 2015.
  • Tan, Ming. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, pp. 330–337, 1993.
  • Tesauro, Gerald. Extending Q-learning to general adaptive multi-agent systems. In NIPS, volume 4, 2003.
  • Usunier, Nicolas, Synnaeve, Gabriel, Lin, Zeming, and Chintala, Soumith. Episodic exploration for deep deterministic policies: An application to StarCraft micromanagement tasks. arXiv preprint arXiv:1609.02993, 2016.
  • Van der Pol, Elise and Oliehoek, Frans A. Coordinated deep reinforcement learners for traffic light control. In NIPS'16 Workshop on Learning, Inference and Control of Multi-Agent Systems, 2016.
  • Wang, Ziyu, Bapst, Victor, Heess, Nicolas, Mnih, Volodymyr, Munos, Remi, Kavukcuoglu, Koray, and de Freitas, Nando. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.
  • Watkins, Christopher John Cornish Hellaby. Learning from delayed rewards. PhD thesis, University of Cambridge, England, 1989.
  • Yang, Erfu and Gu, Dongbing. Multiagent reinforcement learning for multi-robot systems: A survey. Technical report, 2004.
  • Ye, Dayong, Zhang, Minjie, and Yang, Yun. A multi-agent framework for packet routing in wireless sensor networks. Sensors, 15(5):10026–10047, 2015.
  • Zawadzki, E., Lipson, A., and Leyton-Brown, K. Empirically evaluating multiagent learning algorithms. arXiv preprint arXiv:1401.8074, 2014.