# Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments

Advances in Neural Information Processing Systems 30 (NIPS 2017), pp. 6379–6390.

Abstract

We explore deep reinforcement learning methods for multi-agent domains. We begin by analyzing the difficulty of traditional algorithms in the multi-agent case: Q-learning is challenged by an inherent non-stationarity of the environment, while policy gradient suffers from a variance that increases as the number of agents grows. We then pre...
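The variance problem mentioned in the abstract can be made concrete with a toy calculation (a hedged sketch, not the paper's formal analysis): if reward arrives only when all N agents simultaneously choose the correct action, each doing so with probability 0.5, then the chance that a sampled episode carries any learning signal decays exponentially in N.

```python
# Toy illustration (not the paper's formal proposition): with N agents each
# independently picking the rewarded action with probability 0.5, a sampled
# trajectory carries reward signal only with probability 0.5 ** N, so
# single-sample gradient estimates become increasingly noisy as N grows.
def signal_probability(n_agents, p_correct=0.5):
    return p_correct ** n_agents

for n in (1, 2, 5, 10):
    print(n, signal_probability(n))
```

With ten agents the probability is already below 0.1%, which matches the intuition that naïve policy gradients scale poorly with the number of agents.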

Introduction

- Reinforcement learning (RL) has recently been applied to solve challenging problems, from game playing [24, 29] to robotics [18].
- Multi-robot control [21], the discovery of communication and language [31, 8, 25], multiplayer games [28], and the analysis of social dilemmas [17] all operate in a multi-agent domain.
- Related problems, such as variants of hierarchical reinforcement learning [6], can be seen as multi-agent systems, with multiple levels of hierarchy being equivalent to multiple agents.

Highlights

- Reinforcement learning (RL) has recently been applied to solve challenging problems, from game playing [24, 29] to robotics [18]
- 4.1 Multi-Agent Actor-Critic: We have argued in the previous section that naïve policy gradient methods perform poorly in simple multi-agent settings, and this is supported in our experiments in Section 5
- As shown in Section 5, we find the centralized critic with deterministic policies works very well in practice, and refer to it as multi-agent deep deterministic policy gradient (MADDPG)
- In the covert communication environment, we found that Bob trained with either MADDPG or DDPG outperforms Eve in terms of reconstructing Alice’s message
- We have proposed a multi-agent policy gradient algorithm where agents learn a centralized critic based on the observations and actions of all agents
- One downside to our approach is that the input space of Q grows linearly with the number of agents N

Methods

**Training**

- 4.1 Multi-Agent Actor-Critic: The authors have argued in the previous section that naïve policy gradient methods perform poorly in simple multi-agent settings, and this is supported in the experiments in Section 5.
- The approach operates under two constraints: (1) the learned policies can only use local information at execution time, and (2) the authors do not assume a differentiable model of the environment dynamics, unlike in [25]. (Figure 1: Overview of the multi-agent decentralized actor, centralized critic approach.)
- The authors propose a simple extension of actor-critic policy gradient methods where the critic is augmented with extra information about the policies of other agents
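The decentralized-actor, centralized-critic idea above can be sketched in a few lines. This is a minimal illustration with hypothetical sizes and linear actor/critic functions, not the paper's neural-network implementation: each actor acts on its own observation only, while each critic scores the joint observations and actions of all agents, so the critic's input size grows linearly with N.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 3          # number of agents (hypothetical size for illustration)
OBS_DIM = 4    # per-agent observation dimension (hypothetical)
ACT_DIM = 2    # per-agent action dimension (hypothetical)

# Each actor i is a linear policy mu_i(o_i): it sees only its OWN observation.
actors = [rng.normal(size=(ACT_DIM, OBS_DIM)) for _ in range(N)]

# Each centralized critic Q_i sees the observations and actions of ALL agents,
# so its input dimension is N * (OBS_DIM + ACT_DIM) -- linear in N, which is
# the downside noted in the paper's conclusion.
critic_in_dim = N * (OBS_DIM + ACT_DIM)
critics = [rng.normal(size=critic_in_dim) for _ in range(N)]  # linear Q_i

obs = [rng.normal(size=OBS_DIM) for _ in range(N)]

def act(i):
    # Decentralized execution: actor i uses only its local observation o_i.
    return actors[i] @ obs[i]

def q_value(i, observations, actions):
    # Centralized training: Q_i(o_1, ..., o_N, a_1, ..., a_N).
    x = np.concatenate(observations + actions)
    return float(critics[i] @ x)

actions = [act(i) for i in range(N)]
print(critic_in_dim)
print(q_value(0, obs, actions))
```

Because the extra (observations, actions) inputs are used only by the critic, they can be dropped at execution time, which is what allows decentralized deployment of the learned actors.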

Results

- In all of the experiments, the authors use the Adam optimizer with a learning rate of 0.01 and τ = 0.01 for updating the target networks.
- The size of the replay buffer is 10⁶, and the network parameters are updated after every 100 samples added to the replay buffer.
- The authors use a batch size of 1024 episodes before making an update, except for TRPO, where a batch size of 50 led to better performance.
- The compared algorithms are MADDPG, DDPG, DQN, Actor-Critic, TRPO, and REINFORCE.
- The authors train with 10 random seeds for environments with stark success/ fail conditions and 3 random seeds for the other environments.

Conclusion

**Conclusions and Future Work**

The authors have proposed a multi-agent policy gradient algorithm where agents learn a centralized critic based on the observations and actions of all agents.

- One downside to the approach is that the input space of Q grows linearly with the number of agents N
- This could be remedied in practice by, for example, having a modular Q function that only considers agents in a certain neighborhood of a given agent.
- The authors leave this investigation to future work
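The "modular Q function over a neighborhood" idea floated above can be sketched as follows. The sizes, and the use of Euclidean distance as the notion of neighborhood, are hypothetical choices for illustration; the paper leaves the design open.

```python
import numpy as np

rng = np.random.default_rng(1)

N, OBS_DIM, ACT_DIM, K = 8, 4, 2, 3  # hypothetical sizes; K = neighborhood size

positions = rng.normal(size=(N, 2))      # 2-D agent positions (illustrative)
obs = rng.normal(size=(N, OBS_DIM))
acts = rng.normal(size=(N, ACT_DIM))

def neighborhood_critic_input(i, k):
    # Keep only the k nearest agents (including agent i itself), so the
    # critic input is fixed by k instead of growing linearly with N.
    dists = np.linalg.norm(positions - positions[i], axis=1)
    nearest = np.argsort(dists)[:k]
    return np.concatenate([np.concatenate([obs[j], acts[j]]) for j in nearest])

full_dim = N * (OBS_DIM + ACT_DIM)                 # grows linearly with N
local_dim = neighborhood_critic_input(0, K).size   # fixed by K
print(full_dim, local_dim)
```

The trade-off is that a neighborhood critic can no longer condition on distant agents, so it trades some of the non-stationarity fix for scalability.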

- Table1: Percentage of episodes where the agent reached the target landmark and average distance from the target in the cooperative communication environment, after 25000 episodes. Note that the percentage of targets reached is different than the policy learning success rate in Figure 6, which indicates the percentage of runs in which the correct policy was learned (consistently reaching the target landmark). Even when the correct behavior is learned, agents occasionally hover slightly outside the target landmark on some episodes, and conversely agents who learn to go to the middle of the landmarks occasionally stumble upon the correct landmark
- Table2: Average # of collisions per episode and average agent distance from a landmark in the cooperative navigation task, using 2-layer 128 unit MLP policies
- Table3: Average number of prey touches by predator per episode on two predator-prey environments with N = L = 3, one where the prey (adversaries) are slightly (30%) faster (PP1), and one where they are significantly (100%) faster (PP2). All policies in this experiment are 2-layer 128 unit MLPs
- Table4: Results on the physical deception task, with N = 2 and 4 cooperative agents/landmarks. Success (succ %) for agents (AG) and adversaries (ADV) is if they are within a small distance from the target landmark
- Table5: Agent (Bob) and adversary (Eve) success rate (succ %, i.e. correctly reconstructing the speaker’s message) in the covert communication environment. The input message is drawn from a set of two 4-dimensional one-hot vectors
- Table6: Evaluations of the adversary agent w./w.o. policy ensembles over 1000 trials on different scenarios including (a) keep-away (KA) with N = M = 1, (b) physical deception (PD) with N = 2 and (c) predator-prey (PP) with N = 4 and L = 1. S. denotes agents with a single policy. E. denotes agents with policy ensembles
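The policy-ensemble evaluation in Table 6 relies on a simple mechanism: each agent maintains K sub-policies and samples one uniformly at the start of every episode, which makes its behavior harder for opponents to exploit. A minimal sketch (K and the policy names are placeholders):

```python
import random

random.seed(0)

K = 3  # ensemble size (hypothetical)

# Each agent holds K sub-policies; at the start of every episode one is
# drawn uniformly at random, so opponents face a less predictable agent.
ensemble = [f"sub_policy_{k}" for k in range(K)]

def sample_episode_policy(ensemble):
    return random.choice(ensemble)

picks = [sample_episode_policy(ensemble) for _ in range(1000)]
# Over many episodes every sub-policy gets selected.
print(len(set(picks)))  # 3
```

Only the sampled sub-policy is updated with the experience it collects, so the ensemble members specialize into diverse behaviors rather than collapsing into one.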

Related Work

- The simplest approach to learning in multi-agent settings is to use independently learning agents. This was attempted with Q-learning in [36], but does not perform well in practice [23]. As we will show, independently learning policy gradient methods also perform poorly. One issue is that each agent’s policy changes during training, resulting in a non-stationary environment and preventing the naïve application of experience replay. Previous work has attempted to address this by inputting other agents’ policy parameters to the Q function [37], explicitly adding the iteration index to the replay buffer, or using importance sampling [9]. Deep Q-learning approaches have previously been investigated in [35] to train competing Pong agents.

Funding

- Ryan Lowe is supported in part by a Vanier CGS Scholarship and the Samsung Advanced Institute of Technology.

References

- DeepMind AI reduces Google data centre cooling bill by 40%. https://deepmind.com/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-40/. Accessed: 2017-05-19.
- M. Abadi and D. G. Andersen. Learning to protect communications with adversarial neural cryptography. arXiv preprint arXiv:1610.06918, 2016.
- C. Boutilier. Learning conventions in multiagent stochastic domains using likelihood estimates. In Proceedings of the Twelfth international conference on Uncertainty in artificial intelligence, pages 106–114. Morgan Kaufmann Publishers Inc., 1996.
- L. Busoniu, R. Babuska, and B. De Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2):156, 2008.
- G. Chalkiadakis and C. Boutilier. Coordination in multiagent reinforcement learning: a bayesian approach. In Proceedings of the second international joint conference on Autonomous agents and multiagent systems, pages 709–716. ACM, 2003.
- P. Dayan and G. E. Hinton. Feudal reinforcement learning. In Advances in neural information processing systems, pages 271–271. Morgan Kaufmann Publishers, 1993.
- J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson. Counterfactual multi-agent policy gradients. arXiv preprint arXiv:1705.08926, 2017.
- J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson. Learning to communicate with deep multi-agent reinforcement learning. CoRR, abs/1605.06676, 2016.
- J. N. Foerster, N. Nardelli, G. Farquhar, P. H. S. Torr, P. Kohli, and S. Whiteson. Stabilising experience replay for deep multi-agent reinforcement learning. CoRR, abs/1702.08887, 2017.
- M. C. Frank and N. D. Goodman. Predicting pragmatic reasoning in language games. Science, 336(6084):998–998, 2012.
- I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
- J. K. Gupta, M. Egorov, and M. Kochenderfer. Cooperative multi-agent control using deep reinforcement learning. 2017.
- J. Hu and M. P. Wellman. Online learning about other agents in a dynamic multiagent system. In Proceedings of the Second International Conference on Autonomous Agents, AGENTS ’98, pages 239–246, New York, NY, USA, 1998. ACM.
- E. Jang, S. Gu, and B. Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
- M. Lauer and M. Riedmiller. An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 535–542. Morgan Kaufmann, 2000.
- A. Lazaridou, A. Peysakhovich, and M. Baroni. Multi-agent cooperation and the emergence of (natural) language. arXiv preprint arXiv:1612.07182, 2016.
- J. Z. Leibo, V. F. Zambaldi, M. Lanctot, J. Marecki, and T. Graepel. Multi-agent reinforcement learning in sequential social dilemmas. CoRR, abs/1702.03037, 2017.
- S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. arXiv preprint arXiv:1504.00702, 2015.
- T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- M. L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the eleventh international conference on machine learning, volume 157, pages 157–163, 1994.
- L. Matignon, L. Jeanpierre, A.-I. Mouaddib, et al. Coordinated multi-robot exploration under communication constraints using decentralized markov decision processes. In AAAI, 2012.
- L. Matignon, G. J. Laurent, and N. Le Fort-Piat. Hysteretic q-learning: an algorithm for decentralized reinforcement learning in cooperative multi-agent teams. In Intelligent Robots and Systems, 2007. IROS 2007. IEEE/RSJ International Conference on, pages 64–69. IEEE, 2007.
- L. Matignon, G. J. Laurent, and N. Le Fort-Piat. Independent reinforcement learners in cooperative markov games: a survey regarding coordination problems. The Knowledge Engineering Review, 27(01):1–31, 2012.
- V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- I. Mordatch and P. Abbeel. Emergence of grounded compositional language in multi-agent populations. arXiv preprint arXiv:1703.04908, 2017.
- S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian. Deep decentralized multi-task multi-agent reinforcement learning under partial observability. CoRR, abs/1703.06182, 2017.
- L. Panait and S. Luke. Cooperative multi-agent learning: The state of the art. Autonomous Agents and Multi-Agent Systems, 11(3):387–434, Nov. 2005.
- P. Peng, Q. Yuan, Y. Wen, Y. Yang, Z. Tang, H. Long, and J. Wang. Multiagent bidirectionallycoordinated nets for learning to play starcraft combat games. CoRR, abs/1703.10069, 2017.
- D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484 – 489, 2016.
- D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, pages 387–395, 2014.
- S. Sukhbaatar, R. Fergus, et al. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, pages 2244–2252, 2016.
- S. Sukhbaatar, I. Kostrikov, A. Szlam, and R. Fergus. Intrinsic motivation and automatic curricula via asymmetric self-play. arXiv preprint arXiv:1703.05407, 2017.
- R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
- R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
- A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente. Multiagent cooperation and competition with deep reinforcement learning. PloS one, 12(4):e0172395, 2017.
- M. Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the tenth international conference on machine learning, pages 330–337, 1993.
- G. Tesauro. Extending q-learning to general adaptive multi-agent systems. In Advances in neural information processing systems, pages 871–878, 2004.
- P. S. Thomas and A. G. Barto. Conjugate markov decision processes. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 137–144, 2011.
- R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
