TL;DR: We argue that coordinating the agents' policies can guide their exploration and we investigate techniques to promote such an inductive bias.

Promoting Coordination through Policy Regularization in Multi-Agent Deep Reinforcement Learning

NeurIPS 2020


Abstract

In multi-agent reinforcement learning, discovering successful collective behaviors is challenging as it requires exploring a joint action space that grows exponentially with the number of agents. While the tractability of independent agent-wise exploration is appealing, this approach fails on tasks that require elaborate group strategies. […]

Introduction
  • Multi-Agent Reinforcement Learning (MARL) refers to the task of training an agent to maximize its expected return by interacting with an environment that contains other learning agents.
  • One approach leverages centralized critics to approximate the value function of the aggregated observation-action pairs and trains actors restricted to the observations of a single agent
  • Such critics, if exposed to coordinated joint actions leading to high returns, can steer the agents’ policies toward these highly rewarding behaviors.
  • A Markov Game of N independent agents is defined by the tuple ⟨S, T, P, {O_i, A_i, R_i}_{i=1}^N⟩, where S, T and P are respectively the set of all possible states, the transition function and the initial state distribution
  • While these are global properties of the environment, O_i, A_i and R_i are individually defined for each agent i.
  • The initial state s_0 is sampled from the initial state distribution P : S → [0, 1] (a formal sketch of this setting follows this list)
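A minimal formal sketch of this setting, consistent with the definitions above (the discount factor γ and the per-agent policies π^i are standard notation assumed here rather than stated in this summary):

    % Markov Game and per-agent objective (sketch)
    \mathcal{M} = \big\langle \mathcal{S},\ \mathcal{T},\ \mathcal{P},\ \{\mathcal{O}^i, \mathcal{A}^i, \mathcal{R}^i\}_{i=1}^{N} \big\rangle,
    \qquad s_0 \sim \mathcal{P}, \qquad s_{t+1} \sim \mathcal{T}\big(\cdot \mid s_t, a_t^1, \dots, a_t^N\big)

    % Each agent i acts from its own observation o_t^i \in \mathcal{O}^i through a policy \pi^i
    % and seeks to maximize its expected discounted return
    J^i(\pi^i) = \mathbb{E}\Big[ \textstyle\sum_{t \ge 0} \gamma^{t}\, r_t^i \Big],
    \qquad r_t^i = \mathcal{R}^i\big(s_t, a_t^1, \dots, a_t^N\big)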
Highlights
  • Multi-Agent Reinforcement Learning (MARL) refers to the task of training an agent to maximize its expected return by interacting with an environment that contains other learning agents
  • A popular framework for MARL is the use of a Centralized Training and Decentralized Execution (CTDE) procedure (Lowe et al., 2017; Foerster et al., 2018; Iqbal & Sha, 2019; Foerster et al., 2019; Rashid et al., 2018); a minimal sketch of this split follows this list
  • We evaluate them by extending MADDPG, a state-of-the-art algorithm widely used in the MARL literature
  • We compare against vanilla MADDPG as well as two of its variants in the four cooperative multi-agent tasks described in Section 6
  • We observe that CoachReg significantly improves performance on three environments (SPREAD, BOUNCE and COMPROMISE)
  • After showing how crucial coordination can be for multi-agent learning, we proposed two policy regularization methods to promote multi-agent coordination within the CTDE framework: TeamReg, which extends inter-agent modelling to bias the policy search towards predictability, and CoachReg, which enforces collective and synchronized sub-policy selection
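As a minimal illustration of the centralized-training / decentralized-execution split mentioned above (a sketch only, not the authors' implementation; the PyTorch framing, module names, layer sizes and activations are assumptions), the critic conditions on the joint observations and actions of all agents while each actor sees only its own observation:

    import torch
    import torch.nn as nn

    class CentralizedCritic(nn.Module):
        """Q(o_1..o_N, a_1..a_N): trained with access to all agents' observations and actions."""
        def __init__(self, n_agents, obs_dim, act_dim, hidden=128):
            super().__init__()
            joint_dim = n_agents * (obs_dim + act_dim)
            self.net = nn.Sequential(
                nn.Linear(joint_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, joint_obs, joint_act):
            # joint_obs: (batch, n_agents * obs_dim), joint_act: (batch, n_agents * act_dim)
            return self.net(torch.cat([joint_obs, joint_act], dim=-1))

    class DecentralizedActor(nn.Module):
        """pi_i(o_i): at execution time each agent acts from its local observation only."""
        def __init__(self, obs_dim, act_dim, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, act_dim), nn.Tanh(),
            )

        def forward(self, obs_i):
            return self.net(obs_i)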
Results
  • The proposed methods offer a way to incorporate new inductive biases in CTDE multi-agent policy search algorithms.
  • The first ablation (MADDPG + agent modelling) is similar to TeamReg but with λ2 = 0, which results in only enforcing agent modelling and not encouraging agent predictability.
  • The second ablation (MADDPG + policy mask) uses the same policy architecture as CoachReg, but with λ1,2,3 = 0, which means that agents still predict and apply a mask to their own policy, but synchronicity is not encouraged; a schematic of how these weighted terms combine is sketched after this list
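The ablations above differ only in which regularization weights are set to zero. The following schematic (an assumption-laden sketch based solely on the descriptions in this summary, not the paper's exact losses; the term names are placeholders) shows how such weighted auxiliary terms would combine with the base actor objective:

    def teamreg_actor_loss(policy_loss, agent_modelling_loss, predictability_loss,
                           lambda_1, lambda_2):
        """Schematic TeamReg-style objective: the base MADDPG policy loss plus weighted
        auxiliary terms. With lambda_2 = 0 this reduces to the 'MADDPG + agent modelling'
        ablation; with lambda_1 = lambda_2 = 0 it is vanilla MADDPG."""
        return policy_loss + lambda_1 * agent_modelling_loss + lambda_2 * predictability_loss

    def coachreg_actor_loss(policy_loss, coordination_losses, lambdas):
        """Schematic CoachReg-style objective: with all lambdas at 0 the agents still
        predict and apply a policy mask (the 'MADDPG + policy mask' ablation), but
        synchronized sub-policy selection is no longer encouraged."""
        return policy_loss + sum(lam * term for lam, term in zip(lambdas, coordination_losses))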
Conclusion
  • After showing how crucial coordination can be for multi-agent learning, the authors proposed two policy regularization methods to promote multi-agent coordination within the CTDE framework: TeamReg, which extends inter-agent modelling to bias the policy search towards predictability, and CoachReg, which enforces collective and synchronized sub-policy selection.
  • Interesting avenues for future work would be to study the proposed regularizations on other policy search methods as well as to combine both incentives and investigate how the two coordinating objectives interact.
  • A promising direction is to explore model-based planning approaches to promote long-term multi-agent interactions
Objectives
  • The authors aim to answer the following question: can coordination help the discovery of effective policies in cooperative tasks?
  • Intuitively, coordination can be defined as an agent's behavior being informed by the behavior of another agent, i.e. structure in the agents' interactions.
  • The authors also analyze the effect of λ2 on cooperative tasks and investigate whether it makes the agent modelling task more successful
Tables
  • Table 1: Ranges for hyper-parameter search; the log base is 10
  • Table 2: Best found hyper-parameters for the SPREAD environment
  • Table 3: Best found hyper-parameters for the BOUNCE environment
  • Table 4: Best found hyper-parameters for the CHASE environment
  • Table 5: Best found hyper-parameters for the COMPROMISE environment
Related work
  • Many works in MARL consider explicit communication channels between the agents and distinguish between communicative actions (e.g. broadcasting a given message) and physical actions (e.g. moving in a given direction) (Foerster et al., 2016; Mordatch & Abbeel, 2018; Lazaridou et al., 2016). Consequently, they often focus on the emergence of language, considering tasks where the agents must discover a common communication protocol to succeed. Deriving a successful communication protocol can already be seen as coordination in the communicative action space and can enable, to some extent, successful coordination in the physical action space (Ahilan & Dayan, 2019). Yet, explicit communication is not a necessary condition for coordination as agents can rely on physical communication (Mordatch & Abbeel, 2018; Gupta et al., 2017).

    Approaches to shape RL agents' behaviors with respect to other agents have also been explored. Strouse et al. (2018) use the mutual information between the agent's policy and a goal-independent policy to shape the agent's behavior towards hiding or spelling out its current goal. However, this approach is only applicable to tasks with an explicit goal representation and is not specifically intended for coordination. Jaques et al. (2019) approximate the direct causal effect between agents' actions and use it as an intrinsic reward to encourage social empowerment. This approximation relies on each agent learning a model of other agents' policies to predict its effect on them. In general, this type of behavior prediction can be referred to as agent modelling (or opponent modelling) and has been used in previous work to enrich representations when amongst stationary non-learning teammates (Hernandez-Leal et al., 2019), to stabilise the learning dynamics (He et al., 2016) or to classify the opponent's play style (Schadd et al., 2007).
Funding
  • We would also like to thank Fonds de Recherche Nature et Technologies (FRQNT), Ubisoft Montreal and Mitacs for providing funding for this work as well as Compute Canada for providing the computing resources
References
  • Ahilan, S. and Dayan, P. Feudal multi-agent hierarchies for cooperative reinforcement learning. arXiv preprint arXiv:1901.08492, 2019.
  • Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • Chentanez, N., Barto, A. G., and Singh, S. P. Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1281–1288, 2005.
  • Foerster, J., Assael, I. A., de Freitas, N., and Whiteson, S. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2137–2145, 2016.
  • Foerster, J., Song, F., Hughes, E., Burch, N., Dunning, I., Whiteson, S., Botvinick, M., and Bowling, M. Bayesian action decoder for deep multi-agent reinforcement learning. In International Conference on Machine Learning, 2019.
  • Foerster, J. N., Farquhar, G., Afouras, T., Nardelli, N., and Whiteson, S. Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Gupta, J. K., Egorov, M., and Kochenderfer, M. J. Cooperative multi-agent control using deep reinforcement learning. In AAMAS Workshops, 2017.
  • He, H., Boyd-Graber, J., Kwok, K., and Daumé III, H. Opponent modeling in deep reinforcement learning. In International Conference on Machine Learning, pp. 1804–1813, 2016.
  • Hernandez-Leal, P., Kartal, B., and Taylor, M. E. Is multiagent deep reinforcement learning the answer or the question? A brief survey. arXiv preprint arXiv:1810.05587, 2018.
  • Hernandez-Leal, P., Kartal, B., and Taylor, M. E. Agent modeling as auxiliary task for deep reinforcement learning. In AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, 2019.
  • Iqbal, S. and Sha, F. Actor-attention-critic for multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 2961–2970, 2019.
  • Jang, E., Gu, S., and Poole, B. Categorical reparameterization with Gumbel-softmax. In International Conference on Learning Representations (ICLR), 2017.
  • Jaques, N., Lazaridou, A., Hughes, E., Gulcehre, C., Ortega, P., Strouse, D., Leibo, J. Z., and De Freitas, N. Social influence as intrinsic motivation for multi-agent deep reinforcement learning. In International Conference on Machine Learning, pp. 3040–3049, 2019.
  • Jiang, J. and Lu, Z. Learning attentional communication for multi-agent cooperation. In Advances in Neural Information Processing Systems, pp. 7254–7264, 2018.
  • Kaelbling, L. P., Littman, M. L., and Moore, A. W. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
  • Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Lazaridou, A., Peysakhovich, A., and Baroni, M. Multi-agent cooperation and the emergence of (natural) language. arXiv preprint arXiv:1612.07182, 2016.
  • Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • Littman, M. L. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994, pp. 157–163.
  • Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., and Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pp. 6379–6390, 2017.
  • Mordatch, I. and Abbeel, P. Emergence of grounded compositional language in multi-agent populations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Nair, V. and Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814, 2010.
  • Rashid, T., Samvelyan, M., Witt, C. S., Farquhar, G., Foerster, J., and Whiteson, S. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 4292–4301, 2018.
  • Schadd, F., Bakkes, S., and Spronck, P. Opponent modeling in real-time strategy games. In GAMEON, pp. 61–70, 2007.
  • Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • Strouse, D., Kleiman-Weiner, M., Tenenbaum, J., Botvinick, M., and Schwab, D. J. Learning to share and hide intentions using information regularization. In Advances in Neural Information Processing Systems, pp. 10270–10281, 2018.
  • Uhlenbeck, G. E. and Ornstein, L. S. On the theory of the Brownian motion. Physical Review, 36(5):823, 1930.
  • Watkins, C. J. and Dayan, P. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
Experimental details
  • We trained each agent i with online Q-learning (Watkins & Dayan, 1992) on the Q_i(a_i, s) table using Boltzmann exploration (Kaelbling et al., 1996). The Boltzmann temperature is fixed to 1 and we set the learning rate to 0.05 and the discount factor to 0.99. After each learning episode we evaluate the current greedy policy on 10 episodes and report the mean return. Curves are averaged over 20 seeds and the shaded area represents the standard error. (A minimal sketch of this loop follows.)
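A minimal sketch of this tabular training loop, assuming a simple environment interface (env.reset() returning a state index and env.step(a) returning (next_state, reward, done)); only the update rule, Boltzmann sampling, temperature, learning rate and discount factor come from the description above:

    import numpy as np

    def boltzmann_action(q_row, temperature=1.0):
        # Sample an action with probability proportional to exp(Q / temperature).
        logits = q_row / temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return np.random.choice(len(q_row), p=probs)

    def train_agent(env, n_states, n_actions, episodes, lr=0.05, gamma=0.99):
        # One Q[s, a] table per agent in the tabular setting.
        Q = np.zeros((n_states, n_actions))
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                a = boltzmann_action(Q[s])
                s_next, r, done = env.step(a)       # assumed environment API
                target = r + (0.0 if done else gamma * Q[s_next].max())
                Q[s, a] += lr * (target - Q[s, a])  # online Q-learning update
                s = s_next
        return Q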
  • We perform searches over the following hyper-parameters: the learning rate of the actor αθ, the learning rate of the critic ωφ relative to the actor (αφ = ωφ · αθ), the target-network soft-update parameter τ and the initial scale of the exploration noise ηnoise for the Ornstein-Uhlenbeck noise-generating process (Uhlenbeck & Ornstein, 1930) as used by Lillicrap et al. (2015). When using TeamReg and CoachReg, we additionally search over the regularization weights λ1, λ2 and λ3. The learning rate of the coach is always equal to the actor's learning rate (i.e. αθ = αψ), motivated by their similar architectures and learning signals and in order to reduce the search space. Table 1 shows the ranges from which values for the hyper-parameters are drawn uniformly during the searches. (A sketch of one random draw follows.)
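A sketch of how one such random-search configuration could be drawn (the numeric ranges below are illustrative placeholders rather than the values from Table 1; log-scaled quantities are drawn uniformly in base-10 log space, as the Table 1 caption indicates):

    import random

    def log_uniform(low_exp, high_exp, base=10):
        # Uniform in log space (base 10, per the Table 1 caption).
        return base ** random.uniform(low_exp, high_exp)

    def sample_hyperparameters():
        # Ranges are illustrative placeholders; the actual ranges are listed in Table 1.
        alpha_theta = log_uniform(-5, -3)              # actor learning rate
        omega_phi = log_uniform(-1, 1)                 # critic lr relative to the actor
        return {
            "alpha_theta": alpha_theta,
            "alpha_phi": omega_phi * alpha_theta,      # critic lr = omega_phi * alpha_theta
            "tau": random.uniform(0.001, 0.05),        # target-network soft-update
            "eta_noise": random.uniform(0.1, 1.0),     # initial Ornstein-Uhlenbeck noise scale
            # regularization weights, searched only for TeamReg / CoachReg runs
            "lambda_1": log_uniform(-3, 1),
            "lambda_2": log_uniform(-3, 1),
            "lambda_3": log_uniform(-3, 1),
        }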