Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning

arXiv: Learning, 2018.

Abstract:

We propose a unified mechanism for achieving coordination and communication in Multi-Agent Reinforcement Learning (MARL), through rewarding agents for having causal influence over other agents' actions. Causal influence is assessed using counterfactual reasoning. At each timestep, an agent simulates alternate actions that it could have...
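To make the counterfactual computation concrete, here is a minimal NumPy sketch of an influence-style reward, not the authors' exact implementation. It assumes, purely for illustration, that the influencer has access to an estimate of another agent's action distribution conditioned on its own action (for instance from the internal model of other agents mentioned in the conclusions below); the reward is the KL divergence between that conditional distribution and the counterfactual marginal obtained by replacing the influencer's action with each alternative and averaging under its own policy.

```python
import numpy as np

def influence_reward(p_b_given_a, p_a, a_taken):
    """Counterfactual influence of agent A on agent B (illustrative sketch).

    p_b_given_a: array [n_actions_A, n_actions_B]; row i estimates
                 p(a_B | s, a_A = i) for the current state s.
    p_a:         array [n_actions_A]; agent A's own policy p(a_A | s), used to
                 average over the counterfactual actions A could have taken.
    a_taken:     index of the action A actually took at this timestep.
    """
    # What B is predicted to do given A's actual action.
    conditional = p_b_given_a[a_taken]            # p(a_B | s, a_A)
    # Counterfactual marginal: B's predicted behavior with A's action replaced
    # by each alternative and averaged under A's own policy.
    marginal = p_a @ p_b_given_a                  # p(a_B | s)
    # Influence reward = KL( p(a_B | s, a_A) || p(a_B | s) ).
    eps = 1e-12
    return float(np.sum(conditional * np.log((conditional + eps) / (marginal + eps))))

# Toy example with two actions per agent: A's choice strongly shifts B's
# predicted behavior, so the influence reward is clearly positive.
p_b_given_a = np.array([[0.9, 0.1],
                        [0.2, 0.8]])
p_a = np.array([0.5, 0.5])
print(influence_reward(p_b_given_a, p_a, a_taken=0))
```

In a multi-agent setting this quantity would be summed over the other agents and added, with a weighting coefficient, to the influencer's extrinsic reward.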

Introduction
  • Intrinsic Motivation for Reinforcement Learning (RL) refers to reward functions that allow agents to learn useful behavior across a variety of tasks and environments, sometimes in the absence of environmental reward (Singh et al, 2004).
  • While some previous work has investigated intrinsic social motivation for Reinforcement Learning (e.g. Sequeira et al (2011); Hughes et al (2018); Peysakhovich & Lerer (2018)), these approaches rely on hand-crafted rewards specific to the environment, or on allowing agents to view the rewards obtained by other agents.
  • Such assumptions make it impossible to achieve independent training of MARL agents across multiple environments.
Highlights
  • Intrinsic Motivation for Reinforcement Learning (RL) refers to reward functions that allow agents to learn useful behavior across a variety of tasks and environments, sometimes in the absence of environmental reward (Singh et al, 2004).
  • While some previous work has investigated intrinsic social motivation for Reinforcement Learning (e.g. Sequeira et al (2011); Hughes et al (2018); Peysakhovich & Lerer (2018)), these approaches rely on hand-crafted rewards specific to the environment, or on allowing agents to view the rewards obtained by other agents.
  • All three experiments have shown that the proposed intrinsic social influence reward consistently leads to higher collective return.
  • Despite variation in the tasks, hyper-parameters, neural network architectures and experimental setups, the learning curves of agents trained with the influence reward are significantly better than those of strong baseline agents such as A3C, and of their improved variants.
  • It is clear that influence is essential to achieve any form of learning, attesting to the promise of this idea and highlighting the complexity of learning general deep neural network multi-agent policies.
  • Experiment I showed that the influence reward can lead to the emergence of communication protocols.
Results
  • An agent's influence exceeds its mean influence in fewer than 10% of timesteps.
Conclusion
  • Conclusions and Future Work: All three experiments have shown that the proposed intrinsic social influence reward consistently leads to higher collective return.
  • Despite variation in the tasks, hyper-parameters, neural network architectures and experimental setups, the learning curves of agents trained with the influence reward are significantly better than those of strong baseline agents such as A3C, and of their improved variants.
  • Experiment I showed that the influence reward can lead to the emergence of communication protocols.
  • Experiment III showed that influence can be computed by augmenting agents with an internal model of other agents (see the sketch after this list).
  • The authors were able to surpass state-of-the-art performance on the SSDs studied here, despite the fact that previous work relied on agents' ability to view other agents' rewards.
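The internal model of other agents mentioned above can also be illustrated with a short sketch. Below is a hypothetical PyTorch head (not the authors' architecture): a supervised model trained to predict another agent's next action from the influencer's own observation features and the previous joint actions. Once trained, the influencer can feed counterfactual versions of its own previous action through this head and reuse the KL computation sketched after the abstract, so influence can be estimated without observing the other agent's policy or rewards.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OtherAgentModel(nn.Module):
    """Hypothetical model-of-other-agents head (illustrative sketch only).

    Predicts a distribution over another agent's next action from the
    influencer's own observation features and the previous joint actions.
    """
    def __init__(self, obs_dim, n_agents, n_actions, hidden=64):
        super().__init__()
        in_dim = obs_dim + n_agents * n_actions   # features + one-hot prev actions
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs_feat, prev_actions_onehot):
        # prev_actions_onehot: [batch, n_agents, n_actions]
        x = torch.cat([obs_feat, prev_actions_onehot.flatten(start_dim=-2)], dim=-1)
        return self.net(x)                        # logits over the other agent's next action

def moa_loss(model, obs_feat, prev_actions_onehot, observed_next_action):
    """Supervised cross-entropy loss on the other agent's observed next action."""
    logits = model(obs_feat, prev_actions_onehot)
    return F.cross_entropy(logits, observed_next_action)

# Tiny usage example with random data (shapes only, no real environment):
model = OtherAgentModel(obs_dim=8, n_agents=2, n_actions=4)
obs = torch.randn(1, 8)
prev = F.one_hot(torch.randint(0, 4, (1, 2)), num_classes=4).float()
loss = moa_loss(model, obs, prev, observed_next_action=torch.tensor([1]))

# At influence-computation time, the influencer replaces its own slot in
# prev_actions_onehot with each counterfactual action, applies softmax to the
# resulting logits, and compares the conditional and averaged (marginal)
# predictions with the KL-based reward above.
```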
Related work
  • Several attempts have been made to develop intrinsic social rewards. Sequeira et al (2011) developed hand-crafted rewards for a foraging environment, in which agents were punished for eating more than their fair share of food. Another approach gave agents an emotional intrinsic reward based on their perception of their neighbours' cooperativeness in a networked version of the iterated prisoner's dilemma, but this is limited to scenarios in which it is possible to directly classify each action as cooperative or non-cooperative (Yu et al, 2013). This is untenable in complex settings with long-term strategies, such as the SSDs under investigation here.

    Some approaches allow agents to view each other's rewards in order to optimize for collective reward. Peysakhovich & Lerer (2018) show that if even a single agent is trained to optimize for others' rewards, it can significantly help the group. Hughes et al (2018) introduced an inequity aversion motivation, which penalized agents if their rewards differed too much from those of the group. Liu et al (2014) train agents to learn their own optimal reward function in a cooperative, multi-agent setting with known group reward. However, the assumption that agents can view and optimize for each other's rewards may be unrealistic. Thus, recent work explores training agents that learn when to cooperate based solely on their own past rewards (Peysakhovich & Lerer, 2017).
References
  • Barton, S. L., Waytowich, N. R., Zaroukian, E., and Asher, D. E. Measuring collaborative emergent behavior in multi-agent reinforcement learning. arXiv preprint arXiv:1807.08663, 2018.
  • Bogin, B., Geva, M., and Berant, J. Emergence of communication in an interactive world with consistent speakers. arXiv preprint arXiv:1809.00549, 2018.
  • Cao, K., Lazaridou, A., Lanctot, M., Leibo, J. Z., Tuyls, K., and Clark, S. Emergent communication through negotiation. arXiv preprint arXiv:1804.03980, 2018.
  • Choi, E., Lazaridou, A., and de Freitas, N. Compositional obverter communication learning from raw visual input. arXiv preprint arXiv:1804.02341, 2018.
  • Crandall, J. W., Oudah, M., Chenlinangjia, T., Ishowo-Oloko, F., Abdallah, S., Bonnefon, J., Cebrian, M., Shariff, A., Goodrich, M. A., and Rahwan, I. Cooperating with machines. CoRR, abs/1703.06207, 2017. URL http://arxiv.org/abs/1703.06207.
  • Crawford, V. P. and Sobel, J. Strategic information transmission. Econometrica: Journal of the Econometric Society, pp. 1431–1451, 1982.
  • Devlin, S., Yliniemi, L., Kudenko, D., and Tumer, K. Potential-based difference rewards for multiagent reinforcement learning. In Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems, pp. 165–172. International Foundation for Autonomous Agents and Multiagent Systems, 2014.
  • Ferguson, H. J., Scheepers, C., and Sanford, A. J. Expectations in counterfactual and theory of mind reasoning. Language and Cognitive Processes, 25(3):297–346, 2010. doi: 10.1080/01690960903041174. URL https://doi.org/10.1080/01690960903041174.
  • Foerster, J., Assael, Y. M., de Freitas, N., and Whiteson, S. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2137–2145, 2016.
  • Foerster, J., Farquhar, G., Afouras, T., Nardelli, N., and Whiteson, S. Counterfactual multi-agent policy gradients. arXiv preprint arXiv:1705.08926, 2017.
  • Foerster, J., Chen, R. Y., Al-Shedivat, M., Whiteson, S., Abbeel, P., and Mordatch, I. Learning with opponent-learning awareness. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 122–130. International Foundation for Autonomous Agents and Multiagent Systems, 2018.
  • Forestier, S. and Oudeyer, P.-Y. A unified model of speech and tool use early development. In 39th Annual Conference of the Cognitive Science Society (CogSci 2017), 2017.
  • Gers, F. A., Schmidhuber, J., and Cummins, F. Learning to forget: Continual prediction with LSTM. 1999.
  • Guckelsberger, C., Salge, C., and Togelius, J. New and surprising ways to be mean: Adversarial NPCs with coupled empowerment minimisation. arXiv preprint arXiv:1806.01387, 2018.
  • Harari, Y. N. Sapiens: A brief history of humankind. Random House, 2014.
  • Henrich, J. The Secret of Our Success: How culture is driving human evolution, domesticating our species, and making us smart. Princeton University Press, Princeton, NJ, 2015. URL http://press.princeton.edu/titles/10543.html.
  • Herrmann, E., Call, J., Hernandez-Lloreda, M. V., Hare, B., and Tomasello, M. Humans have evolved specialized skills of social cognition: The cultural intelligence hypothesis. Science, 317(5843):1360–1366, 2007. ISSN 0036-8075. doi: 10.1126/science. 1146282. URL http://science.sciencemag.org/content/317/5843/1360.
  • Hughes, E., Leibo, J. Z., Phillips, M. G., Tuyls, K., Duenez-Guzman, E. A., Castaneda, A. G., Dunning, I., Zhu, T., McKee, K. R., Koster, R., et al. Inequity aversion improves cooperation in intertemporal social dilemmas. In Advances in neural information processing systems (NIPS), Montreal, Canada, 2018.
  • Klyubin, A. S., Polani, D., and Nehaniv, C. L. Empowerment: A universal agent-centric measure of control. In Evolutionary Computation, 2005. The 2005 IEEE Congress on, volume 1, pp. 128–135. IEEE, 2005.
  • Laland, K. N. Darwin's Unfinished Symphony: How Culture Made the Human Mind. Princeton University Press, Princeton, 2017. ISBN 9781400884872.
  • Lazaridou, A., Hermann, K. M., Tuyls, K., and Clark, S. Emergence of linguistic communication from referential games with symbolic and pixel input. arXiv preprint arXiv:1804.03984, 2018.
  • Leibo, J. Z., Zambaldi, V., Lanctot, M., Marecki, J., and Graepel, T. Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pp. 464– 473. International Foundation for Autonomous Agents and Multiagent Systems, 2017.
  • Liu, B., Singh, S., Lewis, R. L., and Qin, S. Optimal rewards for cooperative agents. IEEE Transactions on Autonomous Mental Development, 6(4):286–297, 2014.
  • Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, O. P., and Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pp. 6379–6390, 2017.
  • Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937, 2016.
  • Mohamed, S. and Rezende, D. J. Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in neural information processing systems, pp. 2125–2133, 2015.
  • Oudeyer, P.-Y. and Kaplan, F. Discovering communication. Connection Science, 18(2):189–206, 2006.
  • Oudeyer, P.-Y. and Smith, L. B. How evolution may work through curiosity-driven developmental process. Topics in Cognitive Science, 8(2):492–502, 2016.
  • Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), volume 2017, 2017.
  • Pearl, J. Structural counterfactuals: A brief introduction. Cognitive science, 37(6):977–985, 2013.
  • Pearl, J., Glymour, M., and Jewell, N. P. Causal inference in statistics: a primer. John Wiley & Sons, 2016.
  • Peysakhovich, A. and Lerer, A. Consequentialist conditional cooperation in social dilemmas with imperfect information. arXiv preprint arXiv:1710.06975, 2017.
  • Peysakhovich, A. and Lerer, A. Prosocial learning agents solve generalized stag hunts better than selfish ones. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 2043–2044. International Foundation for Autonomous Agents and Multiagent Systems, 2018.
  • Rabinowitz, N. C., Perbet, F., Song, H. F., Zhang, C., Eslami, S., and Botvinick, M. Machine theory of mind. arXiv preprint arXiv:1802.07740, 2018.
  • Schmidhuber, J. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010.
  • Sequeira, P., Melo, F. S., Prada, R., and Paiva, A. Emerging social awareness: Exploring intrinsic motivation in multiagent learning. In Development and Learning (ICDL), 2011 IEEE International Conference on, volume 2, pp. 1–6. IEEE, 2011.
  • Singh, S. P., Barto, A. G., and Chentanez, N. Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems 17 [Neural Information Processing Systems, NIPS 2004, December 13-18, 2004, Vancouver, British Columbia, Canada], pp. 1281–1288, 2004.
  • Stavropoulos, K. K. and Carver, L. J. Research review: social motivation and oxytocin in autism–implications for joint attention development and intervention. Journal of Child Psychology and Psychiatry, 54(6):603–618, 2013.
  • Strouse, D., Kleiman-Weiner, M., Tenenbaum, J., Botvinick, M., and Schwab, D. Learning to share and hide intentions using information regularization. arXiv preprint arXiv:1808.02093, 2018.
  • Tajima, S., Yanagawa, T., Fujii, N., and Toyoizumi, T. Untangling brain-wide dynamics in consciousness by cross-embedding. PLoS computational biology, 11(11): e1004537, 2015.
  • Tomasello, M. Why we cooperate. MIT press, 2009.
  • van Schaik, C. P. and Burkart, J. M. Social learning and evolution: the cultural intelligence hypothesis. Philosophical Transactions of the Royal Society B: Biological Sciences, 366(1567):1008–1016, 2011.
  • von Frisch, K. The dance language and orientation of bees. 5, 06 1969.