Learning to Teach in Cooperative Multiagent Reinforcement Learning

AAAI Conference on Artificial Intelligence, 2019.

In contrast to existing works, this paper presents the first general framework and algorithm for intelligent agents to learn to teach in a multiagent environment.

Abstract:

Collective human knowledge has clearly benefited from the fact that innovations by individuals are taught to others through communication. Similar to human social groups, agents in distributed learning systems would likely benefit from communication to share knowledge and teach skills. The problem of teaching to improve agent learning has...

Introduction
  • Innovations by individuals are taught to others in the population through communication channels (Rogers 2010), which improves both final performance and the effectiveness of the entire learning process.
  • Following advising, agents should ideally have learned effective task-level policies that no longer rely on teammate advice at every timestep.
  • By learning to appropriately transform local knowledge into action advice, teachers can affect students’ experiences and their resulting task-level policy updates.
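To make the advising mechanism concrete, the following minimal tabular sketch (hypothetical names; not the paper's implementation) shows how an advised action enters the student's experience and is then consumed by a standard off-policy Q-learning update, which is why the framework only asks that learners support off-policy exploration.

    import random
    from collections import defaultdict

    # Minimal tabular sketch (hypothetical names, not the paper's code): a student
    # executes an advised action and still applies its usual off-policy Q-learning
    # update to the resulting experience.
    class TabularStudent:
        def __init__(self, n_actions, alpha=0.1, gamma=0.95, eps=0.1):
            self.q = defaultdict(lambda: [0.0] * n_actions)
            self.n_actions, self.alpha, self.gamma, self.eps = n_actions, alpha, gamma, eps

        def act(self, obs):
            # Local epsilon-greedy policy, used when no advice is received.
            if random.random() < self.eps:
                return random.randrange(self.n_actions)
            return max(range(self.n_actions), key=lambda a: self.q[obs][a])

        def update(self, obs, action, reward, next_obs):
            # Off-policy target: max over next actions, regardless of whether
            # `action` came from the local policy or from a teammate's advice.
            target = reward + self.gamma * max(self.q[next_obs])
            self.q[obs][action] += self.alpha * (target - self.q[obs][action])

    def student_step(student, advice, obs, env_step):
        """Execute the advised action if one was given, else the local policy,
        then learn off-policy from the experience either way."""
        action = advice if advice is not None else student.act(obs)
        reward, next_obs = env_step(obs, action)  # environment transition (stub)
        student.update(obs, action, reward, next_obs)
        return next_obs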
Highlights
  • In social settings, innovations by individuals are taught to others in the population through communication channels (Rogers 2010), which improves both final performance and the effectiveness of the entire learning process
  • Similar to human social groups, these learning agents would likely benefit from communication to share knowledge and teach skills, thereby improving the effectiveness of systemwide learning
  • Our work targets cooperative Multiagent Reinforcement Learning (MARL), where agents execute actions that jointly affect the environment and receive feedback via local observations and a shared reward. This setting is formalized as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP), defined as the tuple ⟨I, S, A, T, R, Ω, O, γ⟩ (Oliehoek and Amato 2016); I is the set of n agents, S is the state space, A = ×iAi is the joint action space, and Ω = ×iΩi is the joint observation space (a minimal container for this tuple is sketched after this list)
  • Given its observation history hit, agent i executes actions dictated by its policy ai = πi(hit)
  • The advising-level learning nature of our problem makes these domains challenging, despite their visual simplicity; their complexity is comparable to domains tested in recent MARL works that learn over multiagent learning processes (Foerster et al. 2018), which consider two-agent repeated/gridworld games
  • We use the VEG advising-level reward in the final version of our Learning to Coordinate and Teach Reinforcement (LeCTR) algorithm, but show results for all advising rewards for completeness. We report both final task-level performance after teaching, V, and area under the task-level learning curve (AUC) as a measure of the rate of learning; higher values are better for both
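The Dec-POMDP tuple referenced above can be held in a simple container; the sketch below is illustrative only, with field names mirroring the notation and the transition/reward/observation functions left as stubs.

    from typing import Callable, NamedTuple, Sequence, Tuple

    # Illustrative container for the Dec-POMDP tuple <I, S, A, T, R, Omega, O, gamma>
    # described above; not the paper's implementation.
    class DecPOMDP(NamedTuple):
        agents: Sequence[int]                                 # I: the n agents
        states: Sequence[int]                                 # S: state space
        joint_actions: Sequence[Tuple[int, ...]]              # A = x_i A^i
        transition: Callable[[int, Tuple[int, ...]], int]     # T(s, a) -> s'
        reward: Callable[[int, Tuple[int, ...]], float]       # shared reward R(s, a)
        joint_observations: Sequence[Tuple[int, ...]]         # Omega = x_i Omega^i
        observe: Callable[[int, Tuple[int, ...]], Tuple[int, ...]]  # O(s', a) -> o
        gamma: float                                          # discount factor

    def act(policy: Callable, history: Sequence) -> int:
        """ai = pi_i(hit): each agent acts on its own observation history."""
        return policy(tuple(history))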
Results
  • Action advising makes few assumptions: learners need only use task-level algorithms Li, Lj that support off-policy exploration and receive advising-level observations summarizing teammates' learning progress.
  • Agents must learn to advise despite the nonstationarities caused by changing task-level policies, which are a function of the algorithms Li, Lj and the policy parameterizations θi, θj.
  • Phase II: train the student/teacher (advising) policies, using the task-level learning rate as reward
  • The objective is to learn advising policies that augment agents’ task-level algorithms Li, Lj to accelerate solving of PTask.
  • LeCTR uses distinct advising-level observations for student and teacher policies.
  • Student policy πSi for agent i decides when to request advice using advising-level observation oiS = ⟨oi, Qi(oi; hi)⟩, where oi and Qi(oi; hi) are the agent's task-level observation and action-value vector, respectively (see the sketch after this list).
  • In LeCTR, advising policies are trained to maximize advising-level rewards that ideally reflect the objective of accelerating task-level learning.
  • Each reward corresponds to a different measure of task-level learning after the student executes an advised action and uses it to update its task-level policy.
  • To induce the advising policies to learn to teach both agents i and j, the authors use a centralized action-value function (i.e., a 'critic') with advising-level reward r = rTi + rTj.
  • The advising-level learning nature of the problem makes these domains challenging, despite their visual simplicity; their complexity is comparable to domains tested in recent MARL works that learn over multiagent learning processes (Foerster et al. 2018), which consider two-agent repeated/gridworld games.
  • The left plot shows task-level return received due to both local policy actions and advised actions, which increases as teachers learn.
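The sketch below, referenced in the bullets above, illustrates one advising exchange and the joint advising-level reward; all names are hypothetical, and the exact advising-level observations and reward bookkeeping in LeCTR differ in detail (see Table 1 and the paper).

    import numpy as np

    # Rough sketch of one advising exchange (all names hypothetical).
    def student_observation(task_obs, q_student):
        # Advising-level observation oS = <o, Q(o)>: the task-level observation
        # concatenated with the student's current action-value vector.
        return np.concatenate([task_obs, q_student])

    def advising_exchange(student_policy, teacher_policy, task_obs_i, q_i, q_j):
        """Agent i (student) decides whether to request advice; agent j (teacher)
        maps its knowledge of i's situation into an advised task-level action."""
        o_s = student_observation(task_obs_i, q_i)
        if not student_policy(o_s):          # e.g., a Bernoulli request decision
            return None                      # no request: i acts with its own policy
        o_t = np.concatenate([task_obs_i, q_i, q_j])  # teacher also sees its own values
        return teacher_policy(o_t)           # advised action for agent i

    def joint_advising_reward(r_teach_i, r_teach_j):
        # The centralized critic is trained with r = rTi + rTj, so advising
        # policies are rewarded for accelerating both agents' task-level learning.
        return r_teach_i + r_teach_j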
Conclusion
  • In the Repeated game, LeCTR attains the best performance in terms of final value and rate of learning (AUC), as agents learn to advise each other to take the appropriate high-value opposite actions.
  • Fig. 4b shows the improvement of LeCTR's advising policies over the course of training, measured by the number of task-level episodes needed to converge to the maximum value reached, V∗.
  • LeCTR uses agents' task-level learning progress as advising-policy feedback, training advisors that improve the rate of learning without harming final performance.
Tables
  • Table 1: Summary of the rewards used to train advising policies. Rewards are shown for the case where agent i is the student and agent j the teacher (flip the indices for the reverse case). Each reward corresponds to a different measure of task-level learning after the student executes an advised action and uses it to update its task-level policy. Refer to the supplementary material for more details
  • Table 2: Final value V and area under the curve (AUC) for teaching algorithms. Best results in bold (computed via a t-test with p < 0.05). Independent Q-learning corresponds to the no-teaching case. †The final version of LeCTR uses the VEG advising-level reward
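As a toy illustration of the metrics in Table 2, assuming the learning curve is simply a sequence of per-episode task-level returns (the paper's exact normalization may differ), the AUC can be computed with a trapezoidal rule:

    import numpy as np

    # Toy illustration of the two metrics in Table 2 (assumed learning-curve format).
    returns = np.array([0.0, 0.2, 0.5, 0.8, 1.0, 1.0])  # example learning curve
    auc = np.trapz(returns, dx=1.0)   # area under the curve: rate of learning
    final_v = returns[-1]             # V: final task-level performance after teaching
    print(f"AUC = {auc:.2f}, V = {final_v:.2f}")  # faster learners -> larger AUC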
Related work
  • Effective diffusion of knowledge has been studied in many fields, including inverse reinforcement learning (Ng and Russell 2000), apprenticeship learning (Abbeel and Ng 2004), and learning from demonstration (Argall et al. 2009), wherein students discern and emulate key demonstrated behaviors. Works on curriculum learning (Bengio et al. 2009) are also related, particularly automated curriculum learning (Graves et al. 2017). Though Graves et al. focus on single-student supervised/unsupervised learning, they highlight interesting measures of learning progress that are also used here. Several works meta-learn active learning policies for supervised learning (Bachman, Sordoni, and Trischler 2017; Fang, Li, and Cohn 2017; Pang, Dong, and Hospedales 2018; Fan et al. 2018). Our work also uses advising-level meta-learning, but in the regime of MARL, where agents must learn to advise teammates without destabilizing coordination.

    In action advising, a student executes actions suggested by a teacher, who is typically an expert that always advises the optimal action (Torrey and Taylor 2013). These works typically use the state importance value I(s, â) = maxa Q(s, a) − Q(s, â) to decide when to advise, estimating the performance gap between the best available action and the student's intended (or worst-case) action â. In student-initiated approaches such as Ask Uncertain (Clouse 1996) and Ask Important (Amir et al. 2016), the student decides when to request advice using heuristics based on I(s, â). In teacher-initiated approaches such as Importance Advising (Torrey and Taylor 2013), Early Correcting (Amir et al. 2016), and Correct Important (Torrey and Taylor 2013), the teacher decides when to advise by comparing the student policy πS to the expert policy πT. Q-Teaching (Fachantidis, Taylor, and Vlahavas 2017) learns when to advise by rewarding the teacher with I(s, â) when it advises. See the supplementary material for details of these approaches.
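For concreteness, a minimal sketch of this importance-based heuristic follows; the function names and threshold are illustrative, and the methods above differ in whether the student or the teacher initiates advising and in how the advice budget is managed.

    import numpy as np

    # Sketch of the state-importance heuristic, I(s, a_hat) = max_a Q(s, a) - Q(s, a_hat),
    # where Q is the (expert) teacher's action-value function and a_hat is the
    # student's intended action. Names and the threshold are illustrative.
    def importance(teacher_q, intended_action):
        return np.max(teacher_q) - teacher_q[intended_action]

    def importance_advising(teacher_q, intended_action, threshold=0.5):
        """Teacher-initiated advising: advise only in states deemed important,
        which conserves a limited advice budget."""
        if importance(teacher_q, intended_action) > threshold:
            return int(np.argmax(teacher_q))  # advise the teacher's greedy action
        return None                           # otherwise stay silent

    # Example: the teacher strongly prefers action 2 while the student intends action 0.
    q = np.array([0.1, 0.3, 1.2])
    print(importance_advising(q, intended_action=0))  # -> 2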
Funding
  • Research funded by IBM (as part of the MIT-IBM Watson AI Lab initiative), with computational support through Amazon Web Services
  • Dong-Ki Kim was also supported by a Kwanjeong Educational Foundation Fellowship
Reference
  • Abbeel, P., and Ng, A. Y. 2004. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning. ACM.
  • Amir, O.; Kamar, E.; Kolobov, A.; and Grosz, B. J. 2016. Interactive teaching strategies for agent training. In International Joint Conference on Artificial Intelligence.
  • Argall, B. D.; Chernova, S.; Veloso, M.; and Browning, B. 2009. A survey of robot learning from demonstration. Robotics and Autonomous Systems 57(5):469–483.
  • Bachman, P.; Sordoni, A.; and Trischler, A. 2017. Learning algorithms for active learning. In International Conference on Machine Learning, 301–310.
  • Bengio, Y.; Louradour, J.; Collobert, R.; and Weston, J. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, 41–48. ACM.
  • Clouse, J. A. 1996. On integrating apprentice learning and reinforcement learning.
  • da Silva, F. L.; Glatt, R.; and Costa, A. H. R. 2017. Simultaneously learning and advising in multiagent reinforcement learning. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, 1100–1108. International Foundation for Autonomous Agents and Multiagent Systems.
  • Fachantidis, A.; Taylor, M. E.; and Vlahavas, I. 2017. Learning to teach reinforcement learning agents. Machine Learning and Knowledge Extraction 1(1):2.
  • Fan, Y.; Tian, F.; Qin, T.; Li, X.-Y.; and Liu, T.-Y. 2018. Learning to teach. In International Conference on Learning Representations.
  • Fang, M.; Li, Y.; and Cohn, T. 2017. Learning how to active learn: A deep reinforcement learning approach. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 595–605.
  • Foerster, J.; Assael, I. A.; de Freitas, N.; and Whiteson, S. 2016. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, 2137–2145.
  • Foerster, J. N.; Chen, R. Y.; Al-Shedivat, M.; Whiteson, S.; Abbeel, P.; and Mordatch, I. 2018. Learning with opponent-learning awareness. In Proceedings of the 17th Conference on Autonomous Agents and MultiAgent Systems. International Foundation for Autonomous Agents and Multiagent Systems.
  • Graves, A.; Bellemare, M. G.; Menick, J.; Munos, R.; and Kavukcuoglu, K. 2017. Automated curriculum learning for neural networks. In International Conference on Machine Learning, 1311–1320.
  • Hadfield-Menell, D.; Russell, S. J.; Abbeel, P.; and Dragan, A. 2016. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, 3909–3917.
  • Le, H. M.; Yue, Y.; Carr, P.; and Lucey, P. 2017. Coordinated multi-agent imitation learning. In International Conference on Machine Learning, 1995–2003.
  • Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; and Mordatch, I. 2017. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, 6382–6393.
  • Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529.
  • Ng, A. Y., and Russell, S. J. 2000. Algorithms for inverse reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, 663–670. Morgan Kaufmann Publishers Inc.
  • Oliehoek, F. A., and Amato, C. 2016. A Concise Introduction to Decentralized POMDPs, volume 1. Springer.
  • Pang, K.; Dong, M.; and Hospedales, T. 2018. Meta-learning transferable active learning policies by deep reinforcement learning.
  • Rogers, E. M. 2010. Diffusion of Innovations. Simon and Schuster.
  • Sukhbaatar, S.; Fergus, R.; et al. 2016. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, 2244–2252.
  • Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge.
  • Sutton, R. S.; McAllester, D. A.; Singh, S. P.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 1057–1063.
  • Taylor, A.; Dusparic, I.; Galvan-Lopez, E.; Clarke, S.; and Cahill, V. 2013. Transfer learning in multi-agent systems through parallel transfer. In Workshop on Theoretically Grounded Transfer Learning at the 30th International Conference on Machine Learning (Poster), volume 28, 28. Omnipress.
  • Taylor, M. E.; Carboni, N.; Fachantidis, A.; Vlahavas, I.; and Torrey, L. 2014. Reinforcement learning agents providing advice in complex video games. Connection Science 26(1):45–63.
  • Torrey, L., and Taylor, M. 2013. Teaching on a budget: Agents advising agents in reinforcement learning. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, 1053–1060. International Foundation for Autonomous Agents and Multiagent Systems.
  • Wang, Y.; Lu, W.; Hao, J.; Wei, J.; and Leung, H.-F. 2018. Efficient convention emergence through decoupled reinforcement social learning with teacher-student mechanism. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, 795–803. International Foundation for Autonomous Agents and Multiagent Systems.
  • Weisstein, E. W. 2004. Bonferroni correction.
  • Zimmer, M.; Viappiani, P.; and Weng, P. 2014. Teacher-student framework: a reinforcement learning approach. In AAMAS Workshop on Autonomous Robots and Multirobot Systems.
Best Paper
Best Paper of AAAI, 2019