Learning with AMIGo: Adversarially Motivated Intrinsic Goals

International Conference on Learning Representations, 2021.

TL;DR:
A "constructively adversarial" teacher-student setup can augment on-policy algorithms to better solve difficult exploration tasks in RL.

Abstract:

A key challenge for reinforcement learning (RL) consists of learning in environments with sparse extrinsic rewards. In contrast to current RL methods, humans are able to learn new skills with little or no reward by using various forms of intrinsic motivation. We propose AMIGo, a novel agent incorporating, as a form of meta-learning, a goal-generating teacher that proposes Adversarially Motivated Intrinsic Goals to train a goal-conditioned student agent.
Introduction
  • The success of Deep Reinforcement Learning (RL) on a wide range of tasks, while impressive, has so far been mostly confined to scenarios with reasonably dense rewards (e.g. Mnih et al., 2016; Vinyals et al., 2019), or to those where a perfect model of the environment can be used for search, such as the game of Go and others (e.g. Silver et al., 2016; Duan et al., 2016; Moravčík et al., 2017).
  • Children devote much of their time to play, generating objectives and posing challenges to themselves as a form of intrinsic motivation.
  • Solving such self-proposed tasks encourages them to explore, experiment, and invent; sometimes, as in many games and fantasies, without any direct link to reality or to any source of extrinsic reward.
  • This kind of intrinsic motivation might be a crucial feature to enable learning in real-world environments (Schulz, 2012).
Highlights
  • To address this discrepancy between naïve deep RL exploration strategies and human capabilities, we present a novel method which learns to propose Adversarially Motivated Intrinsic Goals (AMIGo).
  • We make the following contributions: (i) we propose Adversarially Motivated Intrinsic GOals (AMIGo), an approach for learning a teacher that generates increasingly harder goals (a hedged sketch of this idea follows this list), (ii) we show, through 114 experiments on 6 challenging exploration tasks in procedurally generated environments, that agents trained with AMIGo gradually learn to interact with the environment and solve tasks which are too difficult for state-of-the-art methods, and (iii) we perform an extensive qualitative analysis and ablation study.
  • We propose AMIGo, a framework that generates a natural curriculum of goals which, in the form of intrinsic rewards, help train an agent and supplement the extrinsic reward.
  • We demonstrate that AMIGo surpasses state-of-the-art intrinsic motivation methods on challenging procedurally-generated tasks, in a comprehensive comparison against multiple competitive baselines spanning 114 experiments across 6 tasks.
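
To make contribution (i) more concrete, the Python sketch below illustrates one plausible form of the "constructively adversarial" teacher objective: the teacher is rewarded when the student reaches the proposed goal but only after at least t_star environment steps, it is penalized when the goal is reached too quickly or not at all, and the difficulty threshold t_star is raised as the teacher keeps succeeding. The constants alpha and beta and the threshold schedule are illustrative assumptions, not the paper's hyperparameters.

    # Hedged sketch of a "constructively adversarial" teacher reward.
    # alpha, beta, and the schedule for raising t_star are assumed values
    # chosen for illustration; they are not taken from the paper.
    from dataclasses import dataclass

    @dataclass
    class TeacherState:
        t_star: int = 1        # current difficulty threshold, in student steps
        successes: int = 0     # count of sufficiently hard goals proposed so far
        patience: int = 10     # raise t_star after this many such goals

    def teacher_reward(state: TeacherState, reached_goal: bool,
                       steps_to_goal: int, alpha: float = 1.0,
                       beta: float = 0.5) -> float:
        """Reward goals the student can reach, but not too easily."""
        if reached_goal and steps_to_goal >= state.t_star:
            state.successes += 1
            if state.successes >= state.patience:  # goals remain achievable,
                state.t_star += 1                  # so demand harder ones
                state.successes = 0
            return alpha       # challenging yet achievable goal
        return -beta           # goal was too easy, or was never reached
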
Methods
  • The authors follow Raileanu & Rocktäschel (2020) and evaluate the models on several challenging procedurally-generated environments from MiniGrid (Chevalier-Boisvert et al., 2018).
  • MiniGrid provides a good testbed for exploration in RL, since the observations are symbolic rather than high-dimensional, which helps to disentangle the problem of exploration from that of visual understanding.
  • Episodes end when the goal is reached, and the scale of the positive reward encourages agents to reach the goal as quickly as possible (see the sketch below).
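
As a hedged illustration of that reward scale, the Python snippet below follows gym-minigrid's default of discounting the success reward by episode length (1 - 0.9 * step_count / max_steps); the example numbers are only for illustration.

    # Sketch of MiniGrid's length-discounted success reward: reaching the goal
    # quickly yields a reward near 1, reaching it at the end of the step budget
    # yields about 0.1, and failing to reach it yields 0.
    def minigrid_extrinsic_reward(reached_goal: bool, step_count: int,
                                  max_steps: int) -> float:
        if not reached_goal:
            return 0.0
        return 1.0 - 0.9 * (step_count / max_steps)

    # Example: reaching the goal in 40 of 360 allowed steps gives a reward of 0.9.
    assert abs(minigrid_extrinsic_reward(True, 40, 360) - 0.9) < 1e-9
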
Results
  • The authors summarize the main results of the experiments in Table 1.
  • As discussed in Section 4.3, the reported result for each baseline and each environment is that of the best performing configuration for the policy and intrinsic motivation system for that environment, as reported in Tables 2–5 of Appendix A.
  • This aggregation of 114 experiments ensures that each baseline is given the opportunity to perform in its best setting, in order to fairly benchmark the performance of AMIGO.
  • For IMPALA, the numbers reported for KCmedium and OMmedium are from the experiments in Raileanu & Rocktäschel (2020), while the numbers for the harder environments are presumed to be 0.00 because IMPALA fails to train even on the simpler environments.
Conclusion
  • The authors propose AMIGo, a framework that generates a natural curriculum of goals which, in the form of intrinsic rewards, help train an agent and supplement the extrinsic reward.
  • The choice of goal type imposes certain constraints on the nature of the observation: because goals are provided as absolute coordinates, both the teacher and the student need to fully observe the environment (see the sketch after this list).
  • This method could be applied to partially observed environments where part of the full observation is uncertain or occluded.
  • The authors are confident they have proved the concept in a meaningful way, and that other researchers will already be able to adapt it to their model and RL algorithm of choice, in their domain of choice.
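
To make the observability constraint above concrete, the following minimal Python sketch (an illustration under assumptions, not the paper's implementation) shows the kind of goal check implied by absolute-coordinate goals: the student earns intrinsic reward for standing on the proposed (x, y) cell, which presupposes that both teacher and student can locate that cell, i.e. a full view of the grid.

    from typing import Tuple

    # Hedged sketch: intrinsic reward for a goal given as an absolute (x, y)
    # grid cell, as in a fully observed grid. Under partial observability the
    # proposed cell may not even be visible to the student, which is the
    # limitation discussed above.
    def student_intrinsic_reward(agent_pos: Tuple[int, int],
                                 goal_pos: Tuple[int, int]) -> float:
        return 1.0 if agent_pos == goal_pos else 0.0

    # Example: the student is rewarded only upon reaching the proposed cell.
    assert student_intrinsic_reward((3, 5), (3, 5)) == 1.0
    assert student_intrinsic_reward((2, 5), (3, 5)) == 0.0
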
Tables
  • Table 1: Comparison of Mean Extrinsic Reward at the end of training (averaging over a batch of episodes as in IMPALA). Each entry shows the result of the best observation configuration, for each baseline, from Tables 2–5 of Appendix A
  • Table 2: Fully observed intrinsic reward, fully observed policy
  • Table 3: Partially observed intrinsic reward, fully observed policy
  • Table 4: Fully observed intrinsic reward, partially observed policy
  • Table 5: Partially observed intrinsic reward, partially observed policy
  • Table 6: Ablations and alternatives. Number of steps (in millions) for each model to reach its final level of reward in the different environments (0 means the model did not learn to obtain any extrinsic reward). FULL MODEL is the main algorithm described above. NOEXTRINSIC does not provide any extrinsic reward to the teacher. NOENVCHANGE removes the reward for selecting goals that change as a result of episode resets. WITHNOVELTY adds a novelty bonus that decreases with the number of times an object has been successfully proposed. GAUSSIAN and LINEAR-EXPONENTIAL explore alternative reward functions for the teacher (a hedged illustration of these alternatives follows below)
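
Since the exact functional forms of the GAUSSIAN and LINEAR-EXPONENTIAL variants are not reproduced here, the Python sketch below only illustrates the general idea they share with the full model: shaping the teacher's reward as a function of how many steps t the student needed to reach the proposed goal, relative to a target difficulty t_star. The specific shapes and constants are assumptions made for illustration.

    import math

    # Hedged illustration of teacher reward shapes as a function of the number
    # of steps t the student took to reach the goal, relative to a target
    # difficulty t_star. Shapes and constants are assumed, not the paper's.
    def threshold_reward(t: int, t_star: int, alpha: float = 1.0,
                         beta: float = 0.5) -> float:
        return alpha if t >= t_star else -beta                    # hard cutoff

    def gaussian_reward(t: int, t_star: int, sigma: float = 5.0) -> float:
        return math.exp(-((t - t_star) ** 2) / (2 * sigma ** 2))  # peak at t_star

    def linear_exponential_reward(t: int, t_star: int, rate: float = 0.1) -> float:
        # rise linearly up to the target difficulty, then decay exponentially
        return t / t_star if t < t_star else math.exp(-rate * (t - t_star))
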
References
  • Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. Advances in Neural Information Processing Systems, 30:5048–5058, 2017.
  • Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Rémi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471–1479, 2016.
  • Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41–48. ACM, 2009.
  • Yuri Burda, Harri Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, and Alexei A. Efros. Large-scale study of curiosity-driven learning. In International Conference on Learning Representations, 2019a. URL https://openreview.net/forum?id=rJNwDjAqYX.
  • Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. In International Conference on Learning Representations, 2019b. URL https://openreview.net/forum?id=H1lJJnR5Ym.
  • Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. Minimalistic gridworld environment for OpenAI Gym. https://github.com/maximecb/gym-minigrid, 2018.
  • Karl Cobbe, Christopher Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation to benchmark reinforcement learning. arXiv preprint arXiv:1912.01588, 2019.
  • Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. arXiv preprint arXiv:1604.06778, 2016.
  • Lasse Espeholt, Hubert Soyer, Rémi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.
  • Meng Fang, Tianyi Zhou, Yali Du, Lei Han, and Zhengyou Zhang. Curriculum-guided hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 12623–12634, 2019.
  • Carlos Florensa, David Held, Markus Wulfmeier, Michael Zhang, and Pieter Abbeel. Reverse curriculum generation for reinforcement learning. arXiv preprint arXiv:1707.05300, 2017.
  • Carlos Florensa, David Held, Xinyang Geng, and Pieter Abbeel. Automatic goal generation for reinforcement learning agents. In Proceedings of the 35th International Conference on Machine Learning, 2018. URL http://proceedings.mlr.press/v80/florensa18a.html.
  • John Foley, Emma Tosch, Kaleigh Clary, and David Jensen. Toybox: Better Atari environments for testing reinforcement learning agents. arXiv preprint arXiv:1812.02850, 2018.
  • Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems 27, pp. 2672–2680. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf.
  • Alexander S Klyubin, Daniel Polani, and Chrystopher L Nehaniv. Empowerment: A universal agent-centric measure of control. In 2005 IEEE Congress on Evolutionary Computation, volume 1, pp. 128–135. IEEE, 2005.
  • Heinrich Küttler, Nantas Nardelli, Thibaut Lavril, Marco Selvatici, Viswanath Sivakumar, Tim Rocktäschel, and Edward Grefenstette. TorchBeast: A PyTorch platform for distributed RL. arXiv preprint arXiv:1910.03552, 2019.
  • Heinrich Küttler, Nantas Nardelli, Roberta Raileanu, Marco Selvatici, Edward Grefenstette, and Tim Rocktäschel. The NetHack Learning Environment. In Workshop on Beyond Tabula Rasa in Reinforcement Learning (BeTR-RL), 2020. URL https://github.com/facebookresearch/nle.
  • Nicolas Lair, Cédric Colas, Rémy Portelas, Jean-Michel Dussoux, Peter Ford Dominey, and Pierre-Yves Oudeyer. Language grounding through social interactions and curiosity-driven multi-goal learning. arXiv preprint arXiv:1911.03219, 2019.
  • Marlos C Machado, Marc G Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the Arcade Learning Environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.
  • Tambet Matiisen, Avital Oliver, Taco Cohen, and John Schulman. Teacher-student curriculum learning. IEEE Transactions on Neural Networks and Learning Systems, 2017.
  • Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.
  • Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. DeepStack: Expert-level artificial intelligence in no-limit poker. arXiv preprint arXiv:1701.01724, 2017.
  • Sanmit Narvekar, Bei Peng, Matteo Leonetti, Jivko Sinapov, Matthew E Taylor, and Peter Stone. Curriculum learning for reinforcement learning domains: A framework and survey. arXiv preprint arXiv:2003.04960, 2020.
  • Georg Ostrovski, Marc G Bellemare, Aaron van den Oord, and Rémi Munos. Count-based exploration with neural density models. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2721–2730. JMLR.org, 2017.
  • Pierre-Yves Oudeyer and Frédéric Kaplan. What is intrinsic motivation? A typology of computational approaches. Frontiers in Neurorobotics, 1:6, 2009.
  • Pierre-Yves Oudeyer, Frédéric Kaplan, and Verena V Hafner. Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation, 11(2):265–286, 2007.
  • Deepak Pathak, Pulkit Agrawal, Alexei Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–17, 2017.
  • Vitchyr H Pong, Murtaza Dalal, Steven Lin, Ashvin Nair, Shikhar Bahl, and Sergey Levine. Skew-Fit: State-covering self-supervised reinforcement learning. arXiv preprint arXiv:1903.03698, 2019.
  • Rémy Portelas, Cédric Colas, Katja Hofmann, and Pierre-Yves Oudeyer. Teacher algorithms for curriculum learning of deep RL in continuously parameterized environments. In Conference on Robot Learning, pp. 835–853. PMLR, 2020a.
  • Rémy Portelas, Cédric Colas, Lilian Weng, Katja Hofmann, and Pierre-Yves Oudeyer. Automatic curriculum learning for deep RL: A short survey. arXiv preprint arXiv:2003.04664, 2020b.
  • Sébastien Racanière, Andrew K Lampinen, Adam Santoro, David P Reichert, Vlad Firoiu, and Timothy P Lillicrap. Automated curricula through setter-solver interactions. arXiv preprint arXiv:1909.12892, 2019.
  • Roberta Raileanu and Tim Rocktäschel. RIDE: Rewarding impact-driven exploration for procedurally-generated environments. In International Conference on Learning Representations, 2020.
  • Aravind Rajeswaran, Kendall Lowrey, Emanuel V Todorov, and Sham M Kakade. Towards generalization and simplicity in continuous control. In Advances in Neural Information Processing Systems, pp. 6550–6561, 2017.
  • Sebastian Risi and Julian Togelius. Procedural content generation: From automatically generating game levels to increasing generality in machine learning. arXiv preprint arXiv:1911.13071, 2019.
  • Jürgen Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In Proceedings of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pp. 222–227, 1991.
  • Jürgen Schmidhuber. PowerPlay: Training an increasingly general problem solver by continually searching for the simplest still unsolvable problem. Frontiers in Psychology, 2011.
  • David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
  • Satinder Singh, Richard L Lewis, and Andrew G Barto. Where do rewards come from? In Proceedings of the Annual Conference of the Cognitive Science Society, pp. 2601–2606. Cognitive Science Society, 2009.
  • Jonathan Sorg, Satinder P Singh, and Richard L Lewis. Internal rewards mitigate agent boundedness. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 1007–1014, 2010.
  • Bradly C Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.
  • Alexander L Strehl and Michael L Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
  • Sainbayar Sukhbaatar, Zeming Lin, Ilya Kostrikov, Gabriel Synnaeve, Arthur Szlam, and Rob Fergus. Intrinsic motivation and automatic curricula via asymmetric self-play. arXiv preprint arXiv:1703.05407, 2017.
  • Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michael Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
  • Chang Ye, Ahmed Khalifa, Philip Bontrager, and Julian Togelius. Rotation, translation, and cropping for zero-shot generalization. arXiv preprint arXiv:2001.09908, 2020. URL https://arxiv.org/abs/2001.09908.
  • Amy Zhang, Nicolas Ballas, and Joelle Pineau. A dissection of overfitting and generalization in continuous reinforcement learning. arXiv preprint arXiv:1806.07937, 2018.
  • Yunzhi Zhang, Pieter Abbeel, and Lerrel Pinto. Automatic curriculum learning through value disagreement, 2020.
  • Zeyu Zheng, Junhyuk Oh, and Satinder Singh. On learning intrinsic rewards for policy gradient methods. In Advances in Neural Information Processing Systems, pp. 4644–4654, 2018.
  • Victor Zhong, Tim Rocktäschel, and Edward Grefenstette. RTFM: Generalising to new environment dynamics via reading. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SJgob6NKvH.