What Can Learned Intrinsic Rewards Capture?

ICML 2020, pp. 11436–11446

Abstract

Reinforcement learning agents can include different components, such as policies, value functions, state representations, and environment models. Any or all of these can be the loci of knowledge, i.e., structures where knowledge, whether given or learned, can be deposited and reused. The objective of an agent is to behave so as to maximise …

Introduction
  • Most existing work on intrinsic rewards falls into two broad categories: task-dependent and task-independent.
  • Both are typically designed by hand.
  • Task-independent intrinsic rewards are typically hand-designed, often based on an intuitive understanding of animal/human behaviour or on heuristics about desired exploratory behaviour.
  • It can, however, be hard to match such task-independent intrinsic rewards to the specific learning dynamics induced by the interaction between agent and environment.
  • Instead of comparing different loci of knowledge, the purpose of this paper is to show that it is feasible to capture useful learned knowledge in rewards and to study the kinds of knowledge that can be captured.
Highlights
  • Most existing work on intrinsic rewards falls into two broad categories: task-dependent and task-independent.
  • Our method can be interpreted as an instance of RL² with a particular decomposition of parameters (θ and η), using policy gradient as the recurrent update. While this modular structure may not be more beneficial than RL² when evaluated with the same agent-environment interface, the decomposition gives each module clear semantics: the policy (θ) captures “how to do” while the intrinsic reward (η) captures “what to do”, and this enables the interesting kinds of generalisation we show below (see the sketch after this list).
  • Generalisation to unseen learning algorithms: we further investigated how general the knowledge captured by the intrinsic reward is by evaluating the learned intrinsic reward on agents with different learning algorithms.
  • Through several proof-of-concept experiments, we showed that the learned non-stationary intrinsic reward can capture regularities within a distribution of environments or, over time, within a non-stationary environment.
  • We showed that the learned intrinsic rewards can generalise to different agent-environment interfaces, such as different action spaces and different learning algorithms, whereas policy transfer methods fail to generalise.
  • The flexibility and range of knowledge captured by intrinsic rewards in our proof-of-concept experiments encourages further work towards combining different loci of knowledge to achieve greater practical benefits.
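To make the θ/η decomposition concrete, here is a minimal sketch of the inner/outer-loop structure it implies, assuming a one-state K-armed bandit, a per-action intrinsic reward vector eta, and a single analytic policy-gradient inner step. The paper's method instead uses an LSTM-based reward network, sampled episodes, and a lifetime return, so the names below (r_ext, inner_update, outer_objective) are illustrative assumptions, not the authors' implementation.

    # Bilevel sketch: the policy learns only from the intrinsic reward (inner loop),
    # while the intrinsic reward is meta-learned to maximise extrinsic return (outer loop).
    import jax
    import jax.numpy as jnp

    K = 4                                    # number of actions
    r_ext = jnp.array([0.0, 0.0, 1.0, 0.0])  # extrinsic reward per action (the task)

    def policy(theta):
        return jax.nn.softmax(theta)

    def inner_update(theta, eta, lr=1.0):
        # Inner loop ("how to do"): policy-gradient ascent on the intrinsic return J_int.
        j_int = lambda th: jnp.dot(policy(th), eta)
        return theta + lr * jax.grad(j_int)(theta)

    def outer_objective(eta, theta):
        # Outer loop ("what to do"): extrinsic return of the policy after it has
        # been updated with the intrinsic reward, i.e. J_ext(theta'(eta)).
        return jnp.dot(policy(inner_update(theta, eta)), r_ext)

    theta = jnp.zeros(K)  # policy parameters
    eta = jnp.zeros(K)    # intrinsic reward parameters

    for _ in range(200):
        # Meta-gradient: differentiate the extrinsic return through the inner update.
        eta = eta + 0.1 * jax.grad(outer_objective)(eta, theta)
        theta = inner_update(theta, eta)  # the agent itself only ever sees eta

    print("learned intrinsic reward:", eta)
    print("policy trained on intrinsic reward only:", policy(theta))

In this toy setting, eta comes to reward the extrinsically valuable action, and a policy trained purely on eta also solves the extrinsic task; transferring eta to a fresh agent (a new theta, or a different learner) is what the generalisation experiments above probe.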
Conclusion
  • The authors revisited the optimal reward problem (Singh et al., 2009) and proposed a more scalable gradient-based method for learning intrinsic rewards (see the schematic formulation after this list).
  • The authors showed that the learned intrinsic rewards can generalise to different agent-environment interfaces, such as different action spaces and different learning algorithms, whereas policy transfer methods fail to generalise.
  • This highlights the difference between the “what” kind of knowledge captured by rewards and the “how” kind of knowledge captured by policies.
  • The flexibility and range of knowledge captured by intrinsic rewards in the proof-of-concept experiments encourages further work towards combining different loci of knowledge to achieve greater practical benefits.
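Schematically, in the notation used above (θ for the policy, η for the intrinsic reward), the optimal reward problem can be written as a bilevel optimisation; this is a paraphrase for orientation, not a formula reproduced from the paper:

    \eta^{*} = \arg\max_{\eta}\; \mathbb{E}\!\left[ J^{\mathrm{ext}}\!\left(\theta_{T}(\eta)\right) \right],
    \qquad
    \theta_{t+1} = \theta_{t} + \alpha\, \nabla_{\theta} J^{\mathrm{int}}(\theta_{t}, \eta)

The gradient-based approach computes ∇_η J^ext by backpropagating the extrinsic return through the chain of intrinsic-reward-driven policy updates, i.e. ∇_η J^ext(θ_{t+1}) = ∇_θ J^ext(θ_{t+1}) ∇_η θ_{t+1}.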
Tables
  • Table 1: Hyperparameters
References
  • Sarah Bechtle, Artem Molchanov, Yevgen Chebotar, Edward Grefenstette, Ludovic Righetti, Gaurav Sukhatme, and Franziska Meier. Meta-learning via learned loss. arXiv preprint arXiv:1906.05374, 2019.
  • Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Rémi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471–1479, 2016.
  • Jack Clark and Dario Amodei. Faulty reward functions in the wild. CoRR, 2016. URL https://blog.openai.com/.
  • Ignasi Clavera, Anusha Nagabandi, Simin Liu, Ronald S. Fearing, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HyztsoC5Y7.
  • Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL²: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
  • Yan Duan, Marcin Andrychowicz, Bradly Stadie, OpenAI Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. In Advances in Neural Information Processing Systems, pp. 1087–1098, 2017.
  • Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1126–1135. JMLR.org, 2017a.
  • Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot visual imitation learning via meta-learning. In Conference on Robot Learning, pp. 357–368, 2017b.
  • Goren Gordon and Ehud Ahissar. Reinforcement active learning hierarchical loops. In The 2011 International Joint Conference on Neural Networks, pp. 3008–3015. IEEE, 2011.
  • Xiaoxiao Guo, Satinder Singh, Richard Lewis, and Honglak Lee. Deep learning for reward design to improve Monte Carlo tree search in Atari games. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp. 1519–1525. AAAI Press, 2016.
  • Anna Harutyunyan, Sam Devlin, Peter Vrancx, and Ann Nowé. Expressing arbitrary reward functions as potential-based advice. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 2652–2658. AAAI Press, 2015.
  • Laurent Itti and Pierre F Baldi. Bayesian surprise attracts human attention. In Advances in Neural Information Processing Systems, pp. 547–554, 2006.
  • Cam Linke, Nadia M Ady, Martha White, Thomas Degris, and Adam White. Adapting behaviour via intrinsic reward: A survey and empirical study. arXiv preprint arXiv:1906.07865, 2019.
  • Luke Metz, Niru Maheswaranathan, Brian Cheung, and Jascha Sohl-Dickstein. Meta-learning update rules for unsupervised representation learning. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HkNDsiC9KQ.
  • Andrew Y Ng, Daishi Harada, and Stuart J Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning, pp. 278–287. Morgan Kaufmann Publishers Inc., 1999.
  • Georg Ostrovski, Marc G Bellemare, Aaron van den Oord, and Rémi Munos. Count-based exploration with neural density models. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2721–2730. JMLR.org, 2017.
  • Pierre-Yves Oudeyer, Frédéric Kaplan, and Verena V Hafner. Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation, 11(2):265–286, 2007.
  • Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2778–2787. JMLR.org, 2017.
  • Pascal Poupart, Nikos Vlassis, Jesse Hoey, and Kevin Regan. An analytic solution to discrete Bayesian reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pp. 697–704. ACM, 2006.
  • Jette Randløv and Preben Alstrøm. Learning to drive a bicycle using reinforcement learning and shaping. In Proceedings of the Fifteenth International Conference on Machine Learning, pp. 463–471. Morgan Kaufmann Publishers Inc., 1998.
  • Matthew Schlegel, Andrew Patterson, Adam White, and Martha White. Discovery of predictive representations with a network of general value functions, 2018. URL https://openreview.net/forum?id=ryZElGZ0Z.
  • Jürgen Schmidhuber, Jieyu Zhao, and MA Wiering. Simple principles of metalearning. Technical Report IDSIA, 69:1–23, 1996.
  • Jürgen Schmidhuber. Curious model-building control systems. In Proceedings of the International Joint Conference on Neural Networks, pp. 1458–1463, 1991a.
  • Jürgen Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In Proceedings of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pp. 222–227, 1991b.
  • Satinder Singh, Richard L Lewis, and Andrew G Barto. Where do rewards come from? In Proceedings of the Annual Conference of the Cognitive Science Society, pp. 2601–2606. Cognitive Science Society, 2009.
  • Satinder Singh, Richard L Lewis, Andrew G Barto, and Jonathan Sorg. Intrinsically motivated reinforcement learning: An evolutionary perspective. IEEE Transactions on Autonomous Mental Development, 2(2):70–82, 2010.
  • Jonathan Sorg, Richard L Lewis, and Satinder Singh. Reward design via online gradient ascent. In Advances in Neural Information Processing Systems, pp. 2190–2198, 2010.
  • Bradly Stadie, Ge Yang, Rein Houthooft, Peter Chen, Yan Duan, Yuhuai Wu, Pieter Abbeel, and Ilya Sutskever. The importance of sampling in meta-reinforcement learning. In Advances in Neural Information Processing Systems, pp. 9280–9290, 2018.
  • Alexander L Strehl and Michael L Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
  • Richard S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, pp. 216–224. Morgan Kaufmann, 1990.
  • Richard S Sutton, David A McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.
  • Vivek Veeriah, Matteo Hessel, Zhongwen Xu, Janarthanan Rajendran, Richard L Lewis, Junhyuk Oh, Hado P van Hasselt, David Silver, and Satinder Singh. Discovery of useful questions as auxiliary tasks. In Advances in Neural Information Processing Systems, pp. 9306–9317, 2019.
  • Jane X. Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z. Leibo, Rémi Munos, Charles Blundell, Dharshan Kumaran, and Matthew M Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
  • Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
  • Kelvin Xu, Ellis Ratner, Anca Dragan, Sergey Levine, and Chelsea Finn. Learning a prior over intent via meta-inverse reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, pp. 6952–6962, 2019.
  • Tianbing Xu, Qiang Liu, Liang Zhao, and Jian Peng. Learning to explore via meta-policy gradient. In International Conference on Machine Learning, pp. 5459–5468, 2018a.
  • Zhongwen Xu, Hado P van Hasselt, and David Silver. Meta-gradient reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2396–2407, 2018b.
  • Zeyu Zheng, Junhyuk Oh, and Satinder Singh. On learning intrinsic rewards for policy gradient methods. In Advances in Neural Information Processing Systems, pp. 4644–4654, 2018.