Meta-Gradient Reinforcement Learning with an Objective Discovered Online

NeurIPS 2020


Abstract

Deep reinforcement learning includes a broad family of algorithms that parameterise an internal representation, such as a value function or policy, by a deep neural network. Each algorithm optimises its parameters with respect to an objective, such as Q-learning or policy gradient, that defines its semantics. In this work, we propose an...

Introduction
  • Recent advances in supervised and unsupervised learning have been driven by a transition from handcrafted expert features to deep representations [14]; these are typically learned by gradient descent on a suitable objective function to adjust a rich parametric function approximator.
  • The authors applied the algorithm for online discovery of an off-policy learning objective to independent training runs on each of 57 classic Atari games.
  • The authors describe the proposed algorithm for online learning of reinforcement learning objectives using meta-gradients.
  • The authors train the meta-network using an end-to-end meta-gradient algorithm, so as to learn an update target that leads to good subsequent performance (see the sketch below).
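The end-to-end meta-gradient update mentioned in the last bullet can be summarised compactly. The following is a minimal sketch, assuming a single inner step for readability (the algorithm applies several); θ are the agent parameters, η parameterises the discovered target, and α, β and the validation trajectory τ′ are notation introduced here for illustration:

```latex
% Minimal two-level sketch: one agent update on the discovered (inner)
% objective, then a meta-gradient step on a validation (outer) loss.
% alpha, beta and tau' are illustrative notation, not the paper's symbols.
\begin{align}
  \theta'(\eta) &= \theta - \alpha\,\nabla_{\theta} L^{\mathrm{inner}}_{\eta}(\tau, \theta) \\
  \eta &\leftarrow \eta - \beta\,\nabla_{\eta} L^{\mathrm{outer}}\!\big(\tau',\, \theta'(\eta)\big)
\end{align}
```

Because θ′(η) depends on η through the inner loss, the gradient of the outer loss flows back through the agent update, which is what lets the update target itself be adjusted towards good subsequent performance.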
Highlights
  • Recent advances in supervised and unsupervised learning have been driven by a transition from handcrafted expert features to deep representations [14]; these are typically learned by gradient descent on a suitable objective function to adjust a rich parametric function approximator
  • Reinforcement learning (RL) has largely embraced the transition from handcrafting features to handcrafting objectives: deep function approximation has been successfully combined with ideas such as temporal difference (TD)-learning [29, 33], Q-learning [41, 22], double Q-learning [35, 36], n-step updates [31, 13], general value functions [32, 17], distributional value functions [7, 3], policy gradients [42, 20] and a variety of off-policy actor-critics [8, 10, 28]
  • We proposed an algorithm that allows reinforcement learning (RL) agents to learn their own objective during online interactions with their environment
  • The nature of the meta-network, and the objective of the RL algorithm, is discovered by meta-gradient descent over the sequence of updates based upon the discovered target
  • Our results in toy domains demonstrate that FRODO can successfully discover how to address key issues in RL, such as bootstrapping and non-stationarity, through online adaptation of its objective
  • Our results in Atari demonstrate that FRODO can successfully discover and adapt off-policy learning objectives that are distinct from, and perform better than, strong benchmark RL algorithms
Results
  • A different way to learn the RL objective is to directly parameterise a loss by a meta-network [2, 19], rather than the target of a loss.
  • Given a sequence of trajectories {τi, ..., τi+M, τi+M+1}, the authors apply multiple steps of gradient descent updates to the agent parameters θ according to the inner losses Lη^inner(τi, θi).
  • The meta-gradient algorithm above can be applied to any differentiable component of the update rule, for example to learn the discount factor γ and bootstrapping factor λ [43], intrinsic rewards [46, 45], and auxiliary tasks [38].
  • The authors apply meta-gradients to learn the meta-parameters of the update target gη online, where η are the parameters of a neural network.
  • After M updates, the authors compute the outer loss Louter from a validation trajectory τ as the squared difference between the predicted value and a canonical multi-step bootstrapped return G(τ), as used in classic RL: Louter = (vθ(S) - G(τ))² (a schematic code sketch of this inner/outer loop follows this list).
  • The meta-network receives the rewards Rt, the discounts γt, and, as in the motivating examples from Section 4, the values from future time-steps v(St+1), to allow bootstrapping from the learned predictions.
  • This allows the inner loss to potentially discover off-policy algorithms, by constructing suitable off-policy update targets for the policy and value function.
  • The authors applied the FRODO algorithm to learn a target online, using an outer loss based on the actor-critic algorithm IMPALA [10] together with a consistency loss weighted by c = 0.1.
  • In Figure 3a the authors observe that the meta-gradient algorithm was slow at first, gradually discovering an effective objective.
  • Over time the meta-gradient algorithm learned to learn more rapidly, overtaking the actor-critic baseline and achieving significantly stronger final results.
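Taken together, the Results bullets describe a two-level optimisation: M inner updates of the agent on the discovered target gη, followed by a meta-update of η on a validation trajectory. Below is a minimal, hypothetical JAX sketch of that loop for value prediction only; the linear value function, the tiny meta-network, plain SGD, the dictionary field names (obs, reward, discount, returns) and the way agent and meta updates are interleaved are all illustrative assumptions, not the authors' implementation (which also uses an IMPALA-based outer loss and a consistency term).

```python
# A minimal sketch of the online meta-gradient loop described above (value
# prediction only). All names, shapes and hyper-parameters here are
# illustrative assumptions, not the paper's implementation.
import jax
import jax.numpy as jnp

ALPHA, BETA, M = 0.1, 0.01, 4        # inner step size, meta step size, inner steps

def value(theta, obs):
    # Toy linear value function v_theta(s); the paper uses a deep network.
    return obs @ theta

def meta_target(eta, reward, discount, next_value):
    # Meta-network g_eta: maps (R_t, gamma_t, v(S_{t+1})) to an update target,
    # so it can learn, e.g., to bootstrap from future predictions.
    feats = jnp.stack([reward, discount, next_value], axis=-1)
    return (feats @ eta["w"] + eta["b"]).squeeze(-1)

def inner_loss(theta, eta, traj):
    # L_eta^inner: regress the value towards the discovered target
    # (target treated as fixed w.r.t. theta, as in bootstrapped value learning).
    v = value(theta, traj["obs"])
    v_next = value(theta, traj["next_obs"])
    target = meta_target(eta, traj["reward"], traj["discount"],
                         jax.lax.stop_gradient(v_next))
    return jnp.mean((v - target) ** 2)

def inner_update(theta, eta, traj):
    # One SGD step of the agent parameters on the inner loss.
    return theta - ALPHA * jax.grad(inner_loss)(theta, eta, traj)

def outer_loss(eta, theta, trajs, val_traj):
    # Apply M inner updates driven by g_eta, then score the resulting agent
    # on a validation trajectory against a multi-step bootstrapped return G(tau).
    for traj in trajs:                # len(trajs) == M
        theta = inner_update(theta, eta, traj)
    v = value(theta, val_traj["obs"])
    return jnp.mean((v - val_traj["returns"]) ** 2)

@jax.jit
def meta_step(theta, eta, trajs, val_traj):
    # The meta-gradient of the outer loss w.r.t. eta flows back through the M
    # inner updates (the "end-to-end" part); the agent updates are then applied.
    meta_grad = jax.grad(outer_loss)(eta, theta, trajs, val_traj)
    eta = jax.tree_util.tree_map(lambda p, g: p - BETA * g, eta, meta_grad)
    for traj in trajs:
        theta = inner_update(theta, eta, traj)
    return theta, eta
```

In this sketch, eta would be a small pytree such as {"w": jnp.zeros((3, 1)), "b": jnp.zeros((1,))} and trajs a list of M trajectory dictionaries; the jax.grad(outer_loss) call is what makes the discovered target accountable for performance after the M updates, mirroring the description above.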
Conclusion
  • The objective, the target used to update the policy and value function, is parameterised by a deep neural meta-network.
  • The nature of the meta-network, and the objective of the RL algorithm, is discovered by meta-gradient descent over the sequence of updates based upon the discovered target.
  • The authors' results in Atari demonstrate that FRODO can successfully discover and adapt off-policy learning objectives that are distinct from, and perform better than, strong benchmark RL algorithms.
  • These examples illustrate the generality of the proposed method, and suggest its potential both to recover existing concepts and to discover new concepts for RL algorithms.
Tables
  • Table 1: Detailed hyper-parameters for Atari experiments
Related work
  • The idea of learning to learn by gradient descent has a long history. In supervised learning, IDBD and SMD [30, 27] used a meta-gradient approach to adapt the learning rate online so as to optimise future performance. “Learning to learn by gradient descent by gradient descent” [1] used meta-gradients, offline and over multiple lifetimes, to learn a gradient-based optimiser, parameterised by a “black-box” neural network. MAML [11] and REPTILE [23] also use meta-gradients, offline and over multiple lifetimes, to learn initial parameters that can be optimised more efficiently.

    In reinforcement learning, methods such as meta reinforcement learning [39] and RL2 [9] allow a recurrent network to jointly represent, in its activations, both the agent’s representation of state and also its internal parameters. Xu et al. [43] introduced meta-gradients as a general but efficient approach for optimising the meta-parameters of gradient-based RL agents. This approach has since been applied to many different meta-parameters of RL algorithms, such as the discount γ and bootstrapping parameter λ [43], intrinsic rewards [46, 45], auxiliary tasks [38], off-policy corrections [44], and to parameterise returns as a linear combination of rewards [40] (without any bootstrapping). The meta-gradient approach has also been applied, offline and over multiple lifetimes, to black-box parameterisations, via deep neural networks, of the entire RL algorithm [2, 19] and [24] (contemporaneous work); evolutionary approaches have also been applied [16].
References
  • M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in neural information processing systems, pages 3981–3989, 2016.
  • S. Bechtle, A. Molchanov, Y. Chebotar, E. Grefenstette, L. Righetti, G. Sukhatme, and F. Meier. Meta-learning via learned loss. arXiv preprint arXiv:1906.05374, 2019.
  • M. G. Bellemare, W. Dabney, and R. Munos. A distributional perspective on reinforcement learning. In ICML, 2017.
  • M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
  • J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, and S. Wanderman-Milne. JAX: composable transformations of Python+NumPy programs, 2018.
  • D. Budden, M. Hessel, J. Quan, and S. Kapturowski. RLax: Reinforcement Learning in JAX, 2020.
  • K.-J. Chung and M. J. Sobel. Discounted mdp’s: distribution functions and exponential utility maximization. SIAM Journal on Control and Optimization, 1987.
  • T. Degris, M. White, and R. S. Sutton. Off-policy actor-critic. In ICML, 2012.
  • Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel. Rl2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
  • L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu. IMPALA: Scalable distributed deep-rl with importance weighted actor-learner architectures. In ICML, 2018.
  • C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1126–1135. JMLR. org, 2017.
  • T. Hennigan, T. Cai, T. Norman, and I. Babuschkin. Haiku: Sonnet for JAX, 2020.
  • M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Comput., 18(7):1527–1554, July 2006.
  • S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • R. Houthooft, Y. Chen, P. Isola, B. Stadie, F. Wolski, O. J. Ho, and P. Abbeel. Evolved policy gradients. In Advances in Neural Information Processing Systems, pages 5400–5409, 2018.
  • M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. ICLR, 2016.
  • N. P. Jouppi, C. Young, N. Patil, D. A. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, et al. In-Datacenter performance analysis of a Tensor Processing Unit. ISCA, 2017.
  • L. Kirsch, S. van Steenkiste, and J. Schmidhuber. Improving generalization in meta reinforcement learning using learned objectives. ICLR, 2020.
  • V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
  • V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In Advances in neural information processing systems, pages 2204–2212, 2014.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, Feb. 2015.
  • A. Nichol, J. Achiam, and J. Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.
  • J. Oh, M. Hessel, W. Czarnecki, Z. Xu, H. van Hasselt, S. Singh, and D. Silver. Discovering reinforcement learning algorithms. arXiv preprint, 2020.
  • M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., USA, 1st edition, 1994.
  • G. A. Rummery and M. Niranjan. On-line Q-learning using connectionist systems, volume 37. University of Cambridge, Department of Engineering Cambridge, UK, 1994.
  • N. N. Schraudolph. Local gain adaptation in stochastic gradient descent. ICANN, 1999.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • R. S. Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.
  • R. S. Sutton. Adapting bias by gradient descent: An incremental version of delta-bar-delta. In AAAI, pages 171–176, 1992.
  • R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • R. S. Sutton, J. Modayil, M. Delp, T. Degris, P. M. Pilarski, A. White, and D. Precup. Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In L. Sonenberg, P. Stone, K. Tumer, and P. Yolum, editors, AAMAS, pages 761–768. IFAAMAS, 2011.
  • G. Tesauro. Temporal difference learning and TD-Gammon. Commun. ACM, 38(3):58–68, Mar. 1995.
  • T. Tieleman and G. Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
  • H. van Hasselt. Double Q-learning. In Advances in neural information processing systems, pages 2613–2621, 2010.
  • H. van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. In AAAI, 2016.
  • H. van Seijen, H. van Hasselt, S. Whiteson, and M. Wiering. A theoretical and empirical analysis of Expected Sarsa. In IEEE symposium on adaptive dynamic programming and reinforcement learning, pages 177–184. IEEE, 2009.
  • V. Veeriah, M. Hessel, Z. Xu, J. Rajendran, R. L. Lewis, J. Oh, H. P. van Hasselt, D. Silver, and S. Singh. Discovery of useful questions as auxiliary tasks. In Advances in Neural Information Processing Systems, pages 9306–9317, 2019.
  • J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
  • Y. Wang, Q. Ye, and T.-Y. Liu. Beyond exponentially discounted sum: Automatic learning of return function. arXiv preprint arXiv:1905.11591, 2019.
  • C. J. Watkins and P. Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.
  • R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
  • Z. Xu, H. P. van Hasselt, and D. Silver. Meta-gradient reinforcement learning. In Advances in neural information processing systems, pages 2396–2407, 2018.
  • T. Zahavy, Z. Xu, V. Veeriah, M. Hessel, J. Oh, H. van Hasselt, D. Silver, and S. Singh. Self-tuning deep reinforcement learning. arXiv preprint arXiv:2002.12928, 2020.
  • Z. Zheng, J. Oh, M. Hessel, Z. Xu, M. Kroiss, H. van Hasselt, D. Silver, and S. Singh. What can learned intrinsic rewards capture? ICML, 2020.
  • Z. Zheng, J. Oh, and S. Singh. On learning intrinsic rewards for policy gradient methods. In Advances in Neural Information Processing Systems, pages 4644–4654, 2018.