Meta-Gradient Reinforcement Learning with an Objective Discovered Online
NeurIPS 2020
Deep reinforcement learning includes a broad family of algorithms that parameterise an internal representation, such as a value function or policy, by a deep neural network. Each algorithm optimises its parameters with respect to an objective, such as Q-learning or policy gradient, that defines its semantics. In this work, we propose an…
- Recent advances in supervised and unsupervised learning have been driven by a transition from handcrafted expert features to deep representations; these are typically learned by gradient descent on a suitable objective function to adjust a rich parametric function approximator.
- The authors applied the algorithm for online discovery of an off-policy learning objective to independent training runs on each of 57 classic Atari games.
- The authors describe the proposed algorithm for online learning of reinforcement learning objectives using meta-gradients.
- The authors train the meta-network using an end-to-end meta-gradient algorithm, so as to learn an update target that leads to good subsequent performance.
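The end-to-end meta-gradient idea above can be illustrated with a deliberately tiny sketch (not the paper's implementation; all names and the scalar setting are illustrative): the agent's value estimate is v(s) = θ·φ, the inner update regresses v towards a target parameterised by a meta-parameter η, and η is then adjusted by differentiating an outer loss, measured on a held-out validation transition, back through the inner update.

```python
# Scalar sketch of meta-gradient RL: an inner update towards a learned
# target g_eta, and an outer (meta) update of eta through that inner step.
alpha, beta = 0.1, 0.05   # inner and outer (meta) learning rates
theta, eta = 0.0, 0.0     # agent parameter and meta-parameter
phi, phi_val = 1.0, 1.0   # features of the training / validation states
G = 1.0                   # multi-step bootstrapped return (outer target)

for _ in range(1000):
    # Inner update: theta' = theta - alpha * d/dtheta (theta*phi - g_eta)^2,
    # with the trivial meta-parameterised target g_eta = eta.
    theta_new = theta - alpha * 2.0 * (theta * phi - eta) * phi

    # Meta-gradient: differentiate the outer loss (theta'*phi_val - G)^2
    # w.r.t. eta *through* the inner update; d(theta')/d(eta) = 2*alpha*phi.
    d_outer_d_eta = 2.0 * (theta_new * phi_val - G) * phi_val * (2.0 * alpha * phi)
    eta -= beta * d_outer_d_eta
    theta = theta_new
```

After training, the discovered target η approaches the return G, so that the inner update alone drives θ towards parameters with low outer loss; the paper replaces the scalar η with a deep meta-network and hand-derived gradients with automatic differentiation.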
- Reinforcement learning (RL) has largely embraced the transition from handcrafting features to handcrafting objectives: deep function approximation has been successfully combined with ideas such as temporal-difference (TD) learning [29, 33], Q-learning [41, 22], double Q-learning [35, 36], n-step updates [31, 13], general value functions [32, 17], distributional value functions [7, 3], policy gradients [42, 20] and a variety of off-policy actor-critics [8, 10, 28].
- We proposed an algorithm that allows reinforcement learning (RL) agents to learn their own objective during online interactions with their environment
- The nature of the meta-network, and the objective of the RL algorithm, is discovered by meta-gradient descent over the sequence of updates based upon the discovered target
- Our results in toy domains demonstrate that FRODO can successfully discover how to address key issues in RL, such as bootstrapping and non-stationarity, through online adaptation of its objective
- Our results in Atari demonstrate that FRODO can successfully discover and adapt off-policy learning objectives that are distinct from, and perform better than, strong benchmark RL algorithms
- A different way to learn the RL objective is to directly parameterise a loss by a meta-network [2, 19], rather than the target of a loss.
- Given a sequence of trajectories {τi, ..., τi+M, τi+M+1}, the authors apply multiple steps of gradient-descent updates to the agent parameters θ according to the inner losses L^inner_η(τi, θi).
- The meta-gradient algorithm above can be applied to any differentiable component of the update rule, for example to learn the discount factor γ and bootstrapping factor λ, intrinsic rewards [46, 45], and auxiliary tasks.
- The authors apply meta-gradients to learn the meta-parameters of the update target gη online, where η are the parameters of a neural network.
- After M updates, the authors compute the outer loss L^outer from a validation trajectory τ as the squared difference between the predicted value and a canonical multi-step bootstrapped return G(τ), as used in classic RL.
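The canonical multi-step bootstrapped return used as the outer-loss regression target can be computed by folding rewards backwards from a bootstrap value. A minimal sketch (function and argument names are illustrative, not from the paper's code):

```python
def n_step_return(rewards, discounts, v_bootstrap):
    """Multi-step bootstrapped return:
    G = r_0 + g_0*(r_1 + g_1*(... + g_{n-1}*v_bootstrap))."""
    g = v_bootstrap
    # Fold backwards: each step adds its reward and discounts the tail.
    for r, d in zip(reversed(rewards), reversed(discounts)):
        g = r + d * g
    return g

# Example: three rewards of 1 with discount 0.9, bootstrapping from v = 10:
# G = 1 + 0.9*(1 + 0.9*(1 + 0.9*10)) = 10.0
g = n_step_return([1.0, 1.0, 1.0], [0.9, 0.9, 0.9], 10.0)
```

Passing per-step discounts (rather than a single γ) also handles episode termination, where the discount is set to zero.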
- It receives the rewards Rt, discounts γt, and, as in the motivating examples from Section 4, the values from future time-steps v(St+1), to allow bootstrapping from the learned predictions.
- This allows the inner loss to potentially discover off-policy algorithms, by constructing suitable off-policy update targets for the policy and value function.
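To make the meta-network's interface concrete: per time-step it consumes the reward R_t, the discount γ_t, and the bootstrap value v(S_{t+1}), and emits the update target g_t. The paper's meta-network is a recurrent network over the trajectory; the feed-forward sketch below (all weights and names hypothetical) only illustrates the inputs and output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny illustrative meta-network: [R_t, gamma_t, v(S_{t+1})] -> target g_t.
W1 = rng.normal(scale=0.1, size=(3, 16))
b1 = np.zeros(16)
W2 = rng.normal(scale=0.1, size=(16, 1))
b2 = np.zeros(1)

def meta_target(r_t, gamma_t, v_next):
    x = np.array([r_t, gamma_t, v_next])
    h = np.tanh(x @ W1 + b1)       # hidden layer
    return float(h @ W2 + b2)      # scalar update target g_t

g_t = meta_target(r_t=1.0, gamma_t=0.99, v_next=0.5)
```

Before any meta-training the output is near zero; meta-gradient descent on the outer loss is what shapes these weights so that g_t behaves like a useful (possibly off-policy) update target.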
- The authors applied the FRODO algorithm to learn a target online, using an outer loss based on the actor-critic algorithm IMPALA, and including a consistency loss with weight c = 0.1.
- In Figure 3a the authors see that the meta-gradient algorithm slowly and gradually discovered an effective objective.
- Over time the meta-gradient algorithm learned to learn more rapidly, overtaking the actor-critic baseline and achieving significantly stronger final results.
- The objective, the target used to update the policy and value function, is parameterised by a deep neural meta-network.
- These examples illustrate the generality of the proposed method, and suggest its potential both to recover existing concepts, and to discover new concepts for RL algorithms
- Table 1: Detailed hyper-parameters for Atari experiments
- The idea of learning to learn by gradient descent has a long history. In supervised learning, IDBD and SMD [30, 27] used a meta-gradient approach to adapt the learning rate online so as to optimise future performance. "Learning to learn by gradient descent by gradient descent" used meta-gradients, offline and over multiple lifetimes, to learn a gradient-based optimiser, parameterised by a "black-box" neural network. MAML and REPTILE also use meta-gradients, offline and over multiple lifetimes, to learn initial parameters that can be optimised more efficiently.
In reinforcement learning, methods such as meta reinforcement learning and RL2 allow a recurrent network to jointly represent, in its activations, both the agent's representation of state and also its internal parameters. Xu et al. introduced meta-gradients as a general but efficient approach for optimising the meta-parameters of gradient-based RL agents. This approach has since been applied to many different meta-parameters of RL algorithms, such as the discount γ and bootstrapping parameter λ, intrinsic rewards [46, 45], auxiliary tasks, off-policy corrections, and parameterising returns as a linear combination of rewards (without any bootstrapping). The meta-gradient approach has also been applied, offline and over multiple lifetimes, to black-box parameterisations, via deep neural networks, of the entire RL algorithm [2, 19], including contemporaneous work; evolutionary approaches have also been applied.
- M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in neural information processing systems, pages 3981–3989, 2016.
- S. Bechtle, A. Molchanov, Y. Chebotar, E. Grefenstette, L. Righetti, G. Sukhatme, and F. Meier. Meta-learning via learned loss. arXiv preprint arXiv:1906.05374, 2019.
- M. G. Bellemare, W. Dabney, and R. Munos. A distributional perspective on reinforcement learning. In ICML, 2017.
- M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
- J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, and S. Wanderman-Milne. JAX: composable transformations of Python+NumPy programs, 2018.
- D. Budden, M. Hessel, J. Quan, and S. Kapturowski. RLax: Reinforcement Learning in JAX, 2020.
- K.-J. Chung and M. J. Sobel. Discounted MDPs: distribution functions and exponential utility maximization. SIAM Journal on Control and Optimization, 1987.
- T. Degris, M. White, and R. S. Sutton. Off-policy actor-critic. In ICML, 2012.
- Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel. Rl2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
- L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In ICML, 2018.
- C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1126–1135. JMLR.org, 2017.
- T. Hennigan, T. Cai, T. Norman, and I. Babuschkin. Haiku: Sonnet for JAX, 2020.
- M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Comput., 18(7):1527–1554, July 2006.
- S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- R. Houthooft, Y. Chen, P. Isola, B. Stadie, F. Wolski, O. J. Ho, and P. Abbeel. Evolved policy gradients. In Advances in Neural Information Processing Systems, pages 5400–5409, 2018.
- M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. ICLR, 2016.
- N. P. Jouppi, C. Young, N. Patil, D. A. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, et al. In-Datacenter performance analysis of a Tensor Processing Unit. ISCA, 2017.
- L. Kirsch, S. van Steenkiste, and J. Schmidhuber. Improving generalization in meta reinforcement learning using learned objectives. ICLR, 2020.
- V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
- V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In Advances in neural information processing systems, pages 2204–2212, 2014.
- V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, Feb. 2015.
- A. Nichol, J. Achiam, and J. Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.
- J. Oh, M. Hessel, W. Czarnecki, Z. Xu, H. van Hasselt, S. Singh, and D. Silver. Discovering reinforcement learning algorithms. arXiv preprint, 2020.
- M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., USA, 1st edition, 1994.
- G. A. Rummery and M. Niranjan. On-line Q-learning using connectionist systems, volume 37. University of Cambridge, Department of Engineering Cambridge, UK, 1994.
- N. N. Schraudolph. Local gain adaptation in stochastic gradient descent. ICANN, 1999.
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- R. S. Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.
- R. S. Sutton. Adapting bias by gradient descent: An incremental version of delta-bar-delta. In AAAI, pages 171–176, 1992.
- R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
- R. S. Sutton, J. Modayil, M. Delp, T. Degris, P. M. Pilarski, A. White, and D. Precup. Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In L. Sonenberg, P. Stone, K. Tumer, and P. Yolum, editors, AAMAS, pages 761–768. IFAAMAS, 2011.
- G. Tesauro. Temporal difference learning and TD-Gammon. Commun. ACM, 38(3):58–68, Mar. 1995.
- T. Tieleman and G. Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
- H. van Hasselt. Double Q-learning. In Advances in neural information processing systems, pages 2613–2621, 2010.
- H. van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. In AAAI, 2016.
- H. van Seijen, H. van Hasselt, S. Whiteson, and M. Wiering. A theoretical and empirical analysis of Expected Sarsa. In IEEE symposium on adaptive dynamic programming and reinforcement learning, pages 177–184. IEEE, 2009.
- V. Veeriah, M. Hessel, Z. Xu, J. Rajendran, R. L. Lewis, J. Oh, H. P. van Hasselt, D. Silver, and S. Singh. Discovery of useful questions as auxiliary tasks. In Advances in Neural Information Processing Systems, pages 9306–9317, 2019.
- J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
- Y. Wang, Q. Ye, and T.-Y. Liu. Beyond exponentially discounted sum: Automatic learning of return function. arXiv preprint arXiv:1905.11591, 2019.
- C. J. Watkins and P. Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.
- R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
- Z. Xu, H. P. van Hasselt, and D. Silver. Meta-gradient reinforcement learning. In Advances in neural information processing systems, pages 2396–2407, 2018.
- T. Zahavy, Z. Xu, V. Veeriah, M. Hessel, J. Oh, H. van Hasselt, D. Silver, and S. Singh. Self-tuning deep reinforcement learning. arXiv preprint arXiv:2002.12928, 2020.
- Z. Zheng, J. Oh, M. Hessel, Z. Xu, M. Kroiss, H. van Hasselt, D. Silver, and S. Singh. What can learned intrinsic rewards capture? ICML, 2020.
- Z. Zheng, J. Oh, and S. Singh. On learning intrinsic rewards for policy gradient methods. In Advances in Neural Information Processing Systems, pages 4644–4654, 2018.