# Discovering Reinforcement Learning Algorithms

NeurIPS 2020

Abstract

Reinforcement learning (RL) algorithms update an agent's parameters according to one of several possible rules, discovered manually through years of research. Automating the discovery of update rules from data could lead to more efficient algorithms, or algorithms that are better adapted to specific environments. Although there have been […]

Introduction

- Reinforcement learning (RL) has a clear objective: to maximise expected cumulative rewards, which is simple, yet general enough to capture many aspects of intelligence.
- Recent work has shown that it is possible to meta-learn a policy update rule when given a value function, and that the resulting update rule can generalise to similar or unseen tasks.
- It remains an open question whether it is feasible to discover fundamental concepts of RL entirely from scratch.
- A defining aspect of RL algorithms is their ability to learn and utilise value functions.
- Discovering concepts such as value functions requires an understanding of both ‘what to predict’ and ‘how to make use of the prediction’.
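
The two ingredients in the last bullet are exactly what hand-designed RL fixes in advance. As a concrete illustration (a toy example of our own, not taken from the paper), tabular TD(0) hard-codes both choices: the prediction is a value function, and it is used by bootstrapping toward r + γV(s′):

```python
import numpy as np

# Tabular TD(0) hard-codes both choices that the paper proposes to discover:
#   'what to predict'           -> a value function V(s)
#   'how to use the prediction' -> bootstrap toward r + gamma * V(s')
# Toy 2-state chain of our own: state 0 -> state 1, state 1 loops on itself.
gamma, alpha = 0.9, 0.1
V = np.zeros(2)                            # value estimate per state
transitions = [(0, 0.0, 1), (1, 1.0, 1)]   # (state, reward, next_state)

for _ in range(500):
    for s, r, s_next in transitions:
        td_target = r + gamma * V[s_next]    # 'how to use the prediction'
        V[s] += alpha * (td_target - V[s])   # move prediction toward target

print(V)  # V[1] approaches 1 / (1 - 0.9) = 10, V[0] approaches 0.9 * 10 = 9
```

The paper's Learned Policy Gradient instead meta-learns both the semantics of a prediction vector y and how the agent's update should use it.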

Highlights

- Reinforcement learning (RL) has a clear objective: to maximise expected cumulative rewards, which is simple, yet general enough to capture many aspects of intelligence.
- An appealing alternative approach is to automatically discover RL algorithms from data generated by interaction with a set of environments, which can be formulated as a meta-learning problem.
- We evaluated the ability of the discovered RL algorithm to generalise to new environments.
- Regularisation: we find that the optimisation can be very hard and unstable, mainly because the Learned Policy Gradient (LPG) needs to learn an appropriate semantics of its predictions y, as well as learning to use the predictions y properly for bootstrapping without access to a value function.
- The results from a small set of toy environments showed that the discovered LPG maintains rich information in its predictions, which was crucial for efficient bootstrapping.
- The radical generalisation from the toy domains to Atari games shows that it may be feasible to discover an efficient RL algorithm from interactions with environments, which could lead to entirely new approaches to RL.
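
At a very high level, the discovery setup behind these highlights is an outer loop that improves the parameters of an update rule based on how well inner-loop agents trained with it perform. The sketch below is a drastically simplified stand-in of our own (zeroth-order search on a one-parameter toy problem; the paper uses meta-gradients through actual agent updates, and every name here is hypothetical):

```python
import numpy as np

# Stand-in for the discovery loop (our own toy, not the paper's architecture):
# meta-parameters eta define the update rule; inner agents train with it; eta
# is nudged toward whichever small perturbation yields a higher final return.
rng = np.random.default_rng(0)

def inner_train(eta, steps=100):
    """Train a one-parameter 'agent' with rule eta; return its final return."""
    theta = 0.0
    for _ in range(steps):
        grad = 3.0 - theta               # stand-in for a policy-gradient term
        theta += eta[0] * grad + eta[1]  # eta = (learned step size, learned bias)
    return -(theta - 3.0) ** 2           # higher is better; optimum at theta = 3

eta = np.array([0.01, 0.0])              # initial rule: tiny step size, no bias
for _ in range(200):                     # outer loop: improve the update rule
    noise = rng.normal(scale=0.01, size=2)
    if inner_train(eta + noise) > inner_train(eta - noise):
        eta = eta + noise
    else:
        eta = eta - noise

print(eta, inner_train(eta))  # the discovered rule trains the agent near-optimally
```

The same structure, with the scalar rule replaced by an LSTM over agent trajectories and the search replaced by meta-gradients, is what makes the optimisation hard and motivates the regularisation discussed above.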

Conclusion

- This paper made the first attempt to meta-learn a full RL update rule by jointly discovering both ‘what to predict’ and ‘how to bootstrap’, replacing existing RL concepts such as the value function and TD-learning.
- The results from a small set of toy environments showed that the discovered LPG maintains rich information in its predictions, which was crucial for efficient bootstrapping.
- The authors believe this is just the beginning of fully data-driven discovery of RL algorithms; there are many promising directions to extend the work, from procedural generation of environments to new advanced architectures and alternative ways to generate experience.
- If the proposed research direction succeeds, it could shift the research paradigm from manually developing RL algorithms to building a proper set of environments, so that the resulting algorithm is efficient.

Tables

- Table 1: Methods for discovering RL algorithms
- Table 2: Meta-hyperparameters for meta-training
- Table 3: Agent hyperparameters for each training environment
- Table 4: Hyperparameters used for meta-testing on Atari games

Related work

- Early Work on Learning to Learn: The idea of learning to learn has been discussed for a long time, with various formulations such as improving genetic programming [26], learning a neural network update rule [3], learning-rate adaptations [29], self-weight-modifying RNNs [27], and transfer of domain-invariant knowledge [31]. Such work showed that it is possible to learn not only to optimise fixed objectives, but also to improve the way optimisation is done at a meta-level.

Learning to Learn for Few-Shot Task Adaptation: Learning to learn has received much attention in the context of few-shot learning [25, 33]. MAML [9, 10] meta-learns initial parameters by backpropagating through the parameter updates. RL2 [7, 34] formulates learning itself as an RL problem by unrolling LSTMs [14] across the agent's entire lifetime. Other approaches include simple approximation [23], RNNs with Hebbian learning [19, 20], and gradient preconditioning [11]. None of these approaches clearly separates the agent from the algorithm; thus, the resulting meta-learned algorithms are, by definition of the problem, specific to a single agent architecture.
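
The MAML mechanism mentioned above, 'backpropagating through the parameter updates', can be sketched in a few lines. The toy quadratic "tasks" below are our own illustration, not the cited paper's setup:

```python
import jax
import jax.numpy as jnp

# Meta-learn an initialisation theta0 by differentiating *through* one inner
# gradient step on each task (toy quadratic tasks of our own devising).
def task_loss(theta, target):
    return jnp.sum((theta - target) ** 2)

def adapted_loss(theta0, target, inner_lr=0.1):
    # one inner-loop gradient step on the task, then re-evaluate
    theta1 = theta0 - inner_lr * jax.grad(task_loss)(theta0, target)
    return task_loss(theta1, target)

targets = jnp.array([[1.0], [3.0]])  # two tasks with optima at 1 and 3
meta_loss = lambda t0: jnp.mean(jax.vmap(adapted_loss, (None, 0))(t0, targets))

theta0 = jnp.zeros(1)
for _ in range(100):  # outer loop: gradient flows through the inner update
    theta0 = theta0 - 0.1 * jax.grad(meta_loss)(theta0)

print(theta0)  # converges toward 2.0, between the two task optima
```

Because the inner update is itself differentiable, the outer gradient accounts for how the initialisation will be adapted, which is the key difference from simply minimising the average task loss at theta0 directly (which here happens to share the same optimum, but in general does not).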

References

- M. G. Bellemare, W. Dabney, and R. Munos. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 449–458. JMLR.org, 2017.
- M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
- Y. Bengio, S. Bengio, and J. Cloutier. Learning a synaptic learning rule. Citeseer, 1990.
- J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, and S. Wanderman-Milne. JAX: composable transformations of Python+NumPy programs, 2018.
- Y. Chebotar, A. Molchanov, S. Bechtle, L. Righetti, F. Meier, and G. Sukhatme. Meta-learning via learned loss. arXiv preprint arXiv:1906.05374, 2019.
- W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos. Distributional reinforcement learning with quantile regression. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
- L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.
- C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1126–1135. JMLR.org, 2017.
- C. Finn and S. Levine. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. arXiv preprint arXiv:1710.11622, 2017.
- S. Flennerhag, A. A. Rusu, R. Pascanu, H. Yin, and R. Hadsell. Meta-learning with warped gradient descent. arXiv preprint arXiv:1909.00025, 2019.
- M. Hausknecht, J. Lehman, R. Miikkulainen, and P. Stone. A neuroevolution approach to general atari game playing. IEEE Transactions on Computational Intelligence and AI in Games, 6(4):355–366, 2014.
- M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- R. Houthooft, Y. Chen, P. Isola, B. Stadie, F. Wolski, O. J. Ho, and P. Abbeel. Evolved policy gradients. In Advances in Neural Information Processing Systems, pages 5400–5409, 2018.
- N. P. Jouppi, C. Young, N. Patil, D. A. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, R. C. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon. In-datacenter performance analysis of a tensor processing unit. CoRR, abs/1704.04760, 2017.
- L. Kirsch, S. van Steenkiste, and J. Schmidhuber. Improving generalization in meta reinforcement learning using learned objectives. In International Conference on Learning Representations, 2020.
- T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- T. Miconi, J. Clune, and K. O. Stanley. Differentiable plasticity: training plastic neural networks with backpropagation. arXiv preprint arXiv:1804.02464, 2018.
- T. Miconi, A. Rawal, J. Clune, and K. O. Stanley. Backpropamine: training self-modifying neural networks with differentiable neuromodulated plasticity. arXiv preprint arXiv:2002.10585, 2020.
- V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
- V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- A. Nichol, J. Achiam, and J. Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.
- I. Osband, Y. Doron, M. Hessel, J. Aslanides, E. Sezener, A. Saraiva, K. McKinney, T. Lattimore, C. Szepezvari, S. Singh, et al. Behaviour suite for reinforcement learning. arXiv preprint arXiv:1908.03568, 2019.
- A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. Meta-learning with memory-augmented neural networks. In International conference on machine learning, pages 1842–1850, 2016.
- J. Schmidhuber. Evolutionary principles in self-referential learning. Master’s thesis, Technische Universitat Munchen, Germany, 1987.
- D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In International conference on machine learning, 2014.
- R. S. Sutton. Adapting bias by gradient descent: An incremental version of delta-bar-delta. In AAAI, pages 171–176, 1992.
- R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
- S. Thrun and T. M. Mitchell. Learning one more thing. In IJCAI, 1995.
- V. Veeriah, M. Hessel, Z. Xu, J. Rajendran, R. L. Lewis, J. Oh, H. P. van Hasselt, D. Silver, and S. Singh. Discovery of useful questions as auxiliary tasks. In Advances in Neural Information Processing Systems, pages 9306–9317, 2019.
- O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In Advances in neural information processing systems, pages 3630–3638, 2016.
- J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
- Y. Wang, Q. Ye, and T.-Y. Liu. Beyond exponentially discounted sum: Automatic learning of return function. arXiv preprint arXiv:1905.11591, 2019.
- Z. Xu, H. van Hasselt, M. Hessel, J. Oh, S. Singh, and D. Silver. Meta-gradient reinforcement learning with an objective discovered online. arXiv preprint, 2020.
- Z. Xu, H. P. van Hasselt, and D. Silver. Meta-gradient reinforcement learning. In Advances in neural information processing systems, pages 2396–2407, 2018.
- T. Zahavy, Z. Xu, V. Veeriah, M. Hessel, J. Oh, H. van Hasselt, D. Silver, and S. Singh. Self-tuning deep reinforcement learning. arXiv preprint arXiv:2002.12928, 2020.
- Z. Zheng, J. Oh, M. Hessel, Z. Xu, M. Kroiss, H. van Hasselt, D. Silver, and S. Singh. What can learned intrinsic rewards capture? arXiv preprint arXiv:1912.05500, 2019.
- Z. Zheng, J. Oh, and S. Singh. On learning intrinsic rewards for policy gradient methods. In Advances in Neural Information Processing Systems, pages 4644–4654, 2018.
- W. Zhou, Y. Li, Y. Yang, H. Wang, and T. M. Hospedales. Online meta-critic learning for off-policy actor-critic methods. arXiv preprint arXiv:2003.05334, 2020.
