Discovering Reinforcement Learning Algorithms

NeurIPS 2020 (2020)

Cited by: 12 | Views: 292
Abstract

Reinforcement learning (RL) algorithms update an agent's parameters according to one of several possible rules, discovered manually through years of research. Automating the discovery of update rules from data could lead to more efficient algorithms, or algorithms that are better adapted to specific environments. Although there have been …

Introduction
  • Reinforcement learning (RL) has a clear objective: to maximise expected cumulative rewards, which is simple, yet general enough to capture many aspects of intelligence.
  • Recent work has shown that it is possible to meta-learn a policy update rule when given a value function, and that the resulting update rule can generalise to similar or unseen tasks.
  • It remains an open question whether it is feasible to discover fundamental concepts of RL entirely from scratch.
  • A defining aspect of RL algorithms is their ability to learn and utilise value functions.
  • Discovering concepts such as value functions requires an understanding of both ‘what to predict’ and ‘how to make use of the prediction’.
Highlights
  • Reinforcement learning (RL) has a clear objective: to maximise expected cumulative rewards, which is simple, yet general enough to capture many aspects of intelligence.
  • An appealing alternative approach is to automatically discover RL algorithms from data generated by interaction with a set of environments, which can be formulated as a meta-learning problem.
  • We evaluated the ability of the discovered RL algorithm to generalise to new environments.
  • Regularisation: We find that the optimisation can be very hard and unstable, mainly because the Learned Policy Gradient (LPG) needs to learn appropriate semantics for the predictions y, as well as how to use y for bootstrapping without access to a value function (a minimal sketch of the resulting agent update follows this list).
  • The results from a small set of toy environments showed that the discovered LPG maintains rich information in the prediction, which was crucial for efficient bootstrapping.
  • The radical generalisation from the toy domains to Atari games shows that it may be feasible to discover an efficient RL algorithm from interactions with environments, which would potentially lead to entirely new approaches to RL.
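The agent-side update mentioned in these highlights can be made concrete. Assuming, as a simplification of the paper's setup, that the learned update rule emits a scalar policy target π̂ and a categorical prediction target ŷ for each transition, the agent follows an update of roughly the form Δθ ∝ E[∇θ log π(a|s) · π̂ − α_y ∇θ KL(ŷ ‖ y_θ(s))]. The JAX snippet below (JAX is the library the authors cite [4]) is a minimal, self-contained sketch of such an update, not the authors' implementation; the linear policy and prediction heads, the toy sizes, the categorical treatment of y, and the coefficient alpha_y are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of an LPG-style agent update:
# descend a pseudo-loss whose gradient matches
#   Delta(theta) proportional to E[ grad log pi(a|s) * pi_hat - alpha_y * grad KL(y_hat || y_theta(s)) ].
import jax
import jax.numpy as jnp

OBS, ACTIONS, BINS = 4, 3, 30   # toy sizes; BINS is the dimensionality assumed here for y

def pseudo_loss(params, obs, act, pi_hat, y_hat, alpha_y=0.5):
    # Targets come from the (frozen) learned update rule, so block gradients through them.
    pi_hat = jax.lax.stop_gradient(pi_hat)
    y_hat = jax.lax.stop_gradient(y_hat)
    log_pi = jax.nn.log_softmax(obs @ params["w_pi"])                # [T, ACTIONS]
    log_pi_a = jnp.take_along_axis(log_pi, act[:, None], axis=1)[:, 0]
    log_y = jax.nn.log_softmax(obs @ params["w_y"])                  # [T, BINS]
    kl = jnp.sum(y_hat * (jnp.log(y_hat + 1e-8) - log_y), axis=1)    # KL(y_hat || y_theta)
    return jnp.mean(-pi_hat * log_pi_a + alpha_y * kl)

# Random data standing in for one truncated trajectory and the update rule's outputs.
k1, k2, k3, k4 = jax.random.split(jax.random.PRNGKey(0), 4)
params = {"w_pi": jnp.zeros((OBS, ACTIONS)), "w_y": jnp.zeros((OBS, BINS))}
obs = jax.random.normal(k1, (16, OBS))
act = jax.random.randint(k2, (16,), 0, ACTIONS)
pi_hat = jax.random.normal(k3, (16,))                    # would come from the LPG network
y_hat = jax.nn.softmax(jax.random.normal(k4, (16, BINS)))
grads = jax.grad(pseudo_loss)(params, obs, act, pi_hat, y_hat)
params = jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)
```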
Conclusion
  • This paper made the first attempt to meta-learn a full RL update rule by jointly discovering both ‘what to predict’ and ‘how to bootstrap’, replacing existing RL concepts such as the value function and TD-learning (a toy illustration of the underlying meta-gradient principle follows this list).
  • The results from a small set of toy environments showed that the discovered LPG maintains rich information in the prediction, which was crucial for efficient bootstrapping.
  • The authors believe this is just the beginning of the fully data-driven discovery of RL algorithms; there are many promising directions to extend the work, from procedural generation of environments, to new advanced architectures and alternative ways to generate experience.
  • If the proposed research direction succeeds, this could shift the research paradigm from manually developing RL algorithms to building a proper set of environments so that the resulting algorithm is efficient.
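The discovery itself is driven by meta-gradients: the update rule's parameters are adjusted so that agents trained with the rule end up performing well, which requires differentiating final performance through a chain of inner updates. The toy JAX example below illustrates only that principle, with a single learned step size standing in for the full LSTM update rule and a quadratic regression loss standing in for the agents' returns; the names and settings are illustrative assumptions rather than the paper's algorithm.

```python
# Toy illustration of the meta-gradient principle behind training an update rule:
# tune the rule's parameter (here a single learned step size eta) by differentiating
# performance measured *after* several inner updates made with that rule.
import jax
import jax.numpy as jnp

def inner_update(eta, theta, x, y):
    """One step of the 'learned' update rule: SGD with a learned step size."""
    loss = lambda t: jnp.mean((x @ t - y) ** 2)
    return theta - eta * jax.grad(loss)(theta)

def final_loss(eta, theta0, x, y, steps=5):
    theta = theta0
    for _ in range(steps):                      # inner loop: train the 'agent'
        theta = inner_update(eta, theta, x, y)
    return jnp.mean((x @ theta - y) ** 2)       # outer objective: performance after training

k1, k2 = jax.random.split(jax.random.PRNGKey(0))
x = jax.random.normal(k1, (32, 8))
y = x @ jax.random.normal(k2, (8,))
eta, theta0 = jnp.asarray(0.01), jnp.zeros(8)
meta_grad = jax.grad(final_loss)(eta, theta0, x, y)   # gradient flows through all inner updates
eta = eta - 1e-4 * meta_grad                          # one outer (meta) step on the update rule
```

In the paper, the outer objective is the return of agents trained across a distribution of environments, and the inner updates are the LPG updates sketched earlier.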
Tables
  • Table1: Methods for discovering RL algorithms
  • Table2: Meta-hyperparameters for meta-training
  • Table3: Agent hyperparameters for each training environment
  • Table4: Hyperparameters used for meta-testing on Atari games
Related work
  • Early Work on Learning to Learn: The idea of learning to learn has been discussed for a long time under various formulations, such as improving genetic programming [26], learning a neural network update rule [3], learning rate adaptations [29], self-weight-modifying RNNs [27], and transfer of domain-invariant knowledge [31]. Such work showed that it is possible to learn not only to optimise fixed objectives, but also to improve the optimisation process itself at a meta-level.

    Learning to Learn for Few-Shot Task Adaptation: Learning to learn has received much attention in the context of few-shot learning [25, 33]. MAML [9, 10] meta-learns initial parameters by backpropagating through the parameter updates (a minimal sketch of this idea follows this section). RL² [7, 34] formulates learning itself as an RL problem by unrolling LSTMs [14] across the agent’s entire lifetime. Other approaches include simple approximation [23], RNNs with Hebbian learning [19, 20], and gradient preconditioning [11]. None of these approaches clearly separates the agent from the algorithm; as a result, the meta-learned algorithms are, by definition of the problem, specific to a single agent architecture.
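As a concrete illustration of ‘backpropagating through the parameter updates’, the sketch below shows the core of a MAML-style meta-update on a toy regression task. It is a minimal example under assumed squared-error tasks and a single inner step, not the reference implementation of [9].

```python
# Minimal MAML-style sketch: meta-learn initial parameters theta0 by
# backpropagating the post-adaptation loss through one inner gradient step
# (toy squared-error regression tasks assumed).
import jax
import jax.numpy as jnp

def task_loss(theta, x, y):
    return jnp.mean((x @ theta - y) ** 2)

def adapted_loss(theta0, x_tr, y_tr, x_val, y_val, inner_lr=0.05):
    theta = theta0 - inner_lr * jax.grad(task_loss)(theta0, x_tr, y_tr)  # inner adaptation
    return task_loss(theta, x_val, y_val)                                # post-adaptation loss

k1, k2, k3 = jax.random.split(jax.random.PRNGKey(1), 3)
w_true = jax.random.normal(k1, (8,))
x_tr, x_val = jax.random.normal(k2, (16, 8)), jax.random.normal(k3, (16, 8))
y_tr, y_val = x_tr @ w_true, x_val @ w_true
theta0 = jnp.zeros(8)
# Outer step: the gradient flows through the inner update into the initialisation.
theta0 = theta0 - 0.01 * jax.grad(adapted_loss)(theta0, x_tr, y_tr, x_val, y_val)
```

LPG differs in that the meta-gradient is taken with respect to the parameters of the update rule itself rather than the agent's initial parameters.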
References
  • [1] M. G. Bellemare, W. Dabney, and R. Munos. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 449–458. JMLR.org, 2017.
  • [2] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
  • [3] Y. Bengio, S. Bengio, and J. Cloutier. Learning a synaptic learning rule. Citeseer, 1990.
  • [4] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, and S. Wanderman-Milne. JAX: composable transformations of Python+NumPy programs, 2018.
  • [5] Y. Chebotar, A. Molchanov, S. Bechtle, L. Righetti, F. Meier, and G. Sukhatme. Meta-learning via learned loss. arXiv preprint arXiv:1906.05374, 2019.
  • [6] W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos. Distributional reinforcement learning with quantile regression. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [7] Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel. RL²: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
  • [8] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.
  • [9] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1126–1135. JMLR.org, 2017.
  • [10] C. Finn and S. Levine. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. arXiv preprint arXiv:1710.11622, 2017.
  • [11] S. Flennerhag, A. A. Rusu, R. Pascanu, H. Yin, and R. Hadsell. Meta-learning with warped gradient descent. arXiv preprint arXiv:1909.00025, 2019.
  • [12] M. Hausknecht, J. Lehman, R. Miikkulainen, and P. Stone. A neuroevolution approach to general Atari game playing. IEEE Transactions on Computational Intelligence and AI in Games, 6(4):355–366, 2014.
  • [13] M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [14] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • [15] R. Houthooft, Y. Chen, P. Isola, B. Stadie, F. Wolski, O. J. Ho, and P. Abbeel. Evolved policy gradients. In Advances in Neural Information Processing Systems, pages 5400–5409, 2018.
  • [16] N. P. Jouppi, C. Young, N. Patil, D. A. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, R. C. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon. In-datacenter performance analysis of a tensor processing unit. CoRR, abs/1704.04760, 2017.
  • [17] L. Kirsch, S. van Steenkiste, and J. Schmidhuber. Improving generalization in meta reinforcement learning using learned objectives. In International Conference on Learning Representations, 2020.
  • [18] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • [19] T. Miconi, J. Clune, and K. O. Stanley. Differentiable plasticity: training plastic neural networks with backpropagation. arXiv preprint arXiv:1804.02464, 2018.
  • [20] T. Miconi, A. Rawal, J. Clune, and K. O. Stanley. Backpropamine: training self-modifying neural networks with differentiable neuromodulated plasticity. arXiv preprint arXiv:2002.10585, 2020.
  • [21] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
  • [22] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • [23] A. Nichol, J. Achiam, and J. Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.
  • [24] I. Osband, Y. Doron, M. Hessel, J. Aslanides, E. Sezener, A. Saraiva, K. McKinney, T. Lattimore, C. Szepezvari, S. Singh, et al. Behaviour suite for reinforcement learning. arXiv preprint arXiv:1908.03568, 2019.
  • [25] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pages 1842–1850, 2016.
  • [26] J. Schmidhuber. Evolutionary principles in self-referential learning. Master’s thesis, Technische Universität München, Germany, 1987.
  • [28] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In International Conference on Machine Learning, 2014.
  • [29] R. S. Sutton. Adapting bias by gradient descent: An incremental version of delta-bar-delta. In AAAI, pages 171–176, 1992.
  • [30] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT Press, 2018.
  • [31] S. Thrun and T. M. Mitchell. Learning one more thing. In IJCAI, 1995.
  • [32] V. Veeriah, M. Hessel, Z. Xu, J. Rajendran, R. L. Lewis, J. Oh, H. P. van Hasselt, D. Silver, and S. Singh. Discovery of useful questions as auxiliary tasks. In Advances in Neural Information Processing Systems, pages 9306–9317, 2019.
  • [33] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.
  • [34] J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
  • [35] Y. Wang, Q. Ye, and T.-Y. Liu. Beyond exponentially discounted sum: Automatic learning of return function. arXiv preprint arXiv:1905.11591, 2019.
  • [36] Z. Xu, H. van Hasselt, M. Hessel, J. Oh, S. Singh, and D. Silver. Meta-gradient reinforcement learning with an objective discovered online. arXiv preprint, 2020.
  • [37] Z. Xu, H. P. van Hasselt, and D. Silver. Meta-gradient reinforcement learning. In Advances in Neural Information Processing Systems, pages 2396–2407, 2018.
  • [38] T. Zahavy, Z. Xu, V. Veeriah, M. Hessel, J. Oh, H. van Hasselt, D. Silver, and S. Singh. Self-tuning deep reinforcement learning. arXiv preprint arXiv:2002.12928, 2020.
  • [39] Z. Zheng, J. Oh, M. Hessel, Z. Xu, M. Kroiss, H. van Hasselt, D. Silver, and S. Singh. What can learned intrinsic rewards capture? arXiv preprint arXiv:1912.05500, 2019.
  • [40] Z. Zheng, J. Oh, and S. Singh. On learning intrinsic rewards for policy gradient methods. In Advances in Neural Information Processing Systems, pages 4644–4654, 2018.
  • [41] W. Zhou, Y. Li, Y. Yang, H. Wang, and T. M. Hospedales. Online meta-critic learning for off-policy actor-critic methods. arXiv preprint arXiv:2003.05334, 2020.