
# Modeling and Optimization Trade-off in Meta-learning

NeurIPS 2020


## Abstract

By searching for shared inductive biases across tasks, meta-learning promises to accelerate learning on novel tasks, but at the cost of solving a complex bilevel optimization problem. We introduce and rigorously define the trade-off between accurate modeling and optimization ease in meta-learning. At one end, classic meta-learning algorithms…
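Concretely, the two ends of the trade-off can be written as objectives (a minimal gloss in standard notation, not verbatim from the paper): domain randomized search (DRS, i.e., joint training) drops the adaptation step and solves a single-level problem, while MAML optimizes the risk after one inner gradient step of size α.

```latex
% DRS / joint training: a single-level objective over tasks \gamma \sim p(\gamma)
\min_{\theta} \ \mathbb{E}_{\gamma \sim p(\gamma)}\!\left[ \mathcal{L}_{\gamma}(\theta) \right]

% MAML: the inner gradient step of size \alpha models meta-testing adaptation,
% at the cost of a harder, bilevel optimization problem
\min_{\theta} \ \mathbb{E}_{\gamma \sim p(\gamma)}\!\left[ \mathcal{L}_{\gamma}\!\left( \theta - \alpha \nabla_{\theta} \mathcal{L}_{\gamma}(\theta) \right) \right]
```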


## Introduction

- The major bottleneck of applying machine learning to many practical problems is the cost associated with data and/or labeling.
- While the cost of labeling and data makes supervised learning problems expensive, the high sample complexity of reinforcement learning makes it downright inapplicable for many practical settings.
- Meta-learning is designed to ease the sample complexity of these methods.
- It has had success stories on a wide range of problems including image recognition and reinforcement learning [14].
- Under the PAC framework, Baxter [2] shows that given sufficiently many tasks and data per task during meta-training, there are guarantees on the generalization of learned biases to novel tasks.

## Highlights

- The major bottleneck of applying machine learning to many practical problems is the cost associated with data and/or labeling
- Meta-learning, or ‘learning to learn’ [24], makes the observation that if the learner has access to a collection of tasks sampled from a distribution p(γ), it can utilize an offline meta-training stage to search for shared inductive biases that assist in learning future tasks from p(γ).
- In other words, our result shows that domain randomized search (DRS), which ignores the meta-learning problem structure as discussed in Section 1, provably solves the problem of meta-learning the initialization of an iterative optimization procedure under sensible assumptions.
- This paper introduces an important trade-off in meta-learning: that between accurately modeling the meta-learning problem and the complexity of the resulting optimization problem.
- Classic meta-learning algorithms account for the structure of the problem space but define complex optimization objectives.
- Through an analysis of the sample complexity for smooth nonconvex risk functions, we show that DRS and MAML both solve the meta-learning problem and delineate the roles of optimization complexity and modeling accuracy (the sketch below makes the difference between the two gradient computations concrete).
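To make the two ends concrete, here is a minimal sketch of the corresponding gradient computations on a single quadratic task. This is our own toy construction, not the paper's code: the quadratic risk, the step size, and all function names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, alpha = 5, 0.1

def grad_risk(theta, theta_star):
    # Quadratic task risk R(theta) = 0.5 * ||theta - theta_star||^2,
    # whose gradient is simply (theta - theta_star).
    return theta - theta_star

def drs_grad(theta, theta_star):
    # DRS / joint training differentiates the task risk at theta itself:
    # a plain single-level stochastic gradient.
    return grad_risk(theta, theta_star)

def maml_grad(theta, theta_star):
    # MAML differentiates the risk at the adapted point
    # theta' = theta - alpha * grad_risk(theta), so the chain rule brings in
    # the Hessian through the factor (I - alpha * H). For this quadratic,
    # H = I and the factor collapses to the scalar (1 - alpha); in general it
    # requires second-order information.
    theta_prime = theta - alpha * grad_risk(theta, theta_star)
    return (1.0 - alpha) * grad_risk(theta_prime, theta_star)

theta = rng.normal(size=d)          # meta-parameters
theta_star = rng.normal(size=d)     # optimum of one sampled task gamma
print("DRS gradient: ", drs_grad(theta, theta_star))
print("MAML gradient:", maml_grad(theta, theta_star))
```

For non-quadratic losses the (I − αH) factor makes each MAML gradient both costlier and noisier to estimate, which is the optimization-ease side of the trade-off.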

## Results

- The authors carried out simulations to empirically study the trade-off in the linear regression case.
- Figure 3 shows contour plots of the fraction of datasets for which the MAML estimate has lower expected loss, both before and after meta-testing optimization, for several values of α (a toy version of this before/after comparison is sketched after this list).
- For the first two environments, with variations in system dynamics only, DRS is superior to MAML throughout training (Figures 4-(a,b) and 5-(a,b)).
- For the four environments with variations in reward functions only, DRS and MAML are generally comparable (Figures 4-(d) and 5-(e,f)).
- In the final two environments, with variations in both system dynamics and reward functions, the standard errors are generally too large to make a definitive statement (Figures 4-(g,h) and 5-(h)).
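As a toy illustration of "expected loss before versus after meta-testing optimization": the task environment, step size, and closed-form quadratic risks below are our own assumptions for illustration, not the paper's simulation setup.

```python
import numpy as np

rng = np.random.default_rng(1)
d, alpha, n_tasks = 2, 0.5, 200

# Task gamma has risk R_gamma(theta) = 0.5 * ||theta - theta_gamma||^2.
theta_gammas = rng.normal(size=(n_tasks, d))

def mean_risk(thetas):
    # Accepts one shared parameter (d,) or per-task parameters (n_tasks, d).
    return 0.5 * np.mean(np.sum((thetas - theta_gammas) ** 2, axis=-1))

theta_hat = theta_gammas.mean(axis=0)  # stand-in for a meta-learned initialization
# One meta-testing gradient step per task, with step size alpha.
adapted = theta_hat - alpha * (theta_hat - theta_gammas)

print("expected loss before adaptation:", mean_risk(theta_hat))
print("expected loss after adaptation: ", mean_risk(adapted))
```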

## Conclusion

- This paper introduces an important trade-off in meta-learning: that between accurately modeling the meta-learning problem and the complexity of the resulting optimization problem.
- Classic meta-learning algorithms account for the structure of the problem space but define complex optimization objectives.
- Through an analysis of the sample complexity for smooth nonconvex risk functions, the authors show that DRS and MAML both solve the meta-learning problem and delineate the roles of optimization complexity and modeling accuracy.
- All three studies show that the balance of the trade-off is determined not only by the sample sizes but also by characteristics of the meta-learning problem, such as the smoothness of the task risk functions.


- Table 1: Learning rates (LR), step sizes, and inner learning rates chosen by grid search.

Each iteration of ProMP (TRPO-MAML) requires twice as many steps from the simulator as DRS+PPO (DRS+TRPO). Therefore, to ensure that each algorithm utilizes the same amount of data, ProMP (TRPO-MAML) is run for half as many iterations as DRS+PPO (DRS+TRPO): 1000 versus 2000 iterations for the robotic locomotion environments, and 10000 versus 20000 for the manipulation environments. These budgets go beyond the number of training steps used in Rothfuss et al. [21] and Yu et al. [33].
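The budget matching is simple arithmetic; a quick sketch (variable names are ours):

```python
# Equal simulator-step budgets (numbers from the experimental setup above).
COST_RATIO = 2  # simulator steps per ProMP iteration / steps per DRS iteration

drs_iterations = {"locomotion": 2000, "manipulation": 20000}
for env, drs_iters in drs_iterations.items():
    promp_iters = drs_iters // COST_RATIO
    # Both algorithms now consume the same number of simulator steps.
    print(f"{env}: DRS {drs_iters} iterations ~ ProMP {promp_iters} iterations")
```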

## Related work

- Recent work on few-shot image classification has shown that features from training a deep network classifier on a large training set, combined with a simple classifier at meta-testing, may outperform many meta-learning algorithms [32, 4, 25]; a similar observation has been made for few-shot object detection [31]. Packer et al. [17] show that DRS outperforms RL2 [5] on simple reinforcement learning environments where tasks correspond to different system dynamics. Our meta-RL experiments complement these works, and our theoretical studies partially explain them. We argue that there is a larger picture to be considered: the trade-off between modeling accuracy and optimization ease depends on characteristics of the dataset, model, and optimization, and should be studied on a case-by-case basis.

Previous theoretical studies of MAML have primarily focused on the meta-training stage.

## Study subjects and analysis

Studies: 3

For meta-linear regression, we prove theoretically and verify in simulations that while MAML can utilize the geometry of the distribution of task losses to improve performance through meta-testing optimization, this modeling gain can be counterbalanced by its greater optimization error for small sample sizes. All three studies show that the balance of the trade-off is determined not only by the sample sizes but also by characteristics of the meta-learning problem, such as the smoothness of the task risk functions. There are several interesting directions for future work.
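For intuition, here is a population-level sketch of the "geometry" point in our own notation, assuming quadratic task risks with curvature Σ_γ; it ignores the finite-sample optimization error that the paper's analysis balances against this modeling gain.

```latex
% Task risks for linear regression: quadratics with task-specific optimum
% \theta_\gamma and curvature \Sigma_\gamma
R_{\gamma}(\theta) = \tfrac{1}{2} (\theta - \theta_{\gamma})^{\top} \Sigma_{\gamma} (\theta - \theta_{\gamma})

% DRS minimizes \mathbb{E}[R_\gamma(\theta)], weighting tasks by \Sigma_\gamma:
\theta^{\mathrm{DRS}} = \left( \mathbb{E}[\Sigma_{\gamma}] \right)^{-1} \mathbb{E}[\Sigma_{\gamma} \theta_{\gamma}]

% MAML minimizes the post-adaptation risk
% \mathbb{E}\!\left[ R_{\gamma}\!\left( \theta - \alpha \nabla R_{\gamma}(\theta) \right) \right],
% which reweights tasks by Q_\gamma = (I - \alpha \Sigma_\gamma) \Sigma_\gamma (I - \alpha \Sigma_\gamma):
\theta^{\mathrm{MAML}} = \left( \mathbb{E}[Q_{\gamma}] \right)^{-1} \mathbb{E}[Q_{\gamma} \theta_{\gamma}]
```

The two solutions coincide when all tasks share the same curvature; when they differ, MAML's reweighting exploits that geometry, which is the modeling-accuracy side of the trade-off.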

## References

- Dario Amodei and Danny Hernandez. AI and compute, 2018. URL https://openai.com/blog/ai-and-compute.
- Jonathan Baxter. A model of inductive bias learning. Journal of artificial intelligence research, 12:149–198, 2000.
- Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
- Yinbo Chen, Xiaolong Wang, Zhuang Liu, Huijuan Xu, and Trevor Darrell. A new meta-baseline for few-shot learning. arXiv preprint arXiv:2003.04390, 2020.
- Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv:1611.02779, 2016.
- Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. On the convergence theory of gradient-based model-agnostic meta-learning algorithms. arXiv preprint arXiv:1908.10400, 2019.
- Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. Provably convergent policy gradient methods for model-agnostic meta-reinforcement learning. arXiv preprint arXiv:2002.05135, 2020.
- Chelsea Finn and Sergey Levine. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. arXiv preprint arXiv:1710.11622, 2017.
- Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1126–1135. JMLR, 2017.
- Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine. Online meta-learning. arXiv preprint arXiv:1902.08438, 2019.
- Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimilano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. arXiv preprint arXiv:1806.04910, 2018.
- Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
- Riccardo Grazzi, Luca Franceschi, Massimiliano Pontil, and Saverio Salzo. On the iteration complexity of hypergradient computation. arXiv preprint arXiv:2006.16218, 2020.
- Timothy Hospedales, Antreas Antoniou, Paul Micaelli, and Amos Storkey. Meta-learning in neural networks: A survey. arXiv preprint arXiv:2004.05439, 2020.
- Patrice Marcotte and Gilles Savard. Bilevel programming: A combinatorial perspective. In Graph theory and combinatorial optimization, pages 191–217. Springer, 2005.
- Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.
- Charles Packer, Katelyn Gao, Jernej Kos, Philipp Krähenbühl, Vladlen Koltun, and Dawn Song. Assessing generalization in deep reinforcement learning. arXiv preprint arXiv:1810.12282, 2018.
- Roger Penrose. On best approximate solutions of linear matrix equations. Mathematical Proceedings of the Cambridge Philosophical Society, 52(1):17–19, 1956.
- Aniruddh Raghu, Maithra Raghu, Samy Bengio, and Oriol Vinyals. Rapid learning or feature reuse? towards understanding the effectiveness of maml. arXiv preprint arXiv:1909.09157, 2019.
- Aravind Rajeswaran, Chelsea Finn, Sham M Kakade, and Sergey Levine. Meta-learning with implicit gradients. In Advances in Neural Information Processing Systems, pages 113–124, 2019.
- Jonas Rothfuss, Dennis Lee, Ignasi Clavera, Tamim Asfour, and Pieter Abbeel. Promp: Proximal meta-policy search. arXiv preprint arXiv:1810.06784, 2018.
- John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. Trust region policy optimization. In International conference on machine learning (ICML), pages 1889–1897, 2015.
- John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Sebastian Thrun and Lorien Pratt. Learning to learn: Introduction and overview. In Learning to learn, pages 3–17. Springer, 1998.
- Yonglong Tian, Yue Wang, Dilip Krishnan, Joshua B Tenenbaum, and Phillip Isola. Rethinking few-shot image classification: a good embedding is all you need? arXiv preprint arXiv:2003.11539, 2020.
- Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 23–30. IEEE, 2017.
- Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012.
- Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, 2018.
- Haoxiang Wang, Ruoyu Sun, and Bo Li. Global convergence and induced kernels of gradient-based meta-learning with neural nets. arXiv preprint arXiv:2006.14606, 2020.
- Lingxiao Wang, Qi Cai, Zhuoran Yang, and Zhaoran Wang. On the global optimality of model-agnostic meta-learning. arXiv preprint arXiv:2006.13182, 2020.
- Xin Wang, Thomas E Huang, Trevor Darrell, Joseph E Gonzalez, and Fisher Yu. Frustratingly simple few-shot object detection. arXiv preprint arXiv:2003.06957, 2020.
- Yan Wang, Wei-Lun Chao, Kilian Q Weinberger, and Laurens van der Maaten. Simpleshot: Revisiting nearest-neighbor classification for few-shot learning. arXiv preprint arXiv:1911.04623, 2019.
- Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning (CoRL), 2019.
