# On the Global Optimality of Model-Agnostic Meta-Learning

ICML, pp. 9837-9846 (2020)

Abstract

Model-agnostic meta-learning (MAML) formulates meta-learning as a bilevel optimization problem, where the inner level solves each subtask based on a shared prior, while the outer level searches for the optimal shared prior by optimizing its aggregated performance over all the subtasks. Despite its empirical success, MAML remains less un...

Introduction

- Meta-learning aims to find a prior that efficiently adapts to a new subtask based on past subtasks.
- Under an assumption on the functional convexity of the inner-level objective, the authors characterize the optimality gap of the ε-stationary point attained by meta-SL.
- MAML aims to find the globally optimal starting point θ∗ by minimizing the meta-objective L(θ) via gradient descent.
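
The bilevel structure described above can be illustrated on a toy problem. The sketch below is a minimal example, not the authors' setup: it assumes each subtask risk is the quadratic R_i(θ) = ½‖θ − c_i‖², that the inner level takes a single gradient step with step size `alpha` (as in the original MAML of Finn et al, 2017a), and that ε-stationarity is checked through the gradient norm of the meta-objective. The names `centers`, `alpha`, `beta`, and `eps` are hypothetical.

```python
import numpy as np

def inner_step(theta, center, alpha):
    """Inner level: one gradient step on R_i(theta) = 0.5 * ||theta - center||^2."""
    return theta - alpha * (theta - center)

def meta_objective(theta, centers, alpha):
    """Outer level: L(theta) = average post-adaptation risk over all subtasks."""
    adapted = inner_step(theta, centers, alpha)  # shape (n, d), broadcast over subtasks
    return 0.5 * np.mean(np.sum((adapted - centers) ** 2, axis=1))

def meta_gradient(theta, centers, alpha):
    """Exact gradient of L for this toy problem (chain rule through the inner step)."""
    # adapted_i - c_i = (1 - alpha) * (theta - c_i) and d(adapted_i)/d(theta) = (1 - alpha) * I
    return (1.0 - alpha) ** 2 * (theta - centers.mean(axis=0))

def maml(centers, alpha=0.1, beta=0.5, eps=1e-6, max_iters=10_000):
    """Gradient descent on L(theta) until an eps-stationary point is reached."""
    theta = np.zeros(centers.shape[1])
    for _ in range(max_iters):
        grad = meta_gradient(theta, centers, alpha)
        if np.linalg.norm(grad) <= eps:  # assumed stationarity test: small gradient norm
            break
        theta = theta - beta * grad
    return theta

# Three hypothetical subtasks, each identified by the minimizer of its own risk.
centers = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
theta_hat = maml(centers)
```

Because this toy meta-objective is convex, any ε-stationary point is automatically close to the global optimum (here, the mean of the `centers`). The paper's analysis targets the nonconvex case, where that implication does not come for free.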

Highlights

- Meta-learning aims to find a prior that efficiently adapts to a new subtask based on past subtasks
- For meta-reinforcement learning, we study a variant of model-agnostic meta-learning, which associates the solution to each subtask with the shared prior, namely πθ, through one step of proximal policy optimization (PPO) (Schulman et al, 2015, 2017) in the inner level of optimization
- We analyze the global optimality of the ε-stationary point attained by meta-reinforcement learning (Algorithm 1)
- We characterize the global optimality of the ε-stationary point attained by meta-supervised learning defined in (4.5)
- We analyze the global optimality of the ε-stationary point ω of the meta-objective L attained by neural meta-reinforcement learning
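
One step of a KL-regularized proximal policy update has a closed form, which makes the inner level of meta-RL easy to illustrate. The sketch below assumes a tabular setting and the update π_i(·|s) ∝ πθ(·|s) · exp(η · Q_i(s,·)), which is the maximizer of ⟨Q_i(s,·), π(·|s)⟩ − (1/η) · KL(π(·|s) ‖ πθ(·|s)) over the probability simplex. The paper's exact inner-level definition is not reproduced here, and `prior_probs`, `q_values`, and `eta` are hypothetical names.

```python
import numpy as np

def proximal_policy_step(prior_probs, q_values, eta):
    """Adapt a shared prior policy to one subtask with a single KL-regularized step.

    prior_probs: array of shape (num_states, num_actions), strictly positive rows summing to 1.
    q_values:    array of shape (num_states, num_actions), subtask action values (assumed given).
    eta:         proximal step size; eta = 0 returns the prior unchanged.
    """
    logits = np.log(prior_probs) + eta * q_values
    logits -= logits.max(axis=-1, keepdims=True)  # stabilize the softmax numerically
    unnorm = np.exp(logits)
    return unnorm / unnorm.sum(axis=-1, keepdims=True)

# Hypothetical shared prior and subtask action values for two states and two actions.
prior = np.array([[0.5, 0.5], [0.9, 0.1]])
q = np.array([[1.0, 0.0], [0.0, 2.0]])
adapted = proximal_policy_step(prior, q, eta=1.0)
```

In this form the adapted policy stays tied to the shared prior: a small `eta` keeps it near πθ, while a large `eta` moves it toward the greedy policy of the subtask.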

Results

- To analyze the global optimality of meta-RL, the authors define meta-visitation measures induced by the main effect πθ.
- For i ≠ j, to upper bound the L2(ϱπθ)-norms of dσπi,θ/dςj,πθ in (3.12), Assumption 3.4 requires the task distribution ι to generate similar MDPs, so that the meta-visitation measures {ςi,πθ}i∈[n] are similar across all the subtasks indexed by i ∈ [n].
- Given the ε-stationary point ω, if the output hωi of Aω(Di, l, H) approximates the minimizer of the risk Ri in (4.1) well, the Fréchet derivative δRi/δhωi defined in (4.9) is close to zero.
- For both neural meta-RL and neural meta-SL, the authors show that the global optimality of the attained ε-stationary points hinges on the representation power of the corresponding classes of overparameterized two-layer neural networks.
- Neural meta-RL maximizes the meta-objective L(θ) via gradient ascent with Winit as the starting point, where πi,θ is defined in (5.4) and Ji is the expected total reward of πi,θ in the MDP (S, A, Pi, ri, γi, ζi).
- The authors analyze the global optimality of the ε-stationary point ω of the meta-objective L attained by neural meta-RL.
- If the class of overparameterized two-layer neural networks with the parameter space Binit has sufficient representation power, the ε-stationary point ω attained by neural meta-RL is approximately globally optimal.
- The authors analyze the global optimality of the ε-stationary point attained by neural meta-SL associated with the squared loss, where the authors parameterize the hypothesis hθ(·) = f (·; θ) by the neural network defined in (5.1).
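
The role of overparameterization can be made concrete with a two-layer network and its linearization around the initialization. The sketch below is an assumption-laden illustration, since the network in (5.1) is not reproduced in this summary: it uses ReLU activations, fixed ±1 output weights `b`, a 1/√m scaling, Gaussian initialization, and a perturbation of Frobenius norm `R` around `W_init`. All names are hypothetical.

```python
import numpy as np

def init_params(m, d, rng):
    """Random first-layer weights and fixed +/-1 output weights (assumed scheme)."""
    W_init = rng.normal(size=(m, d)) / np.sqrt(d)
    b = rng.choice([-1.0, 1.0], size=m)
    return W_init, b

def two_layer_relu(x, W, b):
    """f(x; W) = (1 / sqrt(m)) * sum_r b_r * relu(W_r . x)."""
    m = W.shape[0]
    return (b * np.maximum(W @ x, 0.0)).sum() / np.sqrt(m)

def linearized(x, W, W_init, b):
    """First-order expansion of f around W_init: activation pattern frozen at W_init."""
    m = W.shape[0]
    active = (W_init @ x > 0.0).astype(float)
    return (b * active * (W @ x)).sum() / np.sqrt(m)

rng = np.random.default_rng(0)
m, d, R = 4096, 5, 1.0
W_init, b = init_params(m, d, rng)
x = rng.normal(size=d)
x /= np.linalg.norm(x)  # unit-norm input

delta = rng.normal(size=(m, d))
W = W_init + R * delta / np.linalg.norm(delta)  # ||W - W_init||_F = R

# Linearization error at a parameter inside the ball of radius R around W_init.
gap = abs(two_layer_relu(x, W, b) - linearized(x, W, W_init, b))
```

Only neurons whose activation flips between `W_init` and `W` contribute to `gap`, so it is always at most `R * ||x||` and is expected to shrink as the width `m` grows with `R` fixed. In that regime the network behaves like a model that is linear in its parameters, which is why representation power of the function class becomes the deciding factor.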

Conclusion

- The authors analyze the global optimality of the ε-stationary point ω attained by neural meta-SL.
- Under Assumption 5.3, a corollary characterizes the optimality gap of the ε-stationary point ω defined in (5.9).
- Similar to Corollary 5.2, by Corollary 5.4, if the function u defined in (5.11) is well approximated by an overparameterized two-layer neural network with a parameter from the parameter space B0 defined in (5.10), and the average risk R defined in (5.12) is upper bounded, the ε-stationary point ω attained by neural meta-SL is approximately globally optimal.

Related work

- Meta-learning is studied by various communities (Evgeniou and Pontil, 2004; Thrun and Pratt, 2012; Pentina and Lampert, 2014; Amit and Meir, 2017; Nichol et al, 2018; Nichol and Schulman, 2018; Khodak et al, 2019). See Pan and Yang (2009); Weiss et al (2016) for surveys of meta-learning and Taylor and Stone (2009) for a survey of meta-RL. Our work focuses on the model-agnostic formulation of meta-learning (MAML) proposed by Finn et al (2017a). In contrast to existing empirical studies, the theoretical analysis of MAML is relatively scarce. Fallah et al (2019) establish the convergence of three variants of MAML for nonconvex meta-objectives. Rajeswaran et al (2019) propose a variant of MAML that utilizes implicit gradients of the inner level of optimization and establish the convergence of such an algorithm. This line of work characterizes the convergence of MAML to the stationary points of the corresponding meta-objectives. Our work is complementary to this line of work in the sense that we characterize the global optimality of the stationary points attained by MAML. Meanwhile, Finn et al (2019) propose an online algorithm for MAML with regret guarantees, which rely on the strong convexity of the meta-objectives. In contrast, our work tackles nonconvex meta-objectives, which allows for neural function approximators, and characterizes the global optimality of MAML. Mendonca et al (2019) propose a meta-policy search method and characterize the global optimality for solving the subtasks under the assumption that the meta-objective is (approximately) globally optimal. Our work is complementary to their work in the sense that we characterize the global optimality of MAML in terms of optimizing the meta-objective. See also the concurrent work (Wang et al, 2020).

Reference

- Allen-Zhu, Z., Li, Y. and Liang, Y. (2018a). Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918.
- Allen-Zhu, Z., Li, Y. and Song, Z. (2018b). A convergence theory for deep learning via overparameterization. arXiv preprint arXiv:1811.03962.
- Amit, R. and Meir, R. (2017). Meta-learning by adjusting priors based on extended PAC-Bayes theory. arXiv preprint arXiv:1711.01244.
- Arora, S., Du, S. S., Hu, W., Li, Z. and Wang, R. (2019). Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584.
- Bai, Y. and Lee, J. D. (2019). Beyond linearization: On quadratic and higher-order approximation of wide neural networks. arXiv preprint arXiv:1910.01619.
- Cai, Q., Yang, Z., Lee, J. D. and Wang, Z. (2019). Neural temporal-difference learning converges to global optima. In Advances in Neural Information Processing Systems.
- Cao, Y. and Gu, Q. (2019a). Generalization bounds of stochastic gradient descent for wide and deep neural networks. arXiv preprint arXiv:1905.13210.
- Cao, Y. and Gu, Q. (2019b). A generalization theory of gradient descent for learning overparameterized deep ReLU networks. arXiv preprint arXiv:1902.01384.
- Chizat, L. and Bach, F. (2018). A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956.
- Daniely, A. (2017). SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems.
- Du, S. S., Lee, J. D., Li, H., Wang, L. and Zhai, X. (2018a). Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804.
- Du, S. S., Zhai, X., Poczos, B. and Singh, A. (2018b). Gradient descent provably optimizes overparameterized neural networks. arXiv preprint arXiv:1810.02054.
- Ekeland, I. and Temam, R. (1999). Convex analysis and variational problems, vol.
- Evgeniou, T. and Pontil, M. (2004). Regularized multi-task learning. In International Conference on Knowledge Discovery and Data Mining.
- Facchinei, F. and Pang, J.-S. (2007). Finite-dimensional variational inequalities and complementarity problems. Springer Science & Business Media.
- Fallah, A., Mokhtari, A. and Ozdaglar, A. (2019). On the convergence theory of gradient-based model-agnostic meta-learning algorithms. arXiv preprint arXiv:1908.10400.
- Fan, J., Ma, C. and Zhong, Y. (2019). A selective overview of deep learning. arXiv preprint arXiv:1904.05526.
- Finn, C., Abbeel, P. and Levine, S. (2017a). Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning.
- Finn, C., Rajeswaran, A., Kakade, S. and Levine, S. (2019). Online meta-learning. arXiv preprint arXiv:1902.08438.
- Finn, C., Xu, K. and Levine, S. (2018). Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems.
- Finn, C., Yu, T., Zhang, T., Abbeel, P. and Levine, S. (2017b). One-shot visual imitation learning via meta-learning. arXiv preprint arXiv:1709.04905.
- Gupta, A., Mendonca, R., Liu, Y., Abbeel, P. and Levine, S. (2018). Meta-reinforcement learning of structured exploration strategies. In Advances in Neural Information Processing Systems.
- Haarnoja, T., Tang, H., Abbeel, P. and Levine, S. (2017). Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning.
- Jacot, A., Gabriel, F. and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems.
- Kakade, S. and Langford, J. (2002). Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning.
- Khodak, M., Balcan, M.-F. and Talwalkar, A. (2019). Provable guarantees for gradient-based meta-learning. arXiv preprint arXiv:1902.10644.
- Konda, V. (2002). Actor-Critic Algorithms. Ph.D. thesis, Massachusetts Institute of Technology.
- Lee, J., Xiao, L., Schoenholz, S. S., Bahri, Y., Sohl-Dickstein, J. and Pennington, J. (2019). Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720.
- Li, Y. and Liang, Y. (2018). Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems.
- Li, Z., Zhou, F., Chen, F. and Li, H. (2017). Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835.
- Liu, B., Cai, Q., Yang, Z. and Wang, Z. (2019). Neural proximal/trust region policy optimization attains globally optimal policy. arXiv preprint arXiv:1906.10306.
- Mendonca, R., Gupta, A., Kralev, R., Abbeel, P., Levine, S. and Finn, C. (2019). Guided meta-policy search. In Advances in Neural Information Processing Systems.
- Nagabandi, A., Finn, C. and Levine, S. (2018). Deep online learning via meta-learning: Continual adaptation for model-based RL. arXiv preprint arXiv:1812.07671.
- Nichol, A., Achiam, J. and Schulman, J. (2018). On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999.
- Nichol, A. and Schulman, J. (2018). Reptile: A scalable meta-learning algorithm. arXiv preprint arXiv:1803.02999.
- Pan, S. J. and Yang, Q. (2009). A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22 1345–1359.
- Pentina, A. and Lampert, C. (2014). A PAC-Bayesian bound for lifelong learning. In International Conference on Machine Learning.
- Rajeswaran, A., Finn, C., Kakade, S. M. and Levine, S. (2019). Meta-learning with implicit gradients. In Advances in Neural Information Processing Systems.
- Rakelly, K., Shelhamer, E., Darrell, T., Efros, A. A. and Levine, S. (2018). Few-shot segmentation propagation with guided networks. arXiv preprint arXiv:1806.07373.
- Rockafellar, R. (1968). Integrals which are convex functionals. Pacific journal of mathematics, 24 525–539.
- Rudin, W. (2006). Real and complex analysis. McGraw-Hill.
- Schulman, J., Levine, S., Abbeel, P., Jordan, M. and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine learning, 3 9–44.
- Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
- Taylor, M. E. and Stone, P. (2009). Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10 1633–1685.
- Thrun, S. and Pratt, L. (2012). Learning to learn. Springer Science & Business Media.
- Wang, H., Sun, R. and Li, B. (2020). Global convergence and induced kernels of gradient-based meta-learning with neural nets. To appear on arXiv.
- Weiss, K., Khoshgoftaar, T. M. and Wang, D. (2016). A survey of transfer learning. Journal of Big Data, 3 9.
- Wu, L., Ma, C. and E, W. (2018). How SGD selects the global minima in over-parameterized learning: A dynamical stability perspective. In Advances in Neural Information Processing Systems.
- Xu, K., Ratner, E., Dragan, A., Levine, S. and Finn, C. (2018). Learning a prior over intent via meta-inverse reinforcement learning. arXiv preprint arXiv:1805.12573.
- Yoon, J., Kim, T., Dia, O., Kim, S., Bengio, Y. and Ahn, S. (2018). Bayesian model-agnostic meta-learning. In Advances in Neural Information Processing Systems.
- Yu, T., Abbeel, P., Levine, S. and Finn, C. (2018). One-shot hierarchical imitation learning of compound visuomotor tasks. arXiv preprint arXiv:1810.11043.
- Zou, D., Cao, Y., Zhou, D. and Gu, Q. (2018). Stochastic gradient descent optimizes overparameterized deep ReLU networks. arXiv preprint arXiv:1811.08888.
- Proof. The proof hinges on the following lemma, which is adapted from Cai et al. (2019).
- Lemma B.2 (Linearization Error (Cai et al., 2019)). Under Assumption 5.1, it holds for ω0, ω1, ω2 ∈ B = {θ ∈ R^md : ‖θ − Winit‖2 ≤ R} that
