Theoretical bounds on estimation error for meta-learning

International Conference on Learning Representations, 2020.

We prove novel minimax risk lower bounds and upper bounds for meta-learners.

Abstract:

Machine learning models have traditionally been developed under the assumption that the training and test distributions match exactly. However, recent successes in few-shot learning and related problems are encouraging signs that these models can be adapted to more realistic settings where the train and test distributions differ. Unfortunately...

Introduction
  • Many practical machine learning applications deal with distributional shift from training to testing.
  • The authors derive upper and lower bounds for a particular meta-learning setting.
  • The authors introduce novel lower bounds on minimax risk of parameter estimation in meta-learning.
Highlights
  • Many practical machine learning applications deal with distributional shift from training to testing.
  • One example is few-shot classification (Ravi & Larochelle, 2016; Vinyals et al., 2016), where new classes need to be learned at test time based on only a few examples for each novel class.
  • We provide novel upper bounds on the error rate for estimation in a hierarchical meta linear regression problem (see the model sketch after this list), which we verify through an empirical evaluation.
  • Meta-learning algorithms identify the inductive bias from source tasks and make models more adaptive to unseen novel distributions.
  • We have identified a gap between our lower and upper bounds when there are a large number of training tasks, which we hypothesize is a limitation of the proof technique that we applied to derive the lower bounds, suggesting an exciting direction for future research.
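
The following is a minimal LaTeX sketch of one plausible hierarchical meta linear regression environment of the kind referenced above; the Gaussian forms and the symbols theta_0, Sigma, sigma^2, and P_x are illustrative assumptions rather than the paper's exact specification.

    % Hypothetical hierarchical linear-regression environment (illustrative only).
    % Task parameters share an environment-level prior; each task then generates
    % its own regression data.
    \begin{align*}
      \theta_m &\sim \mathcal{N}(\theta_0, \Sigma), \quad m = 1, \dots, M+1
        && \text{(task parameters drawn from the environment)} \\
      x_{m,i} &\sim \mathcal{P}_x, \quad
      y_{m,i} = x_{m,i}^{\top}\theta_m + \varepsilon_{m,i}, \quad
      \varepsilon_{m,i} \sim \mathcal{N}(0, \sigma^2)
        && \text{(per-task regression data)}
    \end{align*}
    % Meta-training set S_{1:M}: n samples from each of tasks 1, ..., M.
    % Novel-task set S_{M+1}: k samples; the estimation target is \theta_{M+1}.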
Results
  • In Section 5, the authors consider learning in a hierarchical environment of linear models and provide both lower and upper bounds on the error of estimating the parameters of a novel linear regression problem.
  • The authors extend this to a meta-learning, or novel-task, setting by first drawing the meta-training set S1:M, consisting of n training data points from each of the first M distributions, for a total of nM samples.
  • The authors derive Corollary 1 that gives a lower bound in terms of the sample size in the training and novel tasks.
  • The authors provide a specialized bound that applies when the environment is partially observed — proving that in this setting training task data is insufficient to drive the minimax risk to zero.
  • The authors bound the statistical estimation error by the error on a corresponding decoding problem, in which the learner must predict the novel task index given the meta-training set S1:M and the novel-task set SM+1 (a generic sketch of this reduction follows this list).
  • The authors prove that no algorithm can generalize perfectly to tasks in unseen regions of the space with small k, regardless of the number of data points n observed in each meta-training task.
  • To investigate the benefit of additional meta-training tasks, the authors compare the derived minimax risk lower bounds to those achieved by i.i.d. learners.
  • The authors compute lower bounds on the minimax risk using the results from Section 4, revealing a 2d scaling of the meta-training sample complexity.
  • The authors compute lower bounds for parameter estimation with meta-learning over multiple linear regression tasks.
  • Unlike the lower bounds based on local packing, the lower bounds presented here predict that, if the meta-training tasks cover the space sufficiently, an optimal algorithm might hope to drive the error to zero with enough samples.
  • The authors observed that adding more tasks has a large effect in the low-data regime but, as predicted, the error has a non-zero asymptotic lower bound: eventually it is more beneficial to add more novel-task data samples.
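
Below is a generic LaTeX sketch of the reduction from estimation to decoding mentioned above, combining a local packing with Fano's inequality; the paper's precise quantities, conditioning, and constants may differ.

    % Standard local-packing / Fano argument (generic form, not the paper's exact statement).
    % {theta^(1), ..., theta^(N)} is a 2*delta-separated packing of the parameter set,
    % V is uniform on {1, ..., N}, and S = (S_{1:M}, S_{M+1}) is all observed data.
    \begin{align*}
      \inf_{\hat{\theta}} \sup_{\theta}\,
        \mathbb{E}\big[\rho(\hat{\theta}(S), \theta)^2\big]
      &\ge \delta^2 \, \inf_{\hat{V}} \Pr\big[\hat{V}(S) \neq V\big]
        && \text{(reduce estimation to decoding the packing index)} \\
      &\ge \delta^2 \left(1 - \frac{I(V; S) + \log 2}{\log N}\right)
        && \text{(Fano's inequality)}
    \end{align*}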
Conclusion
  • The authors have derived both lower bounds and upper bounds on the error of meta-learners, which are relevant in the few-shot learning setting where k is small.
  • The authors' bounds capture key features of the meta-learning problem, such as the effect of increasing the number of shots or training tasks (a toy simulation illustrating these effects follows this list).
  • The authors have identified a gap between the lower and upper bounds when there are a large number of training tasks, which they hypothesize is a limitation of the proof technique used to derive the lower bounds, suggesting an exciting direction for future research.
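
To make the shots-versus-tasks trade-off concrete, here is a minimal, hypothetical Python/NumPy simulation of a hierarchical linear regression environment; the Gaussian hierarchy, the average-then-shrink meta-learner, and all constants (d, M, n, k, tau2, sigma2) are illustrative assumptions rather than the paper's estimator or experimental setup.

    # Hypothetical simulation of a hierarchical meta linear regression environment,
    # comparing a meta-learner that pools the M meta-training tasks against an
    # i.i.d. learner that only sees the k novel-task samples. Everything here
    # (Gaussian prior, average-then-shrink estimator, constants) is illustrative.
    import numpy as np

    rng = np.random.default_rng(0)

    d, M, n, k = 10, 50, 20, 5   # dimension, training tasks, samples per task, shots
    tau2, sigma2 = 1.0, 0.25     # task-parameter variance, observation-noise variance

    theta_0 = rng.normal(size=d)  # environment mean (unknown to the learners)

    def sample_task(n_samples, theta):
        """Draw (X, y) from a linear regression task with parameter theta."""
        X = rng.normal(size=(n_samples, d))
        y = X @ theta + np.sqrt(sigma2) * rng.normal(size=n_samples)
        return X, y

    def ridge(X, y, lam, center=None):
        """Ridge regression shrunk toward `center` (zero vector by default)."""
        if center is None:
            center = np.zeros(X.shape[1])
        A = X.T @ X + lam * np.eye(X.shape[1])
        return np.linalg.solve(A, X.T @ y + lam * center)

    # Meta-training: fit each training task, then average the fits to estimate theta_0.
    fits = []
    for _ in range(M):
        theta_m = theta_0 + np.sqrt(tau2) * rng.normal(size=d)
        X_m, y_m = sample_task(n, theta_m)
        fits.append(ridge(X_m, y_m, lam=1e-3))
    theta_0_hat = np.mean(fits, axis=0)

    # Novel task: only k samples are observed.
    theta_novel = theta_0 + np.sqrt(tau2) * rng.normal(size=d)
    X_new, y_new = sample_task(k, theta_novel)

    # Meta-learner: shrink the novel-task fit toward the estimated environment mean;
    # lam = sigma2 / tau2 is the MAP shrinkage under this isotropic Gaussian hierarchy.
    theta_meta = ridge(X_new, y_new, lam=sigma2 / tau2, center=theta_0_hat)

    # i.i.d. learner: uses the k novel-task samples alone (small ridge for stability,
    # since k < d makes ordinary least squares ill-posed).
    theta_iid = ridge(X_new, y_new, lam=1e-3)

    print("meta-learner squared error :", np.sum((theta_meta - theta_novel) ** 2))
    print("i.i.d. learner squared error:", np.sum((theta_iid - theta_novel) ** 2))

Because k < d here, the plain i.i.d. learner is ill-posed on the novel task, so shrinking toward the estimated environment mean typically yields a much smaller parameter error; increasing k shrinks the i.i.d. learner's error, while adding tasks (larger M) mainly helps the meta-learner in the low-data regime, mirroring the observations above.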
Tables
  • Table 1: Summary of notation used in this manuscript
Related work
  • Baxter (2000) introduced a formulation for inductive bias learning in which the learner is embedded in an environment of multiple tasks. The learner must find a hypothesis space that enables good generalization, on average, across tasks within the environment, using finite samples. In our setting, the learner is not explicitly tasked with finding a reduced hypothesis space but instead learns using a two-stage approach, which matches the standard meta-learning paradigm (Vilalta & Drissi, 2002): in the first stage an inductive bias is extracted from the data, and in the second stage the learner estimates the target parameters using data from a novel task distribution. Further, we focus on bounding the minimax risk of meta-learners, under which an optimal learner achieves minimum error on the hardest learning problem in the environment. While the average-case risk of meta-learners is more commonly studied, recent work has turned attention towards the minimax setting (Kpotufe & Martinet, 2018; Hanneke & Kpotufe, 2019; 2020; Mousavi Kalan et al., 2020). The worst-case error in meta-learning is particularly important in safety-critical systems, for example in medical diagnosis.
References
  • Ron Amit and Ron Meir. Meta-learning by adjusting priors based on extended PAC-Bayes theory. arXiv preprint arXiv:1711.01244, 2017.
  • Marcin Andrychowicz, Misha Denil, Sergio Gomez Colmenarejo, Matthew W. Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems 29, pp. 3981–3989, 2016.
  • Jonathan Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.
  • Shai Ben-David and Reba Schuller Borbely. A notion of task relatedness yielding provable multiple-task learning guarantees. Machine Learning, 73(3):273–287, 2008.
  • Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.
  • Brian Bullins, Elad Hazan, Adam Kalai, and Roi Livni. Generalize across tasks: Efficient algorithms for linear representation learning. In Aurelien Garivier and Satyen Kale (eds.), Proceedings of the 30th International Conference on Algorithmic Learning Theory, volume 98 of Proceedings of Machine Learning Research, pp. 235–246, Chicago, Illinois, 22–24 Mar 2019. PMLR. URL http://proceedings.mlr.press/v98/bullins19a.html.
  • Tianshi Cao, Marc Law, and Sanja Fidler. A theoretical analysis of the number of shots in few-shot learning. arXiv preprint arXiv:1909.11722, 2019.
  • Thomas M Cover and Joy A Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
  • Giulia Denevi, Carlo Ciliberto, Riccardo Grazzi, and Massimiliano Pontil. Learning-to-learn stochastic gradient descent with biased regularization. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 1566–1575, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
  • Simon S Du, Wei Hu, Sham M Kakade, Jason D Lee, and Qi Lei. Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434, 2020.
  • Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, and Pieter Abbeel. RL^2: Fast reinforcement learning via slow reinforcement learning. CoRR, abs/1611.02779, 2016.
  • Robert Fano. Transmission of Information: A Statistical Theory of Communication. 1961.
  • Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pp. 1126–1135. JMLR.org, 2017.
  • Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, 2013.
  • Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes. arXiv preprint arXiv:1801.08930, 2018.
  • Steve Hanneke and Samory Kpotufe. On the value of target data in transfer learning. In Advances in Neural Information Processing Systems 32, pp. 9871–9881. Curran Associates, Inc., 2019.
  • Steve Hanneke and Samory Kpotufe. A no-free-lunch theorem for multitask learning. arXiv preprint arXiv:2006.15785, 2020.
  • Rong Jin, Shijun Wang, and Yang Zhou. Regularized distance metric learning: theory and algorithm. In Advances in Neural Information Processing Systems 22, pp. 862–870, 2009.
  • Rafail Z Khas'minskii. A lower bound on the risks of non-parametric estimates of densities in the uniform metric. Theory of Probability & Its Applications, 23(4):794–798, 1979.
  • Mikhail Khodak, Maria-Florina Balcan, and Ameet Talwalkar. Provable guarantees for gradient-based meta-learning. arXiv preprint arXiv:1902.10644, 2019.
  • Samory Kpotufe and Guillaume Martinet. Marginal singularity, and the benefits of labels in covariate-shift. In Sebastien Bubeck, Vianney Perchet, and Philippe Rigollet (eds.), Proceedings of the 31st Conference on Learning Theory, volume 75 of Proceedings of Machine Learning Research, pp. 1882–1886. PMLR, 06–09 Jul 2018.
  • Po-Ling Loh. On lower bounds for statistical learning theory. Entropy, 19(11):617, 2017.
  • Matthew MacKay, Paul Vicol, Jonathan Lorraine, David Duvenaud, and Roger B. Grosse. Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions. In 7th International Conference on Learning Representations, 2019.
  • Andreas Maurer. Transfer bounds for linear feature learning. Machine Learning, 75(3):327–350, 2009.
  • Nishant Mehta, Dongryeol Lee, and Alexander Gray. Minimax multi-task learning and a generalized loss-compositional paradigm for MTL. Advances in Neural Information Processing Systems, 25:2150–2158, 2012.
  • Luke Metz, Niru Maheswaranathan, Brian Cheung, and Jascha Sohl-Dickstein. Meta-learning update rules for unsupervised representation learning. In 7th International Conference on Learning Representations, 2019.
  • Mohammadreza Mousavi Kalan, Zalan Fabian, Salman Avestimehr, and Mahdi Soltanolkotabi. Minimax lower bounds for transfer learning with linear and one-hidden layer neural networks. Advances in Neural Information Processing Systems, 33, 2020.
  • Anastasia Pentina and Christoph Lampert. A PAC-Bayesian bound for lifelong learning. In International Conference on Machine Learning, pp. 991–999, 2014.
  • Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Transactions on Information Theory, 57(10):6976–6994, 2011.
  • Carl Edward Rasmussen and Hannes Nickisch. Gaussian processes for machine learning (GPML) toolbox. Journal of Machine Learning Research, 11:3011–3015, 2010.
  • Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2016.
  • Herbert Robbins. An empirical Bayes approach to statistics. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, pp. 157–163, Berkeley, Calif., 1956. University of California Press.
  • Nikunj Saunshi, Orestis Plevrakis, Sanjeev Arora, Mikhail Khodak, and Hrishikesh Khandeparkar. A theoretical analysis of contrastive unsupervised representation learning. In International Conference on Machine Learning, pp. 5628–5637, 2019.
  • Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087, 2017.
  • Ricardo Vilalta and Youssef Drissi. A perspective view and survey of meta-learning. Artificial Intelligence Review, 18(2):77–95, 2002.
  • Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638, 2016.
  • John Von Neumann. Some matrix-inequalities and metrization of matric space. 1937.
  • Boyu Wang, Hejia Zhang, Peng Liu, Zebang Shen, and Joelle Pineau. Multitask metric learning: Theory and algorithm. In Kamalika Chaudhuri and Masashi Sugiyama (eds.), volume 89 of Proceedings of Machine Learning Research, pp. 3362–3371. PMLR, 16–18 Apr 2019.
  • Yuhong Yang and Andrew Barron. Information-theoretic determination of minimax rates of convergence. Annals of Statistics, pp. 1564–1599, 1999.