On the Theory of Transfer Learning: The Importance of Task Diversity

NeurIPS 2020.

Abstract:

We provide new statistical guarantees for transfer learning via representation learning--when transfer is achieved by learning a feature representation shared across different tasks. This enables learning on new tasks using far less data than is required to learn them in isolation. Formally, we consider $t+1$ tasks parameterized by func...

Introduction
  • One of the most promising methods for multitask and transfer learning is founded on the belief that multiple, differing tasks are distinguished by a small number of task-specific parameters, but often share a common low-dimensional representation.
  • There is a theoretical literature on this question dating back at least to [Baxter, 2000]
  • Yet this progress belies a lack of understanding of the basic statistical principles underlying transfer learning: how many samples are needed to learn a feature representation shared across tasks and then to use it to improve prediction on a new task?
Highlights
  • Transfer learning is quickly becoming an essential tool to address learning problems in settings with small data
  • We formally study the composite learning model in which there are t + 1 tasks whose responses are generated noisily from the composite function f_j⋆ ∘ h⋆, where the f_j⋆ are task-specific functions in a function class F and h⋆ is an underlying shared representation in a function class H
  • There is a theoretical literature that dates back at least as far as [Baxter, 2000]. This progress belies a lack of understanding of the basic statistical principles underlying transfer learning: How many samples do we need to learn a feature representation shared across tasks and use it to improve prediction on a new task?
  • In this paper we study a simple two-stage empirical risk minimization procedure to learn a new, j = 0th task which shares a common representation with t different training tasks (a simulation sketch of this two-stage procedure is given after this list)
  • We introduce a problem-agnostic definition of task diversity which can be integrated into a uniform convergence framework to provide generalization bounds for transfer learning problems with general losses, tasks, and features
  • When the number of training tasks t and the number of samples per training task n are sufficiently large, but the number of samples m from the new task is relatively small, the performance of transfer learning is significantly better than the baseline of learning the new task in isolation
  • We present our central theoretical results for the transfer learning problem
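To make the setup concrete, the following is a minimal, self-contained simulation of the two-stage procedure described above, specialized to a linear-representation instance (h⋆(x) = B⋆ᵀx and f_j⋆(z) = α_jᵀz, with the squared loss). The training-phase solver (alternating least squares), all variable names, and the problem sizes are illustrative choices made here, not the paper's released code.

```python
# Sketch of two-stage ERM for transfer via a shared representation,
# specialized to a linear instance. Assumptions: linear h* and f_j*,
# Gaussian covariates, squared loss; the ALS training solver is a stand-in
# for a generic ERM oracle.
import numpy as np

rng = np.random.default_rng(0)
d, r, t, n, m = 50, 5, 20, 25, 10   # ambient dim, rep dim, tasks, per-task n, new-task m

# Ground-truth shared representation (orthonormal columns) and task heads.
B_star, _ = np.linalg.qr(rng.normal(size=(d, r)))
A_star = rng.normal(size=(t + 1, r))        # row 0: new task; rows 1..t: training tasks

def sample_task(j, n_samples):
    X = rng.normal(size=(n_samples, d))
    y = X @ B_star @ A_star[j] + 0.1 * rng.normal(size=n_samples)
    return X, y

train = [sample_task(j, n) for j in range(1, t + 1)]
X0, y0 = sample_task(0, m)                  # scarce data for the new task

# Phase 1 (training): jointly fit (B, a_1..a_t) on the t tasks by alternating
# least squares, as a simple surrogate for the joint ERM over F and H.
B = np.linalg.qr(rng.normal(size=(d, r)))[0]
for _ in range(50):
    # heads given representation: t independent r-dimensional regressions
    A = np.stack([np.linalg.lstsq(X @ B, y, rcond=None)[0] for X, y in train])
    # representation given heads: one stacked least-squares problem in vec(B)
    Z = np.vstack([np.kron(X, A[j][None, :]) for j, (X, _) in enumerate(train)])
    yy = np.concatenate([y for _, y in train])
    B = np.linalg.lstsq(Z, yy, rcond=None)[0].reshape(d, r)
    B = np.linalg.qr(B)[0]                  # re-orthonormalize for stability

# Phase 2 (test): freeze the learned representation, fit only the new task's head.
a0_transfer = np.linalg.lstsq(X0 @ B, y0, rcond=None)[0]

# Baseline: learn the new task in isolation from its m samples.
w_isolated = np.linalg.lstsq(X0, y0, rcond=None)[0]

Xtest, ytest = sample_task(0, 2000)
print("transfer  MSE:", np.mean((Xtest @ B @ a0_transfer - ytest) ** 2))
print("isolation MSE:", np.mean((Xtest @ w_isolated - ytest) ** 2))
```

Even this toy run typically exhibits the qualitative claim above: with many related training tasks and only a handful of samples from the new task, fitting a low-dimensional head on top of the learned representation beats fitting the new task from scratch.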
Results
  • The authors present the central theoretical results for the transfer learning problem.
  • The authors first present statistical guarantees for the training phase and test phase separately.
  • The authors present a problem-agnostic definition of task diversity, followed by the generic end-to-end transfer learning guarantee (summarized schematically after this list).
  • Throughout this section, the authors make standard, mild regularity assumptions on the loss function l(·, ·), the task-specific function class F, and the shared-representation function class H; these conditions are collected in Assumption 1 (regularity conditions) of the paper.
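For orientation, the end-to-end guarantee that combines these pieces can be summarized schematically as follows. This is an informal restatement under the regularity and diversity assumptions above: constants, confidence terms, and lower-order factors are omitted, and the Gaussian-complexity notation follows common usage rather than the paper's exact statement.

```latex
% Schematic form of the end-to-end transfer guarantee (informal).
% (\nu, \varepsilon) are the task-diversity parameters; nt is the total number
% of training-phase samples and m the number of test-phase (new-task) samples.
\[
\underbrace{\text{Transfer learning risk}}_{\text{new task } j = 0}
\;\lesssim\;
\frac{1}{\nu}\,
\underbrace{\hat{\mathfrak{G}}_{nt}(\mathcal{F} \circ \mathcal{H})}_{\text{training phase}}
\;+\;
\underbrace{\hat{\mathfrak{G}}_{m}(\mathcal{F})}_{\text{test phase}}
\;+\;\varepsilon .
\]
```

Task diversity enters through ν: the bound is only useful when the t training tasks collectively cover the directions of the shared representation that the new task depends on, so that ν is bounded away from zero.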
Conclusion
  • The authors present a framework for understanding the generalization abilities of generic models for the transfer learning problem.
  • One interesting direction for future consideration is investigating the effects of relaxing the common design and realizability assumptions on the results presented here
Summary
  • Objectives:

    The authors' goal is to bound the empirical Gaussian complexity of the set S = {f_j(h(x_ji)) : f_j ∈ F, h ∈ H} ⊆ R^{tn} (equivalently, of the corresponding composite function class); the standard definition of this complexity is recalled below.
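For reference, the empirical Gaussian complexity invoked in this objective is the standard one, stated here for a set S ⊆ R^{tn} of vectors with entries f_j(h(x_ji)); the notation follows common usage and may differ in minor details from the paper's.

```latex
% Standard empirical Gaussian complexity of a set S \subseteq R^{tn};
% the g_{ji} are i.i.d. standard normal random variables.
\[
\hat{\mathfrak{G}}(S)
= \mathbb{E}_{g}\!\left[\,\sup_{s \in S}\;
\frac{1}{nt} \sum_{j=1}^{t} \sum_{i=1}^{n} g_{ji}\, s_{ji}
\right],
\qquad g_{ji} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,1).
\]
```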
Related work
  • The utility of multitask learning methods was observed at least as far back as Caruana [1997]. In recent years, representation learning, transfer learning, and meta-learning have been the subject of extensive empirical investigation in the machine learning literature (see [Bengio et al., 2013] and [Hospedales et al., 2020] for surveys in these directions). However, theoretical work on transfer learning—particularly via representation learning—has been much more limited.

    A line of work closely related to transfer learning is gradient-based meta-learning (MAML) [Finn et al., 2017]. These methods have been analyzed using techniques from online convex optimization, using a (potentially data-dependent) notion of task similarity which assumes that tasks are close to a global task parameter [Finn et al., 2019, Khodak et al., 2019a, Denevi et al., 2019a,b, Khodak et al., 2019b]. However, this line of work does not study the question of transferring a common representation in the generic composite learning model that is our focus.
References
  • Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, and Michael Auli. Cloze-driven pretraining of self-attention networks. arXiv preprint arXiv:1903.07785, 2019.
  • Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.
  • Jonathan Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.
  • Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
  • Peter J. Bickel, Chris A. J. Klaassen, Ya'acov Ritov, and Jon A. Wellner. Efficient and Adaptive Estimation for Semiparametric Models, volume 4. Johns Hopkins University Press, Baltimore, 1993.
  • Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
  • Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
  • Giovanni Cavallanti, Nicolò Cesa-Bianchi, and Claudio Gentile. Linear algorithms for online multitask classification. Journal of Machine Learning Research, 11(Oct):2901–2934, 2010.
  • Giulia Denevi, Carlo Ciliberto, Riccardo Grazzi, and Massimiliano Pontil. Learning-to-learn stochastic gradient descent with biased regularization. arXiv preprint arXiv:1903.10399, 2019a.
  • Giulia Denevi, Dimitris Stamos, Carlo Ciliberto, and Massimiliano Pontil. Online-within-online meta-learning. In Advances in Neural Information Processing Systems, pages 13089–13099, 2019b.
  • Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, pages 647–655, 2014.
  • Simon S. Du, Wei Hu, Sham M. Kakade, Jason D. Lee, and Qi Lei. Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434, 2020.
  • Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, and Burkhard Rost. End-to-end multitask learning, from protein language to protein features without alignments. bioRxiv, 2020. doi: 10.1101/864405.
  • Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1126–1135. JMLR.org, 2017.
  • Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine. Online meta-learning. arXiv preprint arXiv:1902.08438, 2019.
  • Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. arXiv preprint arXiv:1712.06541, 2017.
  • Varun Gulshan, Lily Peng, Marc Coram, Martin C. Stumpe, Derek Wu, Arunachalam Narayanaswamy, Subhashini Venugopalan, Kasumi Widner, Tom Madams, Jorge Cuadros, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA, 316(22):2402–2410, 2016.
  • Timothy Hospedales, Antreas Antoniou, Paul Micaelli, and Amos Storkey. Meta-learning in neural networks: A survey. arXiv preprint arXiv:2004.05439, 2020.
  • Sham M. Kakade, Varun Kanade, Ohad Shamir, and Adam Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In Advances in Neural Information Processing Systems, pages 927–935, 2011.
  • Mikhail Khodak, Maria-Florina Balcan, and Ameet Talwalkar. Provable guarantees for gradient-based meta-learning. arXiv preprint arXiv:1902.10644, 2019a.
  • Mikhail Khodak, Maria-Florina Balcan, and Ameet S. Talwalkar. Adaptive gradient-based meta-learning methods. In Advances in Neural Information Processing Systems, pages 5915–5926, 2019b.
  • Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer Science & Business Media, 2013.
  • Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. Meta-learning with differentiable convex optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10657–10665, 2019.
  • Qi Li and Jeffrey Scott Racine. Nonparametric Econometrics: Theory and Practice. Princeton University Press, 2007.
  • Karim Lounici, Massimiliano Pontil, Sara van de Geer, and Alexandre B. Tsybakov. Oracle inequalities and optimal inference under group sparsity. The Annals of Statistics, 39(4):2164–2204, 2011.
  • Pascal Massart. About the constants in Talagrand's concentration inequalities for empirical processes. The Annals of Probability, 28(2):863–884, 2000.
  • Andreas Maurer. A chain rule for the expected suprema of Gaussian processes. Theoretical Computer Science, 650:109–122, 2016.
  • Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. The benefit of multitask representation learning. Journal of Machine Learning Research, 17(1):2853–2884, 2016.
  • Guillaume Obozinski, Martin J. Wainwright, and Michael I. Jordan. Support union recovery in high-dimensional multivariate regression. The Annals of Statistics, 39(1):1–47, 2011.
  • Jakub Otwinowski, David M. McCandlish, and Joshua B. Plotkin. Inferring the shape of global epistasis. Proceedings of the National Academy of Sciences, 115(32):E7550–E7558, 2018.
  • Massimiliano Pontil and Andreas Maurer. Excess risk bounds for multitask learning with trace norm regularization. In Conference on Learning Theory, pages 55–76, 2013.
  • Aniruddh Raghu, Maithra Raghu, Samy Bengio, and Oriol Vinyals. Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. arXiv preprint arXiv:1909.09157, 2019.
  • Nilesh Tripuraneni, Chi Jin, and Michael I. Jordan. Provable meta-learning of linear representations. arXiv preprint arXiv:2002.11684, 2020.
  • Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
  • Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.
  • Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328, 2014.