# On the Theory of Transfer Learning: The Importance of Task Diversity

NeurIPS 2020.

Abstract:

We provide new statistical guarantees for transfer learning via representation learning--when transfer is achieved by learning a feature representation shared across different tasks. This enables learning on new tasks using far less data than is required to learn them in isolation. Formally, we consider $t+1$ tasks parameterized by func…

Introduction

- One of the most promising methods for multitask and transfer learning is founded on the belief that multiple, differing tasks are distinguished by a small number of task-specific parameters, but often share a common low-dimensional representation.
- There is a theoretical literature that dates back at least as far as [Baxter, 2000]
- This progress belies a lack of understanding of the basic statistical principles underlying transfer learning: How many samples are needed to learn a feature representation shared across tasks and use it to improve prediction on a new task?

Highlights

- Transfer learning is quickly becoming an essential tool to address learning problems in settings with small data
- We formally study the composite learning model in which there are $t+1$ tasks whose responses are generated noisily from the function $f^\star_j \circ h^\star$, where $f^\star_j$ are task-specific parameters in a function class $\mathcal{F}$ and $h^\star$ an underlying shared representation in a function class $\mathcal{H}$
- There is a theoretical literature that dates back at least as far as [Baxter, 2000]. This progress belies a lack of understanding of the basic statistical principles underlying transfer learning: How many samples do we need to learn a feature representation shared across tasks and use it to improve prediction on a new task?
- In this paper we study a simple two-stage empirical risk minimization procedure to learn a new task ($j = 0$) which shares a common representation with $t$ different training tasks
- We introduce a problem-agnostic definition of task diversity which can be integrated into a uniform convergence framework to provide generalization bounds for transfer learning problems with general losses, tasks, and features
- When the per-task sample size $n$ and the number of tasks $t$ are sufficiently large, but the number of samples $m$ for the new task is relatively small, the performance of transfer learning is significantly better than the baseline of learning in isolation
- We present our central theoretical results for the transfer learning problem
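The two-stage procedure above can be made concrete in the linear setting. The sketch below is a hypothetical instantiation for illustration, not the authors' exact estimator: phase 1 learns a low-rank linear representation by stacking per-task least-squares solutions and taking an SVD; phase 2 fits only the new task's head on its $m$ samples, and is compared against learning the new task in isolation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, t, n, m = 20, 2, 10, 100, 5  # ambient dim, rep dim, tasks, samples/task, new-task samples

# Ground truth: shared representation h*(x) = B x and task-specific heads f_j*(z) = a_j^T z.
B_true, _ = np.linalg.qr(rng.standard_normal((d, r)))
B_true = B_true.T                                  # (r, d), orthonormal rows
alphas = rng.standard_normal((t + 1, r))           # task heads; index 0 is the new task

# Phase 1: estimate the representation from the t training tasks.
X = rng.standard_normal((t, n, d))
Y = np.einsum('tnd,rd,tr->tn', X, B_true, alphas[1:]) + 0.1 * rng.standard_normal((t, n))
W = np.stack([np.linalg.lstsq(X[j], Y[j], rcond=None)[0] for j in range(t)])
_, _, Vt = np.linalg.svd(W, full_matrices=False)   # stacked regressors are near rank r
B_hat = Vt[:r]                                     # estimated representation (r, d)

# Phase 2: fit only the new task's head on m samples, reusing the learned representation.
X0 = rng.standard_normal((m, d))
y0 = X0 @ B_true.T @ alphas[0] + 0.1 * rng.standard_normal(m)
a_hat = np.linalg.lstsq(X0 @ B_hat.T, y0, rcond=None)[0]   # r unknowns instead of d

# Baseline: learn the new task in isolation from the same m samples (m < d).
w_iso = np.linalg.lstsq(X0, y0, rcond=None)[0]

Xtest = rng.standard_normal((2000, d))
ytest = Xtest @ B_true.T @ alphas[0]
err_transfer = np.mean((Xtest @ B_hat.T @ a_hat - ytest) ** 2)
err_isolated = np.mean((Xtest @ w_iso - ytest) ** 2)
```

With $m = 5$ samples in $d = 20$ dimensions, the isolated baseline is badly underdetermined, while the transfer estimator only needs to fit $r = 2$ head parameters, matching the qualitative regime ($n, t$ large, $m$ small) described above.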

Results

- The authors present the central theoretical results for the transfer learning problem.
- The authors first present statistical guarantees for the training phase and test phase separately.
- The authors present a problem-agnostic definition of task diversity, followed by the generic end-to-end transfer learning guarantee.
- Throughout this section, the authors make the following standard, mild regularity assumptions on the loss function $\ell(\cdot, \cdot)$, the function class of tasks $\mathcal{F}$, and the function class of shared representations $\mathcal{H}$.
- Assumption 1 (Regularity conditions).
- The following regularity conditions hold:

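The problem-agnostic notion of task diversity mentioned above requires, roughly, that the training tasks jointly exercise every direction of the shared representation. In the linear setting one common formalization (assumed here for illustration, not quoted from the paper) scores diversity by the smallest singular value of the matrix of training heads:

```python
import numpy as np

def linear_task_diversity(A):
    """Diversity of t training heads stacked as rows of A, shape (t, r):
    the squared smallest singular value of A, normalized by t. It is zero
    exactly when the heads fail to span all r representation directions."""
    t = A.shape[0]
    s = np.linalg.svd(A, compute_uv=False)
    return (s[-1] ** 2) / t

rng = np.random.default_rng(1)
diverse = rng.standard_normal((10, 3))                       # generic heads span R^3
degenerate = np.tile(rng.standard_normal((1, 3)), (10, 1))   # all heads identical (rank 1)

div_good = linear_task_diversity(diverse)      # strictly positive
div_bad = linear_task_diversity(degenerate)    # numerically zero
```

Ten identical tasks carry no more information about the representation than one task, which is exactly what the zero diversity score of the rank-one example captures.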
Conclusion

- The authors present a framework for understanding the generalization abilities of generic models for the transfer learning problem.
- One interesting direction for future consideration is investigating the effects of relaxing the common design and realizability assumptions on the results presented here

Summary

## Introduction:

- One of the most promising methods for multitask and transfer learning is founded on the belief that multiple, differing tasks are distinguished by a small number of task-specific parameters, but often share a common low-dimensional representation.
- There is a theoretical literature that dates back at least as far as [Baxter, 2000].
- This progress belies a lack of understanding of the basic statistical principles underlying transfer learning: How many samples are needed to learn a feature representation shared across tasks and use it to improve prediction on a new task?
## Objectives:

The authors' goal is to bound the empirical Gaussian complexity of the set $\mathcal{S} = \{f_j(h(x_{ji})) : f_j \in \mathcal{F}, h \in \mathcal{H}\} \subseteq \mathbb{R}^{tn}$, viewed as a function class.

## Results:
The authors present the central theoretical results for the transfer learning problem.
- The authors first present statistical guarantees for the training phase and test phase separately.
- The authors present a problem-agnostic definition of task diversity, followed by the generic end-to-end transfer learning guarantee.
- Throughout this section, the authors make the following standard, mild regularity assumptions on the loss function $\ell(\cdot, \cdot)$, the function class of tasks $\mathcal{F}$, and the function class of shared representations $\mathcal{H}$.
- Assumption 1 (Regularity conditions).
- The following regularity conditions hold:
## Conclusion:

The authors present a framework for understanding the generalization abilities of generic models for the transfer learning problem.
- One interesting direction for future consideration is investigating the effects of relaxing the common design and realizability assumptions on the results presented here.
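The empirical Gaussian complexity that appears in the objective above can be approximated by Monte Carlo for any finite collection of functions evaluated on the sample. The estimator below is a generic illustration of that quantity, not the authors' bound: it averages, over Gaussian weight draws, the supremum of the weighted empirical means.

```python
import numpy as np

def empirical_gaussian_complexity(values, num_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Gaussian complexity of a finite
    function class. values[k, i] holds the k-th function evaluated at the
    i-th sample; returns E_g [ sup_k (1/N) * sum_i g_i * values[k, i] ]
    with g_i i.i.d. standard normal."""
    rng = np.random.default_rng(seed)
    K, N = values.shape
    g = rng.standard_normal((num_draws, N))     # one Gaussian weight vector per draw
    sups = (g @ values.T / N).max(axis=1)       # supremum over the K functions
    return sups.mean()

rng = np.random.default_rng(2)
single = rng.standard_normal((1, 50))           # singleton class: complexity near zero
rich = rng.standard_normal((100, 50))           # richer class: strictly larger complexity

gc_single = empirical_gaussian_complexity(single)
gc_rich = empirical_gaussian_complexity(rich)
```

A singleton class has complexity converging to zero (the weighted mean of a single fixed function is zero-mean), while a class of 100 generic functions has strictly positive complexity; the transfer bounds in the paper hinge on controlling this quantity for the composite class.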

Related work

- The utility of multitask learning methods was observed at least as far back as Caruana [1997]. In recent years, representation learning, transfer learning, and meta-learning have been the subject of extensive empirical investigation in the machine learning literature (see [Bengio et al., 2013] and [Hospedales et al., 2020] for surveys in these directions). However, theoretical work on transfer learning—particularly via representation learning—has been much more limited.

A line of work closely related to transfer learning is gradient-based meta-learning, exemplified by MAML [Finn et al., 2017]. These methods have been analyzed using techniques from online convex optimization, using a (potentially data-dependent) notion of task similarity which assumes that tasks are close to a global task parameter [Finn et al., 2019, Khodak et al., 2019a,b, Denevi et al., 2019a,b]. However, this line of work does not study the question of transferring a common representation in the generic composite learning model that is our focus.

Reference

- Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, and Michael Auli. Cloze-driven pretraining of selfattention networks. arXiv preprint arXiv:1903.07785, 2019.
- Peter L Bartlett, Olivier Bousquet, Shahar Mendelson, et al. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.
- Jonathan Baxter. A model of inductive bias learning. Journal of artificial intelligence research, 12:149–198, 2000.
- Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
- Peter J Bickel, Chris AJ Klaassen, Ya'acov Ritov, and Jon A Wellner. Efficient and adaptive estimation for semiparametric models, volume 4. Johns Hopkins University Press, Baltimore, 1993.
- Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
- Rich Caruana. Multitask learning. Machine learning, 28(1):41–75, 1997.
- Giovanni Cavallanti, Nicolo Cesa-Bianchi, and Claudio Gentile. Linear algorithms for online multitask classification. Journal of Machine Learning Research, 11(Oct):2901–2934, 2010.
- Giulia Denevi, Carlo Ciliberto, Riccardo Grazzi, and Massimiliano Pontil. Learning-to-learn stochastic gradient descent with biased regularization. arXiv preprint arXiv:1903.10399, 2019a.
- Giulia Denevi, Dimitris Stamos, Carlo Ciliberto, and Massimiliano Pontil. Online-within-online meta-learning. In Advances in Neural Information Processing Systems, pages 13089–13099, 2019b.
- Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International conference on machine learning, pages 647–655, 2014.
- Simon S Du, Wei Hu, Sham M Kakade, Jason D Lee, and Qi Lei. Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434, 2020.
- Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, and Burkhard Rost. End-to-end multitask learning, from protein language to protein features without alignments. bioRxiv, 2020. doi: 10.1101/864405.
- Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1126–1135. JMLR. org, 2017.
- Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine. Online meta-learning. arXiv preprint arXiv:1902.08438, 2019.
- Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. arXiv preprint arXiv:1712.06541, 2017.
- Varun Gulshan, Lily Peng, Marc Coram, Martin C Stumpe, Derek Wu, Arunachalam Narayanaswamy, Subhashini Venugopalan, Kasumi Widner, Tom Madams, Jorge Cuadros, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. Jama, 316(22):2402–2410, 2016.
- Timothy Hospedales, Antreas Antoniou, Paul Micaelli, and Amos Storkey. Meta-learning in neural networks: A survey. arXiv preprint arXiv:2004.05439, 2020.
- Sham M Kakade, Varun Kanade, Ohad Shamir, and Adam Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In Advances in Neural Information Processing Systems, pages 927–935, 2011.
- Mikhail Khodak, Maria-Florina Balcan, and Ameet Talwalkar. Provable guarantees for gradient-based meta-learning. arXiv preprint arXiv:1902.10644, 2019a.
- Mikhail Khodak, Maria-Florina F Balcan, and Ameet S Talwalkar. Adaptive gradient-based meta-learning methods. In Advances in Neural Information Processing Systems, pages 5915–5926, 2019b.
- Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperimetry and processes. Springer Science & Business Media, 2013.
- Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. Meta-learning with differentiable convex optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10657– 10665, 2019.
- Qi Li and Jeffrey Scott Racine. Nonparametric econometrics: theory and practice. Princeton University Press, 2007.
- Karim Lounici, Massimiliano Pontil, Sara Van De Geer, Alexandre B Tsybakov, et al. Oracle inequalities and optimal inference under group sparsity. The annals of statistics, 39(4):2164–2204, 2011.
- Pascal Massart et al. About the constants in Talagrand's concentration inequalities for empirical processes. The Annals of Probability, 28(2):863–884, 2000.
- Andreas Maurer. A chain rule for the expected suprema of gaussian processes. Theoretical Computer Science, 650: 109–122, 2016.
- Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1):2853–2884, 2016.
- Guillaume Obozinski, Martin J Wainwright, Michael I Jordan, et al. Support union recovery in high-dimensional multivariate regression. The Annals of Statistics, 39(1):1–47, 2011.
- Jakub Otwinowski, David M McCandlish, and Joshua B Plotkin. Inferring the shape of global epistasis. Proceedings of the National Academy of Sciences, 115(32):E7550–E7558, 2018.
- Massimiliano Pontil and Andreas Maurer. Excess risk bounds for multitask learning with trace norm regularization. In Conference on Learning Theory, pages 55–76, 2013.
- Aniruddh Raghu, Maithra Raghu, Samy Bengio, and Oriol Vinyals. Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. arXiv preprint arXiv:1909.09157, 2019.
- Nilesh Tripuraneni, Chi Jin, and Michael I Jordan. Provable meta-learning of linear representations. arXiv preprint arXiv:2002.11684, 2020.
- Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
- Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.
- Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in neural information processing systems, pages 3320–3328, 2014.
