# Meta-Learning without Memorization

ICLR, 2020.

Abstract:

The ability to learn new concepts with small amounts of data is a critical aspect of intelligence that has proven challenging for deep learning methods. Meta-learning has emerged as a promising technique for leveraging data from previous tasks to enable efficient learning of new tasks. However, most meta-learning algorithms implicitly require that the meta-training tasks be mutually exclusive, such that no single model can solve all of the tasks at once.

Code:

https://github.com/google-research/google-research/tree/master/meta_learning_without_memorization

Introduction

- The ability to learn new concepts and skills with small amounts of data is a critical aspect of intelligence that many machine learning systems lack.
- The meta-learner is trained such that, after being presented with a small task training set, it can accurately make predictions on test datapoints for that meta-training task.
- If the task training data is not needed to solve the meta-training tasks, the model will collapse to one that makes zero-shot decisions, ignoring the task training set
- This presents an opportunity for overfitting where the meta-learner generalizes on meta-training tasks, but fails to adapt when presented with training data from novel tasks.
- The authors call this form of overfitting the memorization problem in meta-learning because the meta-learner memorizes a function that solves all of the meta-training tasks, rather than learning to adapt
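To make this failure mode concrete, here is a minimal numpy sketch (function and variable names are illustrative, not from the paper's code) of the non-mutually-exclusive sinusoid setup: appending the amplitude, i.e. the task identity, to the input lets a single zero-shot function solve every task without touching the task training data:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_nme_sinusoid_task(n_points=10):
    """Non-mutually-exclusive sinusoid task: the amplitude A (the task
    identity) is appended to the input, so one fixed function can solve
    every task without reading the task training set."""
    A = rng.uniform(0.1, 5.0)
    x = rng.uniform(-5.0, 5.0, size=n_points)
    inputs = np.stack([x, np.full(n_points, A)], axis=1)  # columns: (x, A)
    targets = A * np.sin(x)
    return inputs, targets

def zero_shot_predictor(inputs):
    """A 'complete memorization' predictor: ignores the task training
    data D entirely and maps the augmented input straight to the label."""
    x, A = inputs[:, 0], inputs[:, 1]
    return A * np.sin(x)

inputs, targets = sample_nme_sinusoid_task()
mse = np.mean((zero_shot_predictor(inputs) - targets) ** 2)
print(mse)  # 0.0 — the task training data was never needed
```

Because the task identity is recoverable from the input alone, nothing forces the meta-learner to use the task training set; this is exactly the collapse described above.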

Highlights

- The ability to learn new concepts and skills with small amounts of data is a critical aspect of intelligence that many machine learning systems lack
- We show that meta-regularization in model-agnostic meta-learning can be rigorously motivated by a PAC-Bayes bound on generalization
- We find that model-agnostic meta-learning and conditional neural processes frequently converge to this memorization solution (Table 2)
- We consider model-agnostic meta-learning (MAML) and conditional neural processes (CNP) as representative meta-learning algorithms. We study both variants of our method in combination with model-agnostic meta-learning and conditional neural processes
- Once we add the additional amplitude input which indicates the task identity, we find that both model-agnostic meta-learning and conditional neural processes converge to the complete memorization solution and fail to generalize well to test data (Table 1 and Appendix Figures 7 and 8)
- We evaluate model-agnostic meta-learning, TAML (Jamal & Qi, 2019), MR-model-agnostic meta-learning, fine-tuning, and a nearest neighbor baseline on non-mutually-exclusive classification tasks (Table 4)

Methods

- Methods compared: MAML, MR-MAML (A), MR-MAML (W), CNP, MR-CNP (A), and MR-CNP (W), where (A) and (W) denote meta-regularization on the activations and on the weights, respectively.
- From the flattened results table (Table 1), only two cells are recoverable: 5-shot MSE 0.46 (0.04) and 10-shot MSE 0.13 (0.01).

6.2 Pose Prediction

To illustrate the memorization problem on a more realistic task, the authors create a multi-task regression dataset based on the Pascal 3D data (Xiang et al., 2014); see Appendix A.5.1 for a complete description.

- Because the number of objects in the meta-training dataset is small, it is straightforward for a single network to memorize the canonical pose of each training object and to infer the orientation from the input image, achieving low meta-training error without using the task training data D.
- The high pre-update meta-training accuracy and low meta-test accuracy are evidence of the memorization problem for MAML and TAML, indicating that these methods learn a model that ignores the task training data.
- MR-MAML successfully controls the pre-update accuracy to be near chance and encourages the learner to use the task training data to achieve low meta-training error, resulting in good performance at meta-test time
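MR-MAML with weight regularization penalizes the information stored in the meta-weights by placing a Gaussian variational distribution over them and adding a β-weighted KL term to the outer-loop objective. Below is a minimal numpy sketch of that penalty, assuming a standard-normal prior and a diagonal Gaussian; the names and the simple additive form are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def kl_diag_gaussian_to_std_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def mr_outer_loss(task_losses, mu, log_var, beta=1e-3):
    """Meta-regularized outer objective: average post-adaptation task loss
    plus a beta-weighted KL that charges the meta-learner for information
    stored in the (Gaussian-distributed) meta-weights."""
    return np.mean(task_losses) + beta * kl_diag_gaussian_to_std_normal(mu, log_var)

mu = np.zeros(4)
log_var = np.zeros(4)  # sigma^2 = 1, i.e. the prior itself
print(kl_diag_gaussian_to_std_normal(mu, log_var))  # 0.0
```

At μ = 0, log σ² = 0 the penalty vanishes, so the regularizer only charges for weights that deviate from the prior, i.e., for task-independent information memorized in the meta-parameters.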

Conclusion

Meta-learning has achieved remarkable success in few-shot learning problems.

- The key idea is that by placing a soft restriction on the information flow from the meta-parameters into the prediction of test-set labels, the authors can encourage the meta-learner to use the task training data during meta-training.
- The authors achieve this by controlling the complexity of the model prior to task adaptation.

Tables

- Table 1: Test MSE for the non-mutually-exclusive sinusoid regression problem. We compare MAML and CNP against meta-regularized MAML (MR-MAML) and meta-regularized CNP (MR-CNP), where regularization is either on the activations (A) or the weights (W). We report the mean over 5 trials and the standard deviation in parentheses
- Table 2: Meta-test MSE for the pose prediction problem. We compare MR-MAML (ours) with conventional MAML and fine-tuning (FT). We report the average over 5 trials and the standard deviation in parentheses
- Table 3: Meta-test MSE for the pose prediction problem. We compare MR-CNP (ours) with conventional CNP, CNP with weight decay, and CNP with Bayes-by-Backprop (BbB) regularization on all the weights. We report the average over 5 trials and the standard deviation in parentheses
- Table 4: Meta-test accuracy on non-mutually-exclusive (NME) classification. The fine-tuning and nearest-neighbor baseline results for MiniImagenet are from Ravi & Larochelle (2016)
- Table 5: Meta-training pre-update accuracy on non-mutually-exclusive classification. MR-MAML controls the meta-training pre-update accuracy to be close to random guessing and achieves low training error after adaptation

Related work

- Previous works have developed approaches for mitigating various forms of overfitting in meta-learning. These approaches aim to improve generalization in several ways: by reducing the number of parameters that are adapted in MAML (Zintgraf et al., 2019), by compressing the task embedding (Lee et al., 2019), through data augmentation from a GAN (Zhang et al., 2018), by using an auxiliary objective on task gradients (Guiroy et al., 2019), and via an entropy regularization objective (Jamal & Qi, 2019). These methods all focus on the setting with mutually-exclusive task distributions. We instead recognize and formalize the memorization problem, a particular form of overfitting that manifests itself with non-mutually-exclusive tasks, and offer a general and principled solution. Unlike prior methods, our approach is applicable to both contextual and gradient-based meta-learning methods. We additionally validate that prior regularization approaches, namely TAML (Jamal & Qi, 2019), are not effective for addressing this problem setting.

Funding

- Zhou acknowledges the support of the U.S. National Science Foundation under Grant IIS-1812699

Reference

- Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research, 19(1):1947–1980, 2018.
- Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.
- Ron Amit and Ron Meir. Meta-learning by adjusting priors based on extended PAC-Bayes theory. In International Conference on Machine Learning, pp. 205–214, 2018.
- Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
- Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
- Harrison Edwards and Amos Storkey. Towards a neural statistician. arXiv preprint arXiv:1606.02185, 2016.
- Li Fei-Fei et al. A Bayesian approach to unsupervised one-shot learning of object categories. In Proceedings Ninth IEEE International Conference on Computer Vision, pp. 1134–1141. IEEE, 2003.
- Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1126–1135. JMLR.org, 2017.
- Chelsea Finn, Kelvin Xu, and Sergey Levine. Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems, pp. 9516–9527, 2018.
- Tomer Galanti, Lior Wolf, and Tamir Hazan. A theoretical framework for deep transfer learning. Information and Inference: A Journal of the IMA, 5(2):159–209, 2016.
- Marta Garnelo, Dan Rosenbaum, Chris J Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo J Rezende, and SM Eslami. Conditional neural processes. arXiv preprint arXiv:1807.01613, 2018a.
- Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J Rezende, SM Eslami, and Yee Whye Teh. Neural processes. arXiv preprint, 2018b.
- Jonathan Gordon, John Bronskill, Matthias Bauer, Sebastian Nowozin, and Richard E Turner. Metalearning probabilistic inference for prediction. arXiv preprint arXiv:1805.09921, 2018.
- Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes. arXiv preprint arXiv:1801.08930, 2018.
- Simon Guiroy, Vikas Verma, and Christopher Pal. Towards understanding generalization in gradient-based meta-learning. arXiv preprint arXiv:1907.07287, 2019.
- James Harrison, Apoorva Sharma, and Marco Pavone. Meta-learning priors for efficient online Bayesian regression. arXiv preprint arXiv:1807.08912, 2018.
- Muhammad Abdullah Jamal and Guo-Jun Qi. Task agnostic meta-learning for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11719– 11727, 2019.
- Taesup Kim, Jaesik Yoon, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn. Bayesian model-agnostic meta-learning. arXiv preprint arXiv:1806.03836, 2018.
- Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, volume 2, 2015.
- Anders Krogh and John A Hertz. A simple weight decay can improve generalization. In Advances in neural information processing systems, pp. 950–957, 1992.
- Brenden Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua Tenenbaum. One shot learning of simple visual concepts. In Proceedings of the annual meeting of the cognitive science society, volume 33, 2011.
- Yoonho Lee, Wonjae Kim, and Seungjin Choi. Discrete infomax codes for meta-learning. arXiv preprint arXiv:1905.11656, 2019.
- Anusha Nagabandi, Ignasi Clavera, Simin Liu, Ronald S Fearing, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. arXiv preprint arXiv:1803.11347, 2018.
- Anastasia Pentina and Christoph Lampert. A PAC-Bayesian bound for lifelong learning. In International Conference on Machine Learning, pp. 991–999, 2014.
- Kate Rakelly, Aurick Zhou, Deirdre Quillen, Chelsea Finn, and Sergey Levine. Efficient off-policy meta-reinforcement learning via probabilistic context variables. arXiv preprint arXiv:1903.08254, 2019.
- Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In ICLR 2016, 2016.
- Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pp. 1842–1850, 2016.
- Jurgen Schmidhuber. Evolutionary principles in self-referential learning. On learning how to learn: The meta-meta-... hook.) Diploma thesis, Institut f. Informatik, Tech. Univ. Munich, 1:2, 1987.
- Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
- Joshua Brett Tenenbaum. A Bayesian framework for concept learning. PhD thesis, Massachusetts Institute of Technology, 1999.
- Sebastian Thrun and Lorien Pratt. Learning to learn. Springer Science & Business Media, 2012.
- Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pp. 1–5. IEEE, 2015.
- Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.
- Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE, 2012.
- Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in neural information processing systems, pp. 3630–3638, 2016.
- Yu Xiang, Roozbeh Mottaghi, and Silvio Savarese. Beyond pascal: A benchmark for 3d object detection in the wild. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2014.
- Ruixiang Zhang, Tong Che, Zoubin Ghahramani, Yoshua Bengio, and Yangqiu Song. MetaGAN: An adversarial approach to few-shot learning. In Advances in Neural Information Processing Systems, pp. 2365–2374, 2018.
- Luisa M Zintgraf, Kyriacos Shiarlis, Vitaly Kurin, Katja Hofmann, and Shimon Whiteson. Fast context adaptation via meta-learning. In Thirty-sixth International Conference on Machine Learning (ICML 2019), 2019.
- Similar to Achille & Soatto (2018), we use ξ to denote the unknown parameters of the true data-generating distribution. This defines a joint distribution p(ξ, M, θ) = p(ξ) p(M | ξ) q(θ | M). Furthermore, we have a predictive distribution q(y* | x*, D, θ) = E_{φ|θ,D}[q(y* | x*, φ, θ)].
- The meta-training loss in Eq. 1 is an upper bound for the cross entropy H_{p,q}(y*_{1:N} | x*_{1:N}, D_{1:N}, θ). Using an information decomposition of the cross entropy (Achille & Soatto, 2018), we have
- where for exposition we assume K = |D*_i| is the same for all i. We would like to relate er(Q) and er(Q, D_1, D*_1, ..., D_n, D*_n), but the challenge is that Q may depend on D_1, D*_1, ..., D_n, D*_n through the learning algorithm. There are two sources of generalization error: (i) error due to the finite number of observed tasks, and (ii) error due to the finite number of examples observed per task. Closely following the arguments in Amit & Meir (2018), we apply a standard PAC-Bayes bound to each of these and combine the results with a union bound.
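For reference, the single-level bound being instantiated at each step is the standard McAllester-style PAC-Bayes bound, stated here generically for n i.i.d. samples, a [0, 1]-bounded loss, a prior P fixed before seeing the data, and any posterior Q (this is the textbook form, not the paper's exact theorem):

```latex
% With probability at least 1 - \delta over the draw of the n samples,
% simultaneously for all posteriors Q:
\mathrm{er}(Q) \;\le\; \widehat{\mathrm{er}}(Q)
  \;+\; \sqrt{\frac{D_{\mathrm{KL}}(Q \,\|\, P) + \ln\frac{n}{\delta}}{2(n-1)}}
```

Applying a bound of this form once over the n tasks and once over the K examples per task, and combining the two with a union bound, gives a guarantee covering both sources of generalization error.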
- ρ(φ) = ∫ Q(θ) q(φ | θ, D_i) dθ for any Q. While π and ρ may be complicated distributions (especially if they are defined implicitly), we know that with this choice of π and ρ, D_KL(ρ ‖ π) ≤ D_KL(Q ‖ P) (Cover & Thomas, 2012); hence, we have
- We create a multi-task regression dataset based on the Pascal 3D data (Xiang et al., 2014). The dataset consists of 10 classes of 3D object such as “aeroplane”, “sofa”, “TV monitor”, etc. Each class has multiple different objects and there are 65 objects in total. We randomly select 50 objects for meta-training and the other 15 objects for meta-testing. For each object, we use MuJoCo (Todorov et al., 2012) to render 100 images with random orientations of the instance on a table, visualized in Figure 1. For the meta-learning algorithm, the observation (x) is the 128 × 128 gray-scale image and the label (y) is the orientation re-scaled to be within [0, 10]. For each task, we randomly sample
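The only label construction fully specified above is the re-scaling of the orientation into [0, 10]. A small sketch, assuming the orientation is an angle in radians (the parameterization and function name are assumptions for illustration):

```python
import math

def orientation_to_label(theta_rad):
    """Re-scale an orientation angle in [0, 2*pi) to the label range [0, 10],
    as described for the Pascal 3D pose-prediction dataset."""
    return 10.0 * ((theta_rad % (2 * math.pi)) / (2 * math.pi))

print(orientation_to_label(math.pi))  # 5.0: a half turn maps to the middle of the range
```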
