# Meta-learning with differentiable closed-form solvers

International Conference on Learning Representations (ICLR), 2019


Abstract

Adapting deep networks to new concepts from few examples is extremely challenging, due to the high computational and data requirements of standard fine-tuning procedures. Most works on meta-learning and few-shot learning have thus focused on simple learning techniques for adaptation, such as nearest neighbors or gradient descent. Nonetheless …


Introduction

- Humans can efficiently perform fast mapping (Carey, 1978; Carey & Bartlett, 1978), i.e. learning a new concept after a single exposure.
- By contrast, supervised learning algorithms — and neural networks in particular — typically need to be trained using a vast amount of data in order to generalize well.
- This requirement is problematic, as the availability of large labelled datasets cannot always be taken for granted.
- The setting in which just one or a handful of training examples is provided is referred to as one-shot or few-shot learning (Miller et al., 2000; Fei-Fei et al., 2006; Lake et al., 2015; Hariharan & Girshick, 2017), and it has recently seen a tremendous surge of interest in the machine learning community (e.g. Vinyals et al. (2016); Bertinetto et al. (2016); Ravi & Larochelle (2017); Finn et al. (2017)).

Highlights

- The base learner works at the level of individual episodes, which correspond to learning problems characterised by having only a small set of labelled training images available
- We propose to adopt simple learning algorithms that admit a closed-form solution such as ridge regression
- With the aim of allowing efficient adaptation to unseen learning problems, in this paper we explored the feasibility of incorporating fast solvers with closed-form solutions as the base learning component of a meta-learning system
- R2-D2, the differentiable ridge regression base learner we introduce, is almost as fast as prototypical networks and strikes a useful compromise between not performing adaptation for new episodes and conducting a costly iterative approach
- Our proposed method achieves an average accuracy that, on miniImageNet and CIFAR-FS, is superior to the state of the art with shallow architectures
- We showed that our base learners work remarkably well, with excellent results on few-shot learning benchmarks, generalizing to episodes with new classes that were not seen during training
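The appeal of ridge regression as a base learner is that its solution is available in closed form, so per-episode adaptation costs a single linear solve rather than an iterative optimization. A minimal NumPy sketch of this closed form on a toy episode (the shapes and data here are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def ridge_solve(X, Y, lam=1.0):
    """Closed-form ridge regression: W = (X^T X + lam*I)^{-1} X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Toy support set: 6 samples, 4-dim features, 3 classes as one-hot targets.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
Y = np.eye(3)[np.array([0, 0, 1, 1, 2, 2])]

W = ridge_solve(X, Y)
print(W.shape)  # one weight vector per class: (4, 3)
```

Because the solution is an explicit function of `X` and `Y`, gradients can flow through it, which is what makes the solver usable as a differentiable layer during meta-training.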

Methods

- 3.1 Meta-learning

According to widely accepted definitions of learning (Mitchell, 1980) and meta-learning (Vilalta & Drissi, 2002; Vinyals et al., 2016), an algorithm is "learning to learn" if it can improve its learning skills with the number of experienced episodes.
- The meta-learner learns from several such episodes in sequence, with the goal of improving the performance of the base learner across episodes.
- In order to ensure a fair comparison, the authors increased the capacity of the architectures of three representative methods (MAML, prototypical networks and GNN) to match ours.
- The results of these experiments are marked with a ∗ in Table 1.
- The authors also report results for experiments on R2-D2 in which a 64-channel embedding is used.
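The two nested levels of meta-learning can be sketched as follows. This toy NumPy version solves the ridge problem on each episode's support set and evaluates on the query set; the episode sampler and all shapes are illustrative assumptions, and the sketch omits the back-propagation through the solver that the actual meta-learner performs to train the shared feature extractor:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_episode(n_way=3, k_shot=2, n_query=4, d=8):
    # Hypothetical synthetic episode: each class is a Gaussian cluster.
    means = 3.0 * rng.standard_normal((n_way, d))
    def draw(k):
        X = np.vstack([means[c] + rng.standard_normal((k, d)) for c in range(n_way)])
        y = np.repeat(np.arange(n_way), k)
        return X, y
    (Xs, ys), (Xq, yq) = draw(k_shot), draw(n_query)
    return Xs, ys, Xq, yq

def base_learner(Xs, ys, n_way, lam=1.0):
    # Base level: closed-form ridge regression on the support set.
    Y = np.eye(n_way)[ys]
    return np.linalg.solve(Xs.T @ Xs + lam * np.eye(Xs.shape[1]), Xs.T @ Y)

# Meta level: iterate over many episodes. The paper additionally
# back-propagates the query loss through the solver to improve the
# features and hyper-parameters across episodes; here we only evaluate.
accs = []
for _ in range(50):
    Xs, ys, Xq, yq = sample_episode()
    W = base_learner(Xs, ys, n_way=3)            # per-episode adaptation
    accs.append(((Xq @ W).argmax(axis=1) == yq).mean())
print(round(float(np.mean(accs)), 2))
```

The inner call (`base_learner`) corresponds to one episode's learning problem; the outer loop is where, in the full method, meta-knowledge accumulates across episodes.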

Results

- To produce the features X for the base learners, the authors, like many recent methods, use a shallow network of four convolutional "blocks", each consisting of the following sequence: a 3×3 convolution, batch normalization, 2×2 max-pooling, and a leaky ReLU with negative slope 0.1.
- Dropout is applied to the last two blocks for the experiments on miniImageNet and CIFAR-FS, respectively with probabilities 0.1 and 0.4.
- The authors flatten and concatenate the output of the third and fourth convolutional blocks and feed it to the base learner.
- The authors obtain high-dimensional features of size 3584, 72576 and 8064 for Omniglot, miniImageNet and CIFAR-FS respectively.
- By applying the Woodbury identity, the authors obtain significant gains in computation: in Eq. 5 they invert a matrix that is only 5×5 instead of 72576×72576.
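The point of the Woodbury identity here is that in the few-shot regime the number of samples n is tiny while the feature dimension d is huge, so the ridge solution can be rewritten to invert an n×n matrix instead of a d×d one. A small NumPy check (with d scaled down to 2000 so the naive form still runs) that both forms give the same weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 5, 2000, 0.5            # few samples, high-dimensional features
X = rng.standard_normal((n, d))
Y = np.eye(n)                       # one-hot targets for n samples

# Naive form inverts a d x d matrix ...
W_naive = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# ... while the Woodbury form only inverts an n x n (here 5 x 5) matrix:
#   (X^T X + lam*I_d)^{-1} X^T  =  X^T (X X^T + lam*I_n)^{-1}
W_wood = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), Y)

print(np.allclose(W_naive, W_wood))  # True: identical solutions
```

With the paper's actual sizes (n = 5 support samples, d = 72576 features), the saving is the difference between a 5×5 and a 72576×72576 inversion.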

Conclusion

- With the aim of allowing efficient adaptation to unseen learning problems, in this paper the authors explored the feasibility of incorporating fast solvers with closed-form solutions as the base learning component of a meta-learning system.
- R2-D2, the differentiable ridge regression base learner the authors introduce, is almost as fast as prototypical networks and strikes a useful compromise between not performing adaptation for new episodes and conducting a costly iterative approach.
- The authors showed that the base learners work remarkably well, with excellent results on few-shot learning benchmarks, generalizing to episodes with new classes that were not seen during training.
- The authors would like to explore Newton's methods with more complicated second-order structure than ridge regression.

- Table 1: Few-shot multi-class classification accuracies on miniImageNet and CIFAR-FS
- Table 2: Few-shot multi-class classification accuracies on Omniglot
- Table 3: Few-shot binary classification accuracies on miniImageNet and CIFAR-FS
- Table 4: Time required to solve 10,000 miniImageNet episodes of 10 samples each

Related work

- The topic of meta-learning gained importance in the machine learning community several decades ago, with the first examples already appearing in the eighties and early nineties (Utgoff, 1986; Schmidhuber, 1987; Naik & Mammone, 1992; Bengio et al., 1992; Thrun & Pratt, 1998). Utgoff (1986) proposed a framework describing when and how it is useful to dynamically adjust the inductive bias of a learning algorithm, thus implicitly "changing the ordering" of the elements of its hypothesis space (Vilalta & Drissi, 2002). Later, Bengio et al. (1992) interpreted the update rule of a neural network's weights as a function that is learnable. Another seminal work is the one of Thrun (1996), which presents the so-called lifelong learning scenario, where a learning algorithm gradually encounters an ordered sequence of learning problems. Throughout this course, the learner can benefit from re-using the knowledge accumulated during previous tasks. In later work, Thrun & Pratt (1998) stated that an algorithm is learning to learn if "[...] its performance at each task improves with experience and with the number of tasks". This characterisation was inspired by Mitchell et al. (1997)'s definition of a learning algorithm as a computer program whose performance on a task improves with experience. Similarly, Vilalta & Drissi (2002) explained meta-learning as organised in two "nested learning levels". At the base level, an algorithm is confined within a limited hypothesis space while solving a single learning problem. Contrarily, the meta-level can "accrue knowledge" by spanning multiple problems, so that the hypothesis space at the base level can be adapted effectively.

Funding

- This work was partially supported by the ERC grant 638009-IDIU.

Study subjects and analysis

Samples: 10

Figure: Diagram of the proposed method for one episode, of which several are seen during meta-training. The task is to learn new classes given just a few sample images per class. In this illustrative example, there are 3 classes and 2 samples per class, making each episode a 3-way, 2-shot classification problem. At the base learning level, learning is accomplished by a differentiable ridge regression layer (R.R.), which computes episode-specific weights (referred to as wE in Section 3.1 and as W in Section 3.2). At the meta-training level, by back-propagating errors through many of these small learning problems, we train a network whose weights are shared across episodes, together with the hyper-parameters of the R.R. layer. In this way, the R.R. base learner can improve its learning capabilities as the number of experienced episodes increases.

References

- Han Altae-Tran, Bharath Ramsundar, Aneesh S Pappu, and Vijay Pande. Low data drug discovery with one-shot learning. ACS central science, 2017.
- Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, 2016.
- Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, pp. 6–8. Univ. of Texas, 1992.
- Yoshua Bengio. Gradient-based optimization of hyperparameters. Neural computation, 2000.
- Luca Bertinetto, João F Henriques, Jack Valmadre, Philip Torr, and Andrea Vedaldi. Learning feed-forward one-shot learners. In Advances in Neural Information Processing Systems, 2016.
- Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006.
- Jane Bromley, James W Bentz, Léon Bottou, Isabelle Guyon, Yann LeCun, Cliff Moore, Eduard Säckinger, and Roopak Shah. Signature verification using a “Siamese” time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence, 1993.
- Susan Carey. Less may never mean more. Recent advances in the psychology of language, 1978.
- Susan Carey and Elsa Bartlett. Acquiring a single new word. 1978.
- Rich Caruana. Multitask learning. In Learning to learn. Springer, 1998.
- Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In IEEE Conference on Computer Vision and Pattern Recognition, 2005.
- Brian Chu, Vashisht Madhavan, Oscar Beijbom, Judy Hoffman, and Trevor Darrell. Best practices for fine-tuning visual classifiers to new domains. In European Conference on Computer Vision workshops. Springer, 2016.
- Li Fei-Fei, Rob Fergus, and Pietro Perona. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006.
- Chelsea Finn and Sergey Levine. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. In International Conference on Learning Representations, 2018.
- Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017.
- Victor Garcia and Joan Bruna. Few-shot learning with graph neural networks. In International Conference on Learning Representations, 2018.
- Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
- Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.
- Bharath Hariharan and Ross B Girshick. Low-shot visual recognition by shrinking and hallucinating features. In IEEE International Conference on Computer Vision, 2017.
- Catalin Ionescu, Orestis Vantzos, and Cristian Sminchisescu. Training deep networks with structured layers by matrix backpropagation. arXiv preprint arXiv:1509.07838, 2015.
- Łukasz Kaiser, Ofir Nachum, Aurko Roy, and Samy Bengio. Learning to remember rare events. In International Conference on Learning Representations, 2017.
- Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
- Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In International Conference on Machine Learning workshops, 2015.
- Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
- Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 2015.
- Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pp. 2113–2122, 2015.
- Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation. 1989.
- Erik G Miller, Nicholas E Matsakis, and Paul A Viola. Learning from one example through shared densities on transforms. In IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2000.
- Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive metalearner. In International Conference on Learning Representations, 2018.
- Tom M Mitchell. The need for biases in learning generalizations. Department of Computer Science, Laboratory for Computer Science Research, Rutgers Univ. New Jersey, 1980.
- Tom M Mitchell et al. Machine learning. 1997. Burr Ridge, IL: McGraw Hill, 1997.
- Tsendsuren Munkhdalai and Hong Yu. Meta networks. In International Conference on Machine Learning, 2017.
- Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
- Devang K Naik and RJ Mammone. Meta-neural networks that learn by learning. In Neural Networks, 1992. IJCNN., International Joint Conference on. IEEE, 1992.
- Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. CoRR, 2018. URL http://arxiv.org/abs/1803.02999.
- Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
- Kaare Brandt Petersen, Michael Syskind Pedersen, et al. The matrix cookbook. Technical University of Denmark, 2008.
- Siyuan Qiao, Chenxi Liu, Wei Shen, and Alan L. Yuille. Few-shot image recognition by predicting parameters from activations. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
- Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2017.
- Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems, 2017.
- Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B Tenenbaum, Hugo Larochelle, and Richard S Zemel. Meta-learning for semi-supervised few-shot classification. In International Conference on Learning Representations, 2018.
- Sebastian Ruder. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098, 2017.
- Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. 2015.
- Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Metalearning with memory-augmented neural networks. In International Conference on Machine Learning, 2016.
- Jürgen Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. PhD thesis, Technische Universität München, 1987.
- Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 1992.
- Jürgen Schmidhuber. A neural network that embeds its own meta-levels. In Neural Networks, 1993., IEEE International Conference on. IEEE, 1993.
- Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, 2017.
- Pablo Sprechmann, Siddhant M Jayakumar, Jack W Rae, Alexander Pritzel, Adrià Puigdomènech Badia, Benigno Uria, Oriol Vinyals, Demis Hassabis, Razvan Pascanu, and Charles Blundell. Memory-based parameter adaptation. 2018.
- Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
- Albert Tarantola. Inverse problem theory and methods for model parameter estimation, volume 89. siam, 2005.
- Sebastian Thrun. Is learning the n-th thing any easier than learning the first? In Advances in Neural Information Processing Systems, 1996.
- Sebastian Thrun. Lifelong learning algorithms. In Learning to learn. Springer, 1998.
- Sebastian Thrun and Lorien Pratt. Learning to learn. Springer Science & Business Media, 1998.
- Paul E Utgoff. Shift of bias for inductive concept learning. Machine learning: An artificial intelligence approach, 1986.
- Jack Valmadre, Luca Bertinetto, João Henriques, Andrea Vedaldi, and Philip HS Torr. End-to-end representation learning for correlation filter based tracking. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- Ricardo Vilalta and Youssef Drissi. A perspective view and survey of meta-learning. Artificial Intelligence Review, 2002.
- Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, 2016.
- Yuxin Wu and Kaiming He. Group normalization. CoRR, 2018. URL http://arxiv.org/abs/1803.08494.
- Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, 2014.
- A Steven Younger, Sepp Hochreiter, and Peter R Conwell. Meta-learning with backpropagation. In Neural Networks, 2001. Proceedings. IJCNN’01. International Joint Conference on. IEEE, 2001.
- Contributions within the few-shot learning paradigm. In this work, we evaluated our proposed methods R2-D2 and LR-D2 in the few-shot learning scenario (Fei-Fei et al., 2006; Lake et al., 2015; Vinyals et al., 2016; Ravi & Larochelle, 2017; Hariharan & Girshick, 2017), which consists in learning how to discriminate between images given one or very few examples. For methods tackling this problem, it is common practice to organise the training procedure in two nested loops. The inner loop is used to solve the actual few-shot classification problem, while the outer loop serves as a guidance for the former by gradually modifying the inductive bias of the base learner (Vilalta & Drissi, 2002). Differently from standard classification benchmarks, the few-shot ones enforce that classes are disjoint between dataset splits.
- Within this landscape, our work proposes a novel technique (R2-D2) that does allow per-episode adaptation while at the same time being fast (Table 4) and achieving strong performance (Table 1). The key innovation is to use a simple (and differentiable) solver such as ridge regression within the inner loop, which requires back-propagating through the solution of a learning problem. Crucially, its closed-form solution and the use of the Woodbury identity (particularly advantageous in the low-data regime) allow this non-trivial endeavour to be efficient. We further demonstrate that this strategy is not limited to the ridge regression case, but can also be extended to other solvers (LR-D2) by dividing the problem into a short series of weighted least-squares problems (Murphy, 2012, Chapter 8.3.4).
- The importance of considering adaptation during training. Considering adaptation during training is also one of the main traits that differentiate our approach from basic transfer learning approaches in which a neural network is first pre-trained on one dataset/task and then adapted to a different dataset/task by simply adapting the final layer(s) (e.g. Yosinski et al. (2014); Chu et al. (2016)).
- The regularization term can be seen as a Gaussian prior distribution over the parameters in a Bayesian interpretation, or more simply as Tikhonov regularization (Tarantola, 2005). In the most common case of λI, it corresponds to an isotropic Gaussian prior on the parameters.
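The LR-D2 idea mentioned above rests on a standard fact: Newton's method for (L2-regularized) logistic regression is iteratively reweighted least squares, i.e. each step is itself a weighted ridge problem with a closed-form solution. A self-contained NumPy sketch of this view, with toy data and hyper-parameters chosen for illustration only:

```python
import numpy as np

def logreg_irls(X, y, lam=1.0, n_steps=5):
    """Binary logistic regression via iteratively reweighted least squares.

    Each Newton step solves a weighted ridge problem in closed form:
        w <- (X^T S X + lam*I)^{-1} X^T S z
    with per-sample weights s = mu*(1-mu) and working targets
        z = X w + (y - mu) / s.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_steps):
        mu = 1.0 / (1.0 + np.exp(-X @ w))           # predicted probabilities
        s = mu * (1.0 - mu)                          # Newton curvature weights
        z = X @ w + (y - mu) / np.maximum(s, 1e-8)   # working targets
        XtS = X.T * s                                # X^T S without forming S
        w = np.linalg.solve(XtS @ X + lam * np.eye(d), XtS @ z)
    return w

# Toy linearly separable problem.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 4))
y = (X @ np.array([1.0, -2.0, 0.5, 0.0]) > 0).astype(float)

w = logreg_irls(X, y)
acc = float(((X @ w > 0).astype(float) == y).mean())
```

Because every iteration reduces to the same kind of linear solve as ridge regression, the Woodbury trick and back-propagation through the solver carry over to this iterative base learner as well.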

