Meta-learning with differentiable closed-form solvers.

International Conference on Learning Representations (ICLR), 2019


Abstract

Adapting deep networks to new concepts from few examples is extremely challenging, due to the high computational and data requirements of standard fine-tuning procedures. Most works on meta-learning and few-shot learning have thus focused on simple learning techniques for adaptation, such as nearest neighbors or gradient descent. Nonetheless… [abstract truncated]

Introduction
Highlights
  • Humans can efficiently perform fast mapping (Carey, 1978; Carey & Bartlett, 1978), i.e. learning a new concept after a single exposure
  • The base learner works at the level of individual episodes, which correspond to learning problems characterised by having only a small set of labelled training images available
  • We propose to adopt simple learning algorithms that admit a closed-form solution, such as ridge regression (restated in the sketch after this list)
  • With the aim of allowing efficient adaptation to unseen learning problems, in this paper we explored the feasibility of incorporating fast solvers with closed-form solutions as the base learning component of a meta-learning system
  • R2-D2, the differentiable ridge regression base learner we introduce, is almost as fast as prototypical networks and strikes a useful compromise between not performing adaptation for new episodes and conducting a costly iterative approach
  • Our proposed method achieves an average accuracy that, on miniImageNet and CIFAR-FS, is superior to the state of the art with shallow architectures
  • We showed that our base learners work remarkably well, with excellent results on few-shot learning benchmarks, generalizing to episodes with new classes that were not seen during training
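For concreteness, here is a minimal restatement of the closed-form base learner referenced in the highlights above, in generic notation (X for the support-set features, Y for the one-hot support labels, W for the episode-specific weights, λ for the regularization strength); this is the textbook ridge regression solution rather than a quotation of the paper's exact equations.

```latex
% Ridge regression base learner for one episode:
%   X : n x d matrix of support-set features
%   Y : n x c matrix of one-hot support labels
%   W : d x c episode-specific weights
\begin{aligned}
W^{*} &= \arg\min_{W}\ \lVert XW - Y \rVert^{2} + \lambda \lVert W \rVert^{2} \\
      &= \left( X^{\top} X + \lambda I \right)^{-1} X^{\top} Y
\end{aligned}
```

Because W* is an explicit, differentiable function of X, errors on an episode's query set can be back-propagated through it into the feature extractor, which is what makes this base learner usable inside a meta-learning loop.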
Methods
  • 3.1 META-LEARNING

    According to widely accepted definitions of learning (Mitchell, 1980) and meta-learning (Vilalta & Drissi, 2002; Vinyals et al, 2016), an algorithm is “learning to learn” if it can improve its learning skills with the number of experienced episodes.
  • The meta-learner learns from several such episodes in sequence, with the goal of improving the performance of the base learner across episodes (a minimal sketch of this episodic loop follows this list).
  • In order to ensure a fair comparison, the authors increased the capacity of the architectures of three representative methods (MAML, prototypical networks and GNN) to match that of their own model.
  • The results of these experiments are reported with a ∗ in Table 1.
  • The authors also report results for experiments on R2-D2 in which a 64-channel embedding is used
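To make the nested base/meta structure above concrete, the sketch below shows one meta-training step in PyTorch. It is a hedged illustration under assumed names and shapes (`embed`, `support_x`, `query_x`, a fixed λ, and a 5-way episode), not the authors' released implementation; in the paper, λ and a scale/bias on the logits are also meta-learned, which is omitted here for brevity.

```python
import torch
import torch.nn.functional as F


def meta_train_step(embed, optimizer, episode, lam=1.0, n_way=5):
    """One meta-training step: solve the episode in closed form with ridge
    regression (base learner), then back-propagate the query loss through
    that solution to update the shared embedding (meta-learner)."""
    support_x, support_y, query_x, query_y = episode   # tensors for one episode

    X = embed(support_x)                      # (n_support, d) support features
    Z = embed(query_x)                        # (n_query, d) query features
    Y = F.one_hot(support_y, n_way).float()   # (n_support, n_way) regression targets

    # Closed-form ridge regression, primal form: W = (X^T X + lam I)^-1 X^T Y.
    # (The Woodbury/dual form discussed in the Results section below is
    # preferable when d is much larger than n_support.)
    d = X.shape[1]
    A = X.t() @ X + lam * torch.eye(d, device=X.device)
    W = torch.linalg.solve(A, X.t() @ Y)      # (d, n_way) episode-specific weights

    # Query predictions and meta-loss; gradients flow through the solve.
    logits = Z @ W
    loss = F.cross_entropy(logits, query_y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A meta-training run would simply call this function once per sampled episode, so that the embedding (and, in the full method, the solver's hyper-parameters) improves as more episodes are experienced.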
Results
  • In order to produce the features X for the base learners, as in many recent methods, the authors use a shallow network of four convolutional “blocks”, each consisting of the following sequence: a 3×3 convolution, batch normalization, 2×2 max-pooling, and a leaky ReLU with negative slope 0.1.
  • Dropout is applied to the last two blocks for the experiments on miniImageNet and CIFAR-FS, respectively with probabilities 0.1 and 0.4.
  • The authors flatten and concatenate the output of the third and fourth convolutional blocks and feed it to the base learner.
  • The authors obtain high-dimensional features of size 3584, 72576 and 8064 for Omniglot, miniImageNet and CIFAR-FS respectively.
  • Applying the Woodbury identity yields significant computational gains: in eq. 5 the matrix to invert is only 5×5 (the number of support samples) rather than 72576×72576 (the feature dimensionality); the identity is restated below
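The identity referred to in the last bullet can be stated explicitly. With X of size n×d (n support samples, d feature dimensions), the standard push-through/Woodbury manipulation rewrites the ridge solution so that only an n×n matrix needs inverting; this is restated here in generic notation rather than copied from the paper's eq. 5.

```latex
% Ridge regression in primal and dual (Woodbury) form:
W = \left( X^{\top} X + \lambda I_{d} \right)^{-1} X^{\top} Y
  \;=\; X^{\top} \left( X X^{\top} + \lambda I_{n} \right)^{-1} Y
```

For a 5-way, 1-shot miniImageNet episode, n = 5 while d = 72576, so the right-hand form inverts a 5×5 matrix instead of a 72576×72576 one.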
Conclusion
  • With the aim of allowing efficient adaptation to unseen learning problems, in this paper the authors explored the feasibility of incorporating fast solvers with closed-form solutions as the base learning component of a meta-learning system.
  • R2-D2, the differentiable ridge regression base learner the authors introduce, is almost as fast as prototypical networks and strikes a useful compromise between not performing adaptation for new episodes and conducting a costly iterative approach.
  • The authors showed that the base learners work remarkably well, with excellent results on few-shot learning benchmarks, generalizing to episodes with new classes that were not seen during training.
  • In future work, the authors would like to explore Newton's methods with more complicated second-order structure than ridge regression.
Tables
  • Table1: Few-shot multi-class classification accuracies on miniImageNet and CIFAR-FS
  • Table2: Few-shot multi-class classification accuracies on Omniglot
  • Table3: Few-shot binary classification accuracies on miniImageNet and CIFAR-FS
  • Table4: Time required to solve 10,000 miniImageNet episodes of 10 samples each
Related Work
  • The topic of meta-learning gained importance in the machine learning community several decades ago, with the first examples already appearing in the eighties and early nineties (Utgoff, 1986; Schmidhuber, 1987; Naik & Mammone, 1992; Bengio et al, 1992; Thrun & Pratt, 1998). Utgoff (1986) proposed a framework describing when and how it is useful to dynamically adjust the inductive bias of a learning algorithm, thus implicitly “changing the ordering” of the elements of its hypothesis space (Vilalta & Drissi, 2002). Later, Bengio et al (1992) interpreted the update rule of a neural network’s weights as a function that is learnable. Another seminal work is the one of Thrun (1996), which presents the so-called lifelong learning scenario, where a learning algorithm gradually encounters an ordered sequence of learning problems. Throughout this course, the learner can benefit from re-using the knowledge accumulated during previous tasks. In later work, Thrun & Pratt (1998) stated that an algorithm is learning to learn if “[...] its performance at each task improves with experience and with the number of tasks”. This characterisation has been inspired by Mitchell et al (1997)’s definition of a learning algorithm as a computer program whose performance on a task improves with experience. Similarly, Vilalta & Drissi (2002) explained meta-learning as organised in two “nested learning levels”. At the base level, an algorithm is confined within a limited hypothesis space while solving a single learning problem. Contrarily, the meta-level can “accrue knowledge” by spanning multiple problems, so that the hypothesis space at the base level can be adapted effectively.
Funding
  • This work was partially supported by the ERC grant 638009-IDIU
Figure
  • Diagram of the proposed method for one episode, of which several are seen during meta-training. The task is to learn new classes given just a few sample images per class. In this illustrative example, there are 3 classes and 2 samples per class, making each episode a 3-way, 2-shot classification problem. At the base learning level, learning is accomplished by a differentiable ridge regression layer (R.R.), which computes episode-specific weights (referred to as wE in Section 3.1 and as W in Section 3.2). At the meta-training level, by back-propagating errors through many of these small learning problems, we train a network whose weights are shared across episodes, together with the hyper-parameters of the R.R. layer. In this way, the R.R. base learner can improve its learning capabilities as the number of experienced episodes increases.

References
  • Han Altae-Tran, Bharath Ramsundar, Aneesh S Pappu, and Vijay Pande. Low data drug discovery with one-shot learning. ACS central science, 2017.
  • Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, 2016.
  • Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, pp. 6–8. Univ. of Texas, 1992.
  • Yoshua Bengio. Gradient-based optimization of hyperparameters. Neural computation, 2000.
  • Luca Bertinetto, João F Henriques, Jack Valmadre, Philip Torr, and Andrea Vedaldi. Learning feed-forward one-shot learners. In Advances in Neural Information Processing Systems, 2016.
  • Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006.
  • Jane Bromley, James W Bentz, Léon Bottou, Isabelle Guyon, Yann LeCun, Cliff Moore, Eduard Säckinger, and Roopak Shah. Signature verification using a “Siamese” time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence, 1993.
  • Susan Carey. Less may never mean more. Recent advances in the psychology of language, 1978.
  • Susan Carey and Elsa Bartlett. Acquiring a single new word. 1978.
  • Rich Caruana. Multitask learning. In Learning to learn. Springer, 1998.
  • Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In IEEE Conference on Computer Vision and Pattern Recognition, 2005.
  • Brian Chu, Vashisht Madhavan, Oscar Beijbom, Judy Hoffman, and Trevor Darrell. Best practices for fine-tuning visual classifiers to new domains. In European Conference on Computer Vision workshops. Springer, 2016.
  • Li Fei-Fei, Rob Fergus, and Pietro Perona. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006.
  • Chelsea Finn and Sergey Levine. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. In International Conference on Learning Representations, 2018.
  • Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017.
  • Victor Garcia and Joan Bruna. Few-shot learning with graph neural networks. In International Conference on Learning Representations, 2018.
  • Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.
  • Bharath Hariharan and Ross B Girshick. Low-shot visual recognition by shrinking and hallucinating features. In IEEE International Conference on Computer Vision, 2017.
  • Catalin Ionescu, Orestis Vantzos, and Cristian Sminchisescu. Training deep networks with structured layers by matrix backpropagation. arXiv preprint arXiv:1509.07838, 2015.
  • Łukasz Kaiser, Ofir Nachum, Aurko Roy, and Samy Bengio. Learning to remember rare events. In International Conference on Learning Representations, 2017.
  • Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
  • Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In International Conference on Machine Learning workshops, 2015.
  • Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
  • Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 2015.
  • Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pp. 2113–2122, 2015.
  • Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation. 1989.
  • Erik G Miller, Nicholas E Matsakis, and Paul A Viola. Learning from one example through shared densities on transforms. In IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2000.
  • Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive metalearner. In International Conference on Learning Representations, 2018.
  • Tom M Mitchell. The need for biases in learning generalizations. Department of Computer Science, Laboratory for Computer Science Research, Rutgers Univ. New Jersey, 1980.
  • Tom M Mitchell et al. Machine learning. McGraw Hill, Burr Ridge, IL, 1997.
  • Tsendsuren Munkhdalai and Hong Yu. Meta networks. In International Conference on Machine Learning, 2017.
  • Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
  • Devang K Naik and RJ Mammone. Meta-neural networks that learn by learning. In International Joint Conference on Neural Networks (IJCNN). IEEE, 1992.
  • Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. CoRR, 2018. URL http://arxiv.org/abs/1803.02999.
  • Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
  • Kaare Brandt Petersen, Michael Syskind Pedersen, et al. The matrix cookbook. Technical University of Denmark, 2008.
  • Siyuan Qiao, Chenxi Liu, Wei Shen, and Alan L. Yuille. Few-shot image recognition by predicting parameters from activations. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2017.
  • Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems, 2017.
  • Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B Tenenbaum, Hugo Larochelle, and Richard S Zemel. Meta-learning for semi-supervised few-shot classification. In International Conference on Learning Representations, 2018.
  • Sebastian Ruder. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098, 2017.
  • Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 2015.
  • Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Metalearning with memory-augmented neural networks. In International Conference on Machine Learning, 2016.
  • Jürgen Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. PhD thesis, Technische Universität München, 1987.
  • Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 1992.
  • Jürgen Schmidhuber. A neural network that embeds its own meta-levels. In IEEE International Conference on Neural Networks. IEEE, 1993.
  • Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, 2017.
  • Pablo Sprechmann, Siddhant M Jayakumar, Jack W Rae, Alexander Pritzel, Adrià Puigdomènech Badia, Benigno Uria, Oriol Vinyals, Demis Hassabis, Razvan Pascanu, and Charles Blundell. Memory-based parameter adaptation. 2018.
  • Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • Albert Tarantola. Inverse problem theory and methods for model parameter estimation, volume 89. SIAM, 2005.
  • Sebastian Thrun. Is learning the n-th thing any easier than learning the first? In Advances in Neural Information Processing Systems, 1996.
  • Sebastian Thrun. Lifelong learning algorithms. In Learning to learn. Springer, 1998.
  • Sebastian Thrun and Lorien Pratt. Learning to learn. Springer Science & Business Media, 1998.
  • Paul E Utgoff. Shift of bias for inductive concept learning. Machine learning: An artificial intelligence approach, 1986.
  • Jack Valmadre, Luca Bertinetto, João Henriques, Andrea Vedaldi, and Philip HS Torr. End-to-end representation learning for correlation filter based tracking. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • Ricardo Vilalta and Youssef Drissi. A perspective view and survey of meta-learning. Artificial Intelligence Review, 2002.
  • Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, 2016.
  • Yuxin Wu and Kaiming He. Group normalization. CoRR, 2018. URL http://arxiv.org/abs/1803.08494.
  • Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, 2014.
  • A Steven Younger, Sepp Hochreiter, and Peter R Conwell. Meta-learning with backpropagation. In International Joint Conference on Neural Networks (IJCNN). IEEE, 2001.
  • Contributions within the few-shot learning paradigm. In this work, we evaluated our proposed methods R2-D2 and LR-D2 in the few-shot learning scenario (Fei-Fei et al., 2006; Lake et al., 2015; Vinyals et al., 2016; Ravi & Larochelle, 2017; Hariharan & Girshick, 2017), which consists in learning how to discriminate between images given one or very few examples. For methods tackling this problem, it is common practice to organise the training procedure in two nested loops. The inner loop is used to solve the actual few-shot classification problem, while the outer loop serves as a guidance for the former by gradually modifying the inductive bias of the base learner (Vilalta & Drissi, 2002). Differently from standard classification benchmarks, the few-shot ones enforce that classes are disjoint between dataset splits.
  • Within this landscape, our work proposes a novel technique (R2-D2) that does allow per-episode adaptation while at the same time being fast (Table 4) and achieving strong performance (Table 1). The key innovation is to use a simple (and differentiable) solver such as ridge regression within the inner loop, which requires back-propagating through the solution of a learning problem. Crucially, its closed-form solution and the use of the Woodbury identity (particularly advantageous in the low-data regime) make this non-trivial endeavour efficient. We further demonstrate that this strategy is not limited to the ridge regression case: it can also be extended to other solvers (LR-D2) by dividing the problem into a short series of weighted least-squares problems (Murphy, 2012, Chapter 8.3.4); a minimal sketch of this scheme is given below.
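As a hedged sketch of the second solver (LR-D2) mentioned in the paragraph above: binary logistic regression solved by Newton's method reduces, at each iteration, to a weighted least-squares problem of the same shape as ridge regression, following the textbook IRLS formulation in Murphy (2012, Section 8.3.4). The code below is an illustration under assumed names and a fixed iteration count, not the authors' implementation, and it omits the Woodbury acceleration and the differentiable meta-learning wrapper.

```python
import numpy as np


def logistic_regression_irls(X, y, lam=1.0, n_steps=5):
    """Binary logistic regression via iteratively reweighted least squares
    (Newton's method). Each step is a weighted ridge-style solve, so it can
    be treated like the closed-form base learner inside the inner loop.

    X: (n, d) features; y: (n,) labels in {0, 1}; lam: L2 regularization."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_steps):
        mu = 1.0 / (1.0 + np.exp(-X @ w))           # predicted probabilities
        s = mu * (1.0 - mu)                          # per-sample Newton weights
        # Working targets of the weighted least-squares step (guard against s -> 0).
        z = X @ w + (y - mu) / np.maximum(s, 1e-8)
        # Weighted ridge solve: w = (X^T S X + lam I)^-1 X^T S z.
        XtS = X.T * s                                # equals X^T @ diag(s)
        w = np.linalg.solve(XtS @ X + lam * np.eye(d), XtS @ z)
    return w


# Tiny shape-only usage example with random data (not a benchmark):
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10, 8))
y_demo = (rng.normal(size=10) > 0).astype(float)
w_demo = logistic_regression_irls(X_demo, y_demo)
```

Since an episode's support set contains only a handful of samples, the same Woodbury trick used for ridge regression applies to each weighted solve, keeping the per-iteration cost small.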
  • The importance of considering adaptation during training. Considering adaptation during training is also one of the main traits that differentiate our approach from basic transfer learning approaches in which a neural network is first pre-trained on one dataset/task and then adapted to a different dataset/task by simply adapting the final layer(s) (e.g. Yosinski et al. (2014); Chu et al. (2016)).
  • The regularization term can be seen, in a Bayesian interpretation, as a Gaussian prior distribution over the parameters, or more simply as Tikhonov regularization (Tarantola, 2005). In the most common case of λI, it corresponds to an isotropic Gaussian prior on the parameters (the correspondence is spelled out below).
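To spell out the Bayesian reading mentioned above: with a Gaussian likelihood of fixed variance σ² and an isotropic Gaussian prior on the weights, the MAP estimate coincides with the ridge (Tikhonov) solution. This is a standard derivation restated for clarity, not quoted from the paper.

```latex
% Gaussian likelihood + isotropic Gaussian prior  =>  ridge regression (MAP):
\begin{aligned}
y &= X w + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^{2} I), \qquad
     w \sim \mathcal{N}\!\left(0, \tfrac{\sigma^{2}}{\lambda} I\right) \\[4pt]
\hat{w}_{\mathrm{MAP}} &= \arg\max_{w}\ \log p(y \mid X, w) + \log p(w)
   = \arg\min_{w}\ \lVert X w - y \rVert^{2} + \lambda \lVert w \rVert^{2}
\end{aligned}
```

A non-isotropic prior covariance would correspond to a general Tikhonov regularizer in place of λI.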