A New Meta-Baseline for Few-Shot Learning


Abstract:

Meta-learning has become a popular framework for few-shot learning in recent years, with the goal of learning a model from collections of few-shot classification tasks. While more and more novel meta-learning models are being proposed, our research has uncovered simple baselines that have been overlooked. We present a Meta-Baseline meth…

Introduction
  • While humans have shown an incredible ability to learn from very few examples and generalize to many new examples, current deep learning approaches still rely on large-scale training data.
  • Meta-learning framework for few-shot learning follows the key idea of learning to learn
  • It samples few-shot learning tasks from training samples belonging to the base classes and optimizes the model to perform well on these tasks.
  • Under the framework of meta-learning, the model is directly optimized to perform well on few-shot tasks
  • Motivated by this idea, recent works focus on improving the meta-learning structure, and few-shot learning itself has become a common test bed for evaluating meta-learning algorithms.
  • While more and more meta-learning approaches (Snell et al, 2017; Sung et al, 2018; Gidaris & Komodakis, 2018; Sun et al, 2019; Wang et al, 2019; Finn et al, 2017; Rusu et al, 2019; Lee et al, 2019) are being proposed for few-shot learning, very few efforts (Gidaris & Komodakis, 2018; Chen et al, 2019) have been made to improve the baseline methods
Highlights
  • While humans have shown an incredible ability to learn from very few examples and generalize to many new examples, current deep learning approaches still rely on large-scale training data
  • We evaluate two types of generalization in the context of meta-learning: (i) base class generalization denotes generalization on few-shot classification tasks sampled from unseen data in the base classes, which follows the common definition of generalization but is renamed for clarity; and (ii) novel class generalization denotes generalization on few-shot classification tasks sampled from data in novel classes, which further indicates the transferability of the representation from base classes to novel classes
  • We find that while Meta-Baseline improves base class generalization, its novel class generalization can decrease instead, which indicates that: (i) there is potentially an objective discrepancy in the meta-learning stage, and improving base class generalization may lead to worse novel class generalization; (ii) since the model is pre-trained before meta-learning, classification pre-training may have provided extra transferability for the meta-learning model; (iii) the advantages of meta-learning over Classifier-Baseline should be more obvious when novel classes are more similar to base classes
  • While many novel meta-learning methods have been proposed, and some recent works argue that a pre-trained classifier is good enough for few-shot learning, our Meta-Baseline combines the strengths of both classification pre-training and meta-learning
  • We demonstrate that both pre-training and inheriting a good few-shot classification metric are important for Meta-Baseline to achieve strong performance; they help our model better utilize the pre-trained representations with potentially stronger class transferability
  • Our experiments indicate that there might exist an objective discrepancy in the meta-learning framework for few-shot learning: a meta-learning model that generalizes better on unseen tasks from base classes might perform worse on tasks from novel classes
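The prediction step the highlights describe (class centroids from the support set, scaled cosine similarity, softmax) can be sketched in a minimal, framework-free way. This is an illustrative sketch, not the paper's code; the function names and the temperature value 10.0 are assumptions:

```python
import math

def centroid(vectors):
    """Mean of a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def meta_baseline_probs(support, query, tau=10.0):
    """support: {class_label: [embedding, ...]} from the support set.
    query: a single query embedding.
    Returns {class_label: probability}: a softmax over cosine
    similarities to the per-class centroids, scaled by tau (the
    scaling factor; 10.0 is an arbitrary illustrative value)."""
    labels = sorted(support)
    logits = [tau * cosine(query, centroid(support[k])) for k in labels]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return {k: e / z for k, e in zip(labels, exps)}
```

During the meta-learning stage, the cross-entropy loss on these probabilities is backpropagated through the pre-trained (and not frozen) encoder that produced the embeddings.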
Methods
  • Compared variants: Classifier-Baseline vs. Classifier-Baseline (Euclidean) and Meta-Baseline vs. Meta-Baseline (Euclidean), with ResNet-12 on miniImageNet; average 5-way accuracy (%).
  • The authors compare the chosen epochs with the training epochs selected for Meta-Baseline with pre-training, as shown in Table 5.
  • The authors observe that while Meta-Baseline trained from scratch achieves higher base class generalization, its novel class generalization is much lower than that of Meta-Baseline with pre-training.
  • The authors' results indicate that a potentially important effect of classification pre-training is improving the transferability of the meta-learning model.
  • The authors further observe consistent improvement from pre-training in Meta-Baseline, as shown in Figure 1a
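The Euclidean variants compared above differ from the default only in how a logit is computed from a query embedding and a class centroid. A hedged sketch of the two choices (function names and the scaling value are illustrative):

```python
import math

def euclidean_logit(query, center):
    """Negative squared Euclidean distance to the class centroid,
    as in Prototypical Networks."""
    return -sum((q - c) ** 2 for q, c in zip(query, center))

def cosine_logit(query, center, tau=10.0):
    """Scaled cosine similarity to the class centroid, as in the
    default Meta-Baseline (tau is the scaling factor; 10.0 is an
    illustrative value)."""
    dot = sum(q * c for q, c in zip(query, center))
    nq = math.sqrt(sum(q * q for q in query))
    nc = math.sqrt(sum(c * c for c in center))
    return tau * dot / (nq * nc)
```

Note that the cosine logit discards embedding norms entirely, which is one way the choice of metric interacts with the representations inherited from classification pre-training.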
Results
  • Results on standard benchmarks

    Following the standard setting, the authors conduct experiments on miniImageNet and tieredImageNet; the results are shown in Tables 1 and 2, respectively.
  • While prior works use different data augmentation strategies, the authors choose not to apply data augmentation in the meta-learning stage
  • On both datasets, the authors observe that Meta-Baseline outperforms previous state-of-the-art meta-learning methods by a large margin despite its simple design.
  • Meta-Dataset (Triantafillou et al, 2019) is a new benchmark proposed for few-shot learning; it consists of diverse datasets for training and evaluation
  • They propose to generate few-shot tasks with a variable number of ways and shots, yielding a setting closer to the real world.
  • Meta-Baseline does not improve over Classifier-Baseline under this setting in the experiments, possibly because the average number of shots is high
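The variable-way, variable-shot episodes described above can be sampled roughly as follows. This is a simplified sketch; Meta-Dataset's actual sampling procedure is more elaborate, and all names and bounds here are illustrative:

```python
import random

def sample_episode(dataset, max_ways=10, max_shots=5, n_query=5, rng=None):
    """dataset: {class_label: [sample, ...]}.
    Samples a few-shot task with a random number of ways and a random
    number of shots per class, returning (support, query) where support
    maps each chosen class to its support samples and query is a list
    of (sample, label) pairs."""
    rng = rng or random.Random()
    n_ways = rng.randint(2, min(max_ways, len(dataset)))
    classes = rng.sample(sorted(dataset), n_ways)
    support, query = {}, []
    for c in classes:
        n_shots = rng.randint(1, max_shots)
        # draw support and query samples for this class without overlap
        picked = rng.sample(dataset[c], min(n_shots + n_query, len(dataset[c])))
        support[c] = picked[:n_shots]
        query += [(x, c) for x in picked[n_shots:]]
    return support, query
```

With more samples per class on average, each episode's support set tends to be larger, which is consistent with the observation that high average shot counts favor the simple Classifier-Baseline.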
Conclusion
  • The authors presented a simple yet effective method for few-shot learning, namely Meta-Baseline.
  • The authors' experiments indicate that there might exist an objective discrepancy in the meta-learning framework for few-shot learning: a meta-learning model that generalizes better on unseen tasks from base classes might perform worse on tasks from novel classes.
  • This may explain why some complex meta-learning methods do not obtain significantly better performance.
  • The authors' observations suggest that the objective discrepancy between classes might be a key challenge to tackle in the meta-learning framework for few-shot learning
Tables
  • Table1: Comparison to prior works on miniImageNet. Average 5-way accuracy (%) is reported with 95% confidence interval. refers to applying DropBlock (<a class="ref-link" id="cGhiasi_et+al_2018_a" href="#rGhiasi_et+al_2018_a">Ghiasi et al, 2018</a>) and label smoothing
  • Table2: Comparison to prior works on tieredImageNet. Average 5-way accuracy (%) is reported with 95% confidence interval. * refers to results from (<a class="ref-link" id="cLee_et+al_2019_a" href="#rLee_et+al_2019_a">Lee et al, 2019</a>)
  • Table3: Evaluation on single-class few-shot tasks. Average ROC-AUC score is reported with 95% confidence interval. The backbone is ResNet-12 in all the experiments
  • Table4: Results on ImageNet-800 split. Average 5-way accuracy (%) is reported with 95% confidence interval
  • Table5: Effect of pre-training. Average 5-way accuracy (%), with ResNet-12 on miniImageNet
  • Table6: Effect of inheriting a good metric. Average 5-way accuracy (%), with ResNet-12 on miniImageNet
  • Table7: Effect of dataset properties. Average 5-way accuracy (%). Datasets are shown in Table 8
  • Table8: Variants constructed from tieredImageNet. Scale refers to the size of training set, novel similarity refers to the similarity between base and novel classes
  • Table9: Additional results on Meta-Dataset. Average accuracy (%), with variable numbers of ways and shots. The fo-Proto-MAML method is from (<a class="ref-link" id="cTriantafillou_et+al_2019_a" href="#rTriantafillou_et+al_2019_a">Triantafillou et al, 2019</a>); Classifier and Meta refer to Classifier-Baseline and Meta-Baseline respectively; 1000 tasks are sampled for evaluating Classifier or Meta. Note that Traffic Signs and MSCOCO have no training set
  • Table10: Comparison between Classifier-Baseline and (<a class="ref-link" id="cChen_et+al_2019_a" href="#rChen_et+al_2019_a">Chen et al, 2019</a>)
  • Table11: Comparison to classifier pre-trained with cosine metric, with backbone ResNet-12
Related work
  • The goal of few-shot learning is adapting the classification model to new classes with only a few labelled samples. An early attempt (Fei-Fei et al, 2006) proposed to utilize the knowledge learned in previous classes with a Bayesian implementation. Most recent few-shot learning approaches follow the meta-learning framework, which trains the model with few-shot tasks sampled from the training data. Under this umbrella, various meta-learning architectures for few-shot learning have been designed, which can be roughly categorized into three main types: memory-based methods, optimization-based methods and metric-based methods.

    [Figure: the two-stage pipeline. Pre-training stage: Classifier-Baseline training, with an FC layer for classification on base classes. Meta-learning stage: the representation is transferred; class means are computed from the support set, and cosine scores against the query set give the Meta-Baseline training loss and the Classifier-Baseline / Meta-Baseline evaluation.]

    Memory-based methods. The key idea of memory-based methods is to train a meta-learner with memory to learn novel concepts. (Ravi & Larochelle, 2017) proposes to learn the few-shot optimization algorithm with an LSTM-based meta-learner. (Munkhdalai et al, 2017) modifies the activation values of a network by shifting them according to the task-specific information. In (Santoro et al, 2016), a particular class of memory-augmented neural network is used for meta-learning. (Mishra et al, 2018) proposes a Simple Neural Attentive Meta-Learner using a combination of temporal convolutions and soft attention. In MetaNet (Munkhdalai & Yu, 2017), a fast parameterization approach is proposed for learning meta-level knowledge across tasks.
Funding
  • Darrell's group was supported in part by DoD, BAIR, and BDD
Reference
  • Chen, W.-Y., Liu, Y.-C., Kira, Z., Wang, Y.-C. F., and Huang, J.-B. A closer look at few-shot classification. In International Conference on Learning Representations, 2019.
  • Dvornik, N., Schmid, C., and Mairal, J. Selecting relevant features from a universal representation for few-shot classification. arXiv preprint arXiv:2003.09338, 2020.
  • Fei-Fei, L., Fergus, R., and Perona, P. One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence, 28(4):594–611, 2006.
  • Finn, C., Abbeel, P., and Levine, S. Model-agnostic metalearning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135. JMLR. org, 2017.
  • Ghiasi, G., Lin, T.-Y., and Le, Q. V. Dropblock: A regularization method for convolutional networks. In Advances in Neural Information Processing Systems, pp. 10727– 10737, 2018.
  • Gidaris, S. and Komodakis, N. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4367–4375, 2018.
  • Grant, E., Finn, C., Levine, S., Darrell, T., and Griffiths, T. Recasting gradient-based meta-learning as hierarchical bayes. arXiv preprint arXiv:1801.08930, 2018.
  • He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
  • Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • Lee, K., Maji, S., Ravichandran, A., and Soatto, S. Metalearning with differentiable convex optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10657–10665, 2019.
  • Mishra, N., Rohaninejad, M., Chen, X., and Abbeel, P. A simple neural attentive meta-learner. In International Conference on Learning Representations, 2018.
  • Munkhdalai, T. and Yu, H. Meta networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2554–2563. JMLR. org, 2017.
  • Munkhdalai, T., Yuan, X., Mehri, S., and Trischler, A. Rapid adaptation with conditionally shifted neurons. arXiv preprint arXiv:1712.09926, 2017.
  • Oreshkin, B., Lopez, P. R., and Lacoste, A. Tadam: Task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, pp. 721–731, 2018.
  • Qi, H., Brown, M., and Lowe, D. G. Low-shot learning with imprinted weights. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5822–5830, 2018.
  • Qiao, S., Liu, C., Shen, W., and Yuille, A. L. Few-shot image recognition by predicting parameters from activations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7229–7238, 2018.
  • Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. In In International Conference on Learning Representations (ICLR), 2017.
  • Ren, M., Triantafillou, E., Ravi, S., Snell, J., Swersky, K., Tenenbaum, J. B., Larochelle, H., and Zemel, R. S. Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676, 2018.
  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3): 211–252, 2015.
  • Rusu, A. A., Rao, D., Sygnowski, J., Vinyals, O., Pascanu, R., Osindero, S., and Hadsell, R. Meta-learning with latent embedding optimization. In International Conference on Learning Representations, 2019.
  • Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T. Meta-learning with memory-augmented neural networks. In International conference on machine learning, pp. 1842–1850, 2016.
  • Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087, 2017.
  • Sun, Q., Liu, Y., Chua, T.-S., and Schiele, B. Meta-transfer learning for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 403–412, 2019.
  • Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P. H., and Hospedales, T. M. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208, 2018.
  • Triantafillou, E., Zhu, T., Dumoulin, V., Lamblin, P., Evci, U., Xu, K., Goroshin, R., Gelada, C., Swersky, K., Manzagol, P.-A., et al. Meta-dataset: A dataset of datasets for learning to learn from few examples. arXiv preprint arXiv:1903.03096, 2019.
  • Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. Matching networks for one shot learning. In Advances in neural information processing systems, pp. 3630–3638, 2016.
  • Wang, X., Yu, F., Wang, R., Darrell, T., and Gonzalez, J. E. Tafe-net: Task-aware feature embeddings for low shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1831– 1840, 2019.
  • Xu, B., Wang, N., Chen, T., and Li, M. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.
  • A. Comparison to (Chen et al., 2019)
    We connect our Classifier-Baseline to the one proposed in (Chen et al., 2019) by conducting ablation studies on miniImageNet; the results are shown in Table 10. We observe that fine-tuning is outperformed by a simple nearest-centroid method with cosine metric, and that using a standard ImageNet-like optimizer significantly improves the performance of the pre-trained classifier.
    Classifier-Baseline. In training, we do not replace the last FC layer with a cosine metric as in (Gidaris & Komodakis, 2018; Chen et al., 2019). In evaluation, while (Vinyals et al., 2016) uses a nearest-neighbor method and (Chen et al., 2019) fine-tunes a new FC layer, we use a nearest-centroid method instead.
    Pre-training. Several recent works (Rusu et al., 2019; Qiao et al., 2018; Gidaris & Komodakis, 2018; Sun et al., 2019) also perform classification pre-training before meta-learning, but most of them freeze the pre-trained parameters and train additional parameters on top of the fixed pre-trained representations. In contrast, Meta-Baseline does not freeze the pre-trained representations or introduce any additional parameters; it directly fine-tunes the pre-trained parameters, which is much simpler and arguably a stricter comparison to the pre-trained classifier (i.e. Classifier-Baseline).
    Model architecture. Compared to Matching Networks (Vinyals et al., 2016) and Prototypical Networks (Snell et al., 2017), Meta-Baseline computes class centers as in (Snell et al., 2017), unlike (Vinyals et al., 2016), while it uses cosine similarity as in (Vinyals et al., 2016), unlike (Snell et al., 2017). While (Snell et al., 2017) analyzes the advantage of using Euclidean distance as a Bregman divergence, we observe that inheriting a good metric for the representations of the pre-trained classifier leads to better performance. We also note that the scaling factor (Gidaris & Komodakis, 2018; Qi et al., 2018; Oreshkin et al., 2018) is important for cosine similarity. In addition, Meta-Baseline does not use FCE as in (Vinyals et al., 2016) and does not train with higher few-shot classification ways as in (Snell et al., 2017).
    In (Vinyals et al., 2016), they observe that when novel classes are fine-grained relative to base classes, the meta-learning objective cannot improve a pre-trained classifier. Since being fine-grained likely requires the model to distinguish classes (e.g. different dogs) that belong to the same concept at training time, (Vinyals et al., 2016) hypothesize that the improvements could be attained by also sampling training classes from fine-grained classes. Our results demonstrate that in many cases the limited improvement is due to the objective discrepancy caused by class similarity.
    In (Chen et al., 2019), they observe that a baseline of learning a new FC layer outperforms meta-learning methods when the domain difference gets larger (e.g. novel classes from a different dataset), and they hypothesize this is because further adaptation by fine-tuning is important under domain shift. However, we observe that the advantage of the pre-trained classifier persists even without fine-tuning, suggesting that the pre-trained representations may have stronger transferability compared to meta-learning from scratch.
    We also compare the effect of pre-training after replacing the last linear classifier with a cosine nearest-neighbor metric as proposed in (Gidaris & Komodakis, 2018; Chen et al., 2019); the results are shown in Table 11, where Cosine denotes pre-training with the cosine metric and Linear denotes standard pre-training. On miniImageNet, we observe that Cosine outperforms Linear in 1-shot but has worse performance in 5-shot. On tieredImageNet, we observe that Linear outperforms Cosine in both 1-shot and 5-shot.
    The ResNet-12 backbone consists of 4 residual blocks, each containing 3 convolutional layers. Each convolutional layer has a 3 × 3 kernel, followed by Batch Normalization (Ioffe & Szegedy, 2015) and Leaky ReLU (Xu et al., 2015) with slope 0.1. The numbers of channels of the convolutional layers in the four residual blocks are 64, 128, 256 and 512 respectively, and a 2 × 2 max-pooling layer is applied after each residual block. Images are resized to 80 × 80 before being fed into the network, so a final 5 × 5 global average pooling yields a 512-dimensional feature vector.
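    The spatial arithmetic of this description can be checked with a quick trace. This is a back-of-the-envelope sketch, not the actual implementation; it assumes the 3 × 3 convolutions are padded to preserve spatial size, so only the 2 × 2 max-pooling changes the feature-map resolution:

```python
def resnet12_output_shape(input_size=80, block_channels=(64, 128, 256, 512)):
    """Trace the feature-map size through ResNet-12: each residual
    block preserves spatial size (padded 3x3 convs) and is followed
    by 2x2 max-pooling, which halves it. Global average pooling over
    the final map yields a vector with the last block's channel count."""
    size = input_size
    for _ in block_channels:
        size //= 2  # 2x2 max-pool after each residual block
    return size, block_channels[-1]

final_size, feat_dim = resnet12_output_shape()
# 80 -> 40 -> 20 -> 10 -> 5, then 5x5 global average pooling -> 512-d
```

    This matches the text: four halvings take 80 × 80 down to 5 × 5, which is exactly the window of the final global average pooling.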
    Besides miniImageNet and tieredImageNet, we also observe the objective discrepancy on the large-scale dataset ImageNet-800; the base class generalization and novel class generalization performance are plotted in Figure 4, with both ResNet-18 and ResNet-50 backbones.