On First-Order Meta-Learning Algorithms

Alex Nichol
Joshua Achiam

arXiv preprint arXiv:1803.02999, 2018.

Abstract:

This paper considers meta-learning problems, where there is a distribution of tasks, and we would like to obtain an agent that performs well (i.e., learns quickly) when presented with a previously unseen task sampled from this distribution. We analyze a family of algorithms for learning a parameter initialization that can be fine-tuned quickly on a new task, using only first-order derivatives for the meta-learning updates.

Introduction
  • While machine learning systems have surpassed humans at many tasks, they generally need far more data to reach the same level of performance.
  • It is not completely fair to compare humans to algorithms learning from scratch, since humans enter the task with a large amount of prior knowledge, encoded in their brains and DNA.
  • In practice, it is challenging to develop Bayesian machine learning algorithms that make use of deep neural networks and remain computationally feasible.
Highlights
  • While machine learning systems have surpassed humans at many tasks, they generally need far more data to reach the same level of performance
  • We introduce Reptile, an algorithm closely related to first-order MAML (FOMAML) and simple to implement (a minimal sketch of its outer loop follows this list).
  • We provide a theoretical analysis that applies to both first-order MAML and Reptile, showing that they both optimize for within-task generalization.
  • We find that separate-tail FOMAML is significantly better than shared-tail FOMAML.
  • This paper proposes a new algorithm called Reptile, whose training process is only subtly different from joint training and which uses only first-order gradient information.
  • While this paper studies the meta-learning setting, the Taylor series analysis in Section 5.1 may have some bearing on stochastic gradient descent in general.
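The Reptile outer loop highlighted above is short enough to sketch directly. The toy below is not the authors' released code: the task distribution (matching a random target vector under a squared-error loss), the vector dimension, and the step sizes are placeholder assumptions chosen only to make the update phi <- phi + eps * (phi_adapted - phi) concrete and runnable.

    # Minimal sketch of the Reptile outer loop (toy tasks, not the authors' code).
    # Each "task" asks the parameter vector phi to match a task-specific target;
    # the inner loop runs a few SGD steps on that task, and the outer loop moves
    # the initialization toward the adapted weights.
    import numpy as np

    rng = np.random.default_rng(0)
    dim = 10                                    # placeholder parameter dimension

    def sample_task():
        # Hypothetical task distribution: one random target vector per task.
        return rng.normal(size=dim)

    def inner_sgd(phi, target, inner_lr=0.02, inner_steps=5):
        w = phi.copy()
        for _ in range(inner_steps):
            grad = w - target                   # gradient of 0.5 * ||w - target||^2
            w -= inner_lr * grad
        return w

    phi = np.zeros(dim)                         # the meta-learned initialization
    outer_lr = 0.1
    for _ in range(1000):
        target = sample_task()
        phi_adapted = inner_sgd(phi, target)
        phi += outer_lr * (phi_adapted - phi)   # Reptile update: move toward adapted weights

With a single inner gradient step this update reduces to ordinary joint training on the expected loss, which is why the paper's analysis and experiments emphasize taking several inner steps per task.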
Methods
  • The authors evaluate the method on two popular few-shot classification tasks: Omniglot [11] and MiniImageNet [18].
  • For K-shot, N-way classification, the authors sample tasks by selecting N classes from the full class set C and selecting K + 1 examples for each class (a sampling sketch follows this list).
  • The authors split these examples into a training set and a test set, where the test set contains a single example for each class.
  • If you trained a model for 5-shot, 5-way classification, you would show it 25 examples (5 per class) and ask it to classify a 26th example.
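A minimal sketch of that episode-sampling procedure, assuming a hypothetical class_to_examples dictionary that maps each class label in C to its list of examples (the name and data layout are illustrative, not taken from the paper):

    # Sample one K-shot, N-way episode: N classes, K training examples per class,
    # plus one held-out test example per class.
    import random

    def sample_episode(class_to_examples, n_way=5, k_shot=5, seed=None):
        rng = random.Random(seed)
        classes = rng.sample(list(class_to_examples), n_way)          # choose N classes from C
        train_set, test_set = [], []
        for label, cls in enumerate(classes):
            examples = rng.sample(class_to_examples[cls], k_shot + 1)  # K + 1 examples per class
            train_set += [(x, label) for x in examples[:k_shot]]       # K go to the training set
            test_set.append((examples[k_shot], label))                 # 1 goes to the test set
        return train_set, test_set

With n_way=5 and k_shot=5 this produces the 25 training examples described above plus one held-out example per class to classify.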
Conclusion
  • Meta-learning algorithms that perform gradient descent at test time are appealing because of their simplicity and generalization properties [5].
  • By approximating the update with a Taylor series, the authors showed that SGD automatically gives them the same kind of second-order term that MAML computes.
  • This term adjusts the initial weights to maximize the dot product between the gradients of different minibatches on the same task; that is, it encourages the gradients to generalize between minibatches of the same task (the identity behind this term is written out after this list).
  • The authors provided a second, informal argument: Reptile finds a point that is close to the optimal solution manifolds of all of the training tasks.
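The inner-product term referred to above rests on an elementary identity. Writing g_1, g_2 for the gradients and H_1, H_2 for the Hessians of two minibatch losses evaluated at the initialization phi (notation paraphrased from the paper's Section 5.1), the product rule gives

    % gradients and Hessians of two minibatch losses L_1, L_2 at the initialization \phi
    g_i = \nabla_\phi L_i(\phi), \qquad H_i = \nabla_\phi^2 L_i(\phi),
    \qquad
    \frac{\partial}{\partial \phi}\,\bigl(g_1 \cdot g_2\bigr) \;=\; H_1 g_2 + H_2 g_1 .

As summarized in the paper's analysis, a term proportional to the negative of this derivative appears, in expectation over minibatches, in the MAML, FOMAML, and Reptile gradients, so the corresponding SGD updates move in the direction that increases g_1 · g_2; this is the gradient-agreement effect described in the bullet above.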
Tables
  • Table 1: Results on Mini-ImageNet. Both MAML and 1st-order MAML results are from [4].
  • Table 2: Results on Omniglot. MAML results are from [4]. 1st-order MAML results were generated by the code for [4] with the same hyper-parameters as MAML.
  • Table 3: Reptile hyper-parameters for the Omniglot comparison between all algorithms.
  • Table 4: Reptile hyper-parameters for the Mini-ImageNet comparison between all algorithms.
  • Table 5: Hyper-parameters for Section 6.2. All outer step sizes were linearly annealed to zero during training.
  • Table 6: Hyper-parameters for Section 6.3. All outer step sizes were linearly annealed to zero during training.
References
  • [1] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
  • [2] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
  • [3] Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
  • [4] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
  • [5] Chelsea Finn and Sergey Levine. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. arXiv preprint arXiv:1710.11622, 2017.
  • [6] Nikolaus Hansen. The CMA evolution strategy: a comparing review. In Towards a New Evolutionary Computation, pages 75–102. Springer, 2006.
  • [7] Geoffrey E Hinton and David C Plaut. Using fast weights to deblur old memories. In Proceedings of the Ninth Annual Conference of the Cognitive Science Society, pages 177–186, 1987.
  • [8] Sepp Hochreiter, A Steven Younger, and Peter R Conwell. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, pages 87–94. Springer, 2001.
  • [9] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • [10] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
  • [11] Brenden M. Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua B. Tenenbaum. One shot learning of simple visual concepts. In Conference of the Cognitive Science Society (CogSci), 2011.
  • [12] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
  • [13] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR), 2017.
  • [14] Scott Reed, Yutian Chen, Thomas Paine, Aaron van den Oord, SM Eslami, Danilo Rezende, Oriol Vinyals, and Nando de Freitas. Few-shot autoregressive density estimation: Towards learning to learn distributions. arXiv preprint arXiv:1710.10304, 2017.
  • [15] Ruslan Salakhutdinov, Joshua Tenenbaum, and Antonio Torralba. One-shot learning with a hierarchical nonparametric Bayesian model. In Proceedings of ICML Workshop on Unsupervised and Transfer Learning, pages 195–206, 2012.
  • [16] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pages 1842–1850, 2016.
  • [17] Lauren A Schmidt. Meaning and compositionality as statistical induction of categories and constraints. PhD thesis, Massachusetts Institute of Technology, 2009.
  • [18] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.
  • [19] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.
  • [20] Ning Zhang, Jeff Donahue, Ross Girshick, and Trevor Darrell. Part-based R-CNNs for fine-grained category detection. In European Conference on Computer Vision, pages 834–849. Springer, 2014.
  • [21] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J Smola. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 2595–2603, 2010.