Shaping Visual Representations with Language for Few-shot Classification

ACL 2020.

Abstract:

Language is designed to convey useful information about the world, thus serving as a scaffold for efficient human learning. How can we let language guide representation learning in machine learning models? We explore this question in the setting of few-shot visual classification, proposing models which learn to perform visual classification …
Introduction
  • Humans are powerful and data-efficient learners partially due to the ability to learn from language [6, 30]: for instance, we can learn about robins not by seeing thousands of examples, but by being told that a robin is a bird with a red belly and brown feathers.
  • Compared to meta-learning baselines and recent approaches which use language as a more fundamental bottleneck in the model [1], the authors find that this simple auxiliary training objective results in learned representations that generalize better to new concepts.
Highlights
  • Humans are powerful and data-efficient learners partially due to the ability to learn from language [6, 30]: for instance, we can learn about robins not by seeing thousands of examples, but by being told that a robin is a bird with a red belly and brown feathers
  • Our models can operate without language at test time: a more practical setting, since it is often unrealistic to assume that linguistic supervision is available for unseen classes encountered in the wild
  • To identify which aspects of language are most helpful for the model, we examine language-shaped learning performance under ablated language supervision: (1) keeping only a list of common color words, (2) filtering out color words, (3) shuffling the words in each caption, and (4) shuffling the captions across tasks (Figure 3)
  • We find that while the benefits of color-only and no-color language vary between ShapeWorld and Birds, neither component alone is sufficient for the full benefit of language supervision, demonstrating that language-shaped learning is able to leverage both colors and other attributes exposed through language
  • We presented a method for regularizing a few-shot visual recognition model by forcing the model to predict natural language descriptions during training
  • The language-influenced representations learned with such models improved generalization over those learned without linguistic supervision (a minimal sketch of the training objective follows below)
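The auxiliary objective behind language-shaped learning (LSL) is simple: the image encoder fθ used for few-shot classification also feeds a small language decoder gφ that must reproduce the concept's natural language description during training, and the decoder is discarded at test time. Below is a minimal PyTorch-style sketch under assumed, illustrative details: the module names (ImageEncoder, CaptionDecoder), the dot-product prototype classifier, and the weight lambda_lang are stand-ins, not the authors' exact architecture or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    """Stand-in for the image encoder f_theta: maps images to embedding vectors."""
    def __init__(self, dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, images):
        return self.conv(images).flatten(1)            # (batch, dim)

class CaptionDecoder(nn.Module):
    """Stand-in for g_phi: a GRU that predicts description tokens from an image embedding."""
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, img_emb, tokens):
        # Teacher forcing: the image embedding initializes the GRU state.
        h0 = img_emb.unsqueeze(0)                      # (1, batch, dim)
        hidden, _ = self.gru(self.embed(tokens[:, :-1]), h0)
        return self.out(hidden)                        # logits predicting tokens[:, 1:]

def lsl_training_step(encoder, decoder, support, caption, query, query_label, lambda_lang=1.0):
    """One episode: few-shot classification loss + auxiliary caption-prediction loss.

    support: (K, 3, H, W) images; caption: (1, T) token ids for the concept description;
    query: (1, 3, H, W); query_label: float tensor like torch.tensor([1.0]).
    """
    z_support = encoder(support)                       # (K, dim)
    prototype = z_support.mean(dim=0, keepdim=True)    # concept representation
    score = (encoder(query) * prototype).sum(dim=1)    # similarity logit for the query
    cls_loss = F.binary_cross_entropy_with_logits(score, query_label)

    # Auxiliary objective: decode the concept description from each support embedding.
    caps = caption.expand(z_support.size(0), -1)       # same description for all K supports
    logits = decoder(z_support, caps)
    lang_loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        caps[:, 1:].reshape(-1),
    )
    # lambda_lang trades off the two terms; g_phi (the decoder) is discarded at test time.
    return cls_loss + lambda_lang * lang_loss
```

At test time only the encoder and the prototype comparison remain; the decoder exists purely to shape the learned representation during training.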
Methods
  • The authors use the ShapeWorld [20] dataset devised by [1], which consists of 9000 training, 1000 validation, and 4000 test tasks (Figure 2).
  • Each task contains a single support set of K = 4 images representing a visual concept with an associated English language description, generated with a minimal recursion semantics representation of the concept [7].
  • The task is to predict whether a single query image x belongs to the concept (a code sketch of this episode structure follows below).
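For concreteness, a ShapeWorld-style episode as described above can be packaged roughly as follows; the field names and the threshold-at-zero decision rule are illustrative assumptions, and `encoder` is any image encoder of the kind sketched after the Highlights. The description is consulted only by the auxiliary training loss, never at test time.

```python
from dataclasses import dataclass
import torch
import torch.nn as nn

@dataclass
class ShapeWorldTask:
    """One episode: K = 4 support images of a concept, its English description
    (used at training time only), a single query image, and a binary membership label."""
    support: torch.Tensor      # (4, 3, H, W)
    description: str           # e.g. "a red cross is below a square"
    query: torch.Tensor        # (1, 3, H, W)
    label: torch.Tensor        # tensor([1.0]) if the query fits the concept, else tensor([0.0])

@torch.no_grad()
def predict(encoder: nn.Module, task: ShapeWorldTask) -> bool:
    """Test-time prediction uses images only; no language is available or needed."""
    prototype = encoder(task.support).mean(dim=0, keepdim=True)
    score = (encoder(task.query) * prototype).sum(dim=1)
    return bool(score.item() > 0)
```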
Results
  • For ShapeWorld, LSL outperforms its meta-learning baseline (Meta) by 6.7 points, and does as well as L3.
  • For Birds, the authors observe a smaller but still significant 3.3 point increase over Meta, while L3’s performance drops below baseline.
  • The authors find that LSL is the superior yet conceptually simpler model, and that L3’s discrete bottleneck can hurt in some settings.
  • When the captions are shuffled and the linguistic signal is random, LSL for Birds suffers no … (the caption manipulations used in these ablations are sketched below)
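The ablations referenced in these results manipulate only the training captions; the model itself is unchanged. A hedged sketch of the four manipulations listed in the Highlights is below; the color word list and helper names are illustrative, not the authors' exact preprocessing.

```python
import random

# Illustrative set of common color words; the ablation filters on a list like this.
COLOR_WORDS = {"red", "blue", "green", "yellow", "white", "black", "gray", "brown"}

def color_only(caption: str) -> str:
    """Ablation (1): keep only common color words."""
    return " ".join(w for w in caption.split() if w in COLOR_WORDS)

def no_color(caption: str) -> str:
    """Ablation (2): filter out color words."""
    return " ".join(w for w in caption.split() if w not in COLOR_WORDS)

def shuffle_words(caption: str, rng: random.Random) -> str:
    """Ablation (3): shuffle the words within each caption, destroying word order."""
    words = caption.split()
    rng.shuffle(words)
    return " ".join(words)

def shuffle_captions(captions: list, rng: random.Random) -> list:
    """Ablation (4): permute captions across tasks, so language is unrelated to the images."""
    return rng.sample(captions, k=len(captions))
```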
Conclusion
  • The authors presented a method for regularizing a few-shot visual recognition model by forcing the model to predict natural language descriptions during training.
  • Unlike attributes and annotator rationales, language (1) is a more natural medium for annotators, (2) does not require preconceived restrictions on the kinds of features relevant to the task, and (3) is abundant in unsupervised forms.
  • This last point suggests that representations could also be shaped with language from external resources, a promising direction for future work.
Tables
  • Table 1: Model test accuracies (± 95% CI) across 1000 (ShapeWorld) and 600 (Birds) test tasks
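The ± values in Table 1 are 95% confidence intervals over per-task test accuracy (1000 ShapeWorld tasks, 600 Birds tasks). Assuming a standard normal-approximation interval, which is a common choice but not necessarily the paper's exact estimator, the margin can be computed as follows.

```python
import math

def mean_and_ci95(task_accuracies):
    """Mean accuracy and 95% CI half-width over independent test tasks
    (normal approximation: 1.96 standard errors of the mean)."""
    n = len(task_accuracies)
    mean = sum(task_accuracies) / n
    var = sum((a - mean) ** 2 for a in task_accuracies) / (n - 1)
    return mean, 1.96 * math.sqrt(var / n)

# e.g. accs = [1.0, 0.0, 1.0, ...] over 1000 tasks -> report "mean ± half-width" as in Table 1
```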
Related work
  • Language has been shown to assist visual classification in various settings, including traditional visual classification with no transfer [16] and settings where language is available at test time in the form of class labels or descriptions for zero-shot [10, 11, 27] or few-shot [24, 33] learning. Unlike these approaches, we study a setting with no language at test time and with unseen test tasks, so language from training can no longer be used as additional class information [cf. 16] or as weak supervision for labeling additional in-domain data [cf. 15]. Our work can thus be seen as an instance of the learning using privileged information (LUPI) framework [31], where richer supervision augments a model during training only.

    In this framework, learning with attributes and other domain-specific rationales has been tackled extensively [8, 9, 29], but language remains relatively unexplored. [13] use METEOR scores between captions as a similarity metric for specializing embeddings for image retrieval, but do not directly …

    [Figure: schematic comparison of Meta (Snell et al., 2017), LSL (ours), and L3 (Andreas et al., 2018). All three share an image encoder fθ over the support and query images; LSL adds an auxiliary LSTM decoder gφ trained to produce the concept description (e.g. "a red cross is below a square") and discarded at test time, while L3 routes classification through the generated description as a discrete bottleneck.]
Funding
  • This work was supported by an NSF Graduate Research Fellowship for JM, a SAIL-Toyota Research Award, and the Office of Naval Research grant ONR MURI N00014-16-1-2007
References
  • [1] J. Andreas, D. Klein, and S. Levine. Learning with latent language. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2166–2179, 2018.
  • [2] D. Bahdanau, F. Hill, J. Leike, E. Hughes, A. Hosseini, P. Kohli, and E. Grefenstette. Learning to understand goal specifications by modelling reward. In International Conference on Learning Representations (ICLR), 2019.
  • [3] O.-M. Camburu, T. Rocktäschel, T. Lukasiewicz, and P. Blunsom. e-SNLI: Natural language inference with natural language explanations. In Advances in Neural Information Processing Systems (NeurIPS), pages 9539–9549, 2018.
  • [4] W.-Y. Chen, Y.-C. Liu, Z. Kira, Y.-C. F. Wang, and J.-B. Huang. A closer look at few-shot classification. In International Conference on Learning Representations (ICLR), 2019.
  • [5] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014.
  • [6] S. Chopra, M. H. Tessler, and N. D. Goodman. The first crank of the cultural ratchet: Learning and transmitting concepts through language. In Proceedings of the 41st Annual Meeting of the Cognitive Science Society, pages 226–232, 2019.
  • [7] A. A. Copestake, G. Emerson, M. W. Goodman, M. Horvat, A. Kuhnle, and E. Muszynska. Resources for building applications with dependency minimal recursion semantics. In International Conference on Language Resources and Evaluation (LREC), 2016.
  • [8] J. Deng, J. Krause, and L. Fei-Fei. Fine-grained crowdsourcing for fine-grained recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 580–587, 2013.
  • [9] J. Donahue and K. Grauman. Annotator rationales for visual recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1395–1402, 2011.
  • [10] M. Elhoseiny, B. Saleh, and A. Elgammal. Write a classifier: Zero-shot learning using purely textual descriptions. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2584–2591, 2013.
  • [11] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems (NeurIPS), pages 2121–2129, 2013.
  • [12] N. Goodman. Fact, Fiction, and Forecast. Harvard University Press, Cambridge, MA, 1955.
  • [13] A. Gordo and D. Larlus. Beyond instance-level image retrieval: Leveraging captions to learn a global visual representation for semantic retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6589–6598, 2017.
  • [14] P. Goyal, S. Niekum, and R. J. Mooney. Using natural language for reward shaping in reinforcement learning. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI), pages 2385–2391, 2019.
  • [15] B. Hancock, P. Varma, S. Wang, M. Bringmann, P. Liang, and C. Ré. Training classifiers with natural language explanations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1884–1895, 2018.
  • [16] X. He and Y. Peng. Fine-grained image classification via combining vision and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5994–6002, 2017.
  • [17] L. A. Hendricks, Z. Akata, M. Rohrbach, J. Donahue, B. Schiele, and T. Darrell. Generating visual explanations. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2016.
  • [18] L. A. Hendricks, R. Hu, T. Darrell, and Z. Akata. Grounding visual explanations. In Proceedings of the European Conference on Computer Vision (ECCV), pages 264–279, 2018.
  • [19] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
  • [20] A. Kuhnle and A. Copestake. ShapeWorld: A new test methodology for multimodal language understanding. arXiv preprint arXiv:1704.04517, 2017.
  • [21] J. Pennington, R. Socher, and C. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
  • [22] N. F. Rajani, B. McCann, C. Xiong, and R. Socher. Explain yourself! Leveraging language models for commonsense reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 4932–4942, 2019.
  • [23] S. Reed, Z. Akata, H. Lee, and B. Schiele. Learning deep representations of fine-grained visual descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 49–58, 2016.
  • [24] E. Schwartz, L. Karlinsky, R. Feris, R. Giryes, and A. M. Bronstein. Baby steps towards few-shot learning with multiple semantics. arXiv preprint arXiv:1906.01905, 2019.
  • [25] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [26] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 4077–4087, 2017.
  • [27] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng. Zero-shot learning through cross-modal transfer. In Advances in Neural Information Processing Systems (NeurIPS), pages 935–943, 2013.
  • [28] S. Srivastava, I. Labutov, and T. Mitchell. Joint concept learning and semantic parsing from natural language explanations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1527–1536, 2017.
  • [29] P. Tokmakov, Y.-X. Wang, and M. Hebert. Learning compositional representations for few-shot recognition. arXiv preprint arXiv:1812.09213, 2018.
  • [30] M. Tomasello. The Cultural Origins of Human Cognition. Harvard University Press, Cambridge, MA, 1999.
  • [31] V. Vapnik and A. Vashist. A new learning paradigm: Learning using privileged information. Neural Networks, 22(5-6):544–557, 2009.
  • [32] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical report, California Institute of Technology, 2011.
  • [33] C. Xing, N. Rostamzadeh, B. N. Oreshkin, and P. O. Pinheiro. Adaptive cross-modal few-shot learning. arXiv preprint arXiv:1902.07104, 2019.

  Our code is publicly available at https://github.com/jayelm/lsl.