Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

International Conference on Machine Learning, 2015.

We propose an attention-based approach that gives state-of-the-art performance on three benchmark datasets using the BLEU and METEOR metrics.

Abstract:

Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.

Introduction
  • Generating captions of an image is a task very close to the heart of scene understanding — one of the primary goals of computer vision.
  • Not only must caption generation models be powerful enough to solve the computer vision challenges of determining which objects are in an image, but they must be capable of capturing and expressing their relationships in a natural language.
  • For this reason, caption generation has long been viewed as a difficult problem.
  • Aided by advances in training neural networks (Krizhevsky et al, 2012) and large classification datasets (Russakovsky et al, 2014), recent work has significantly improved the quality of caption generation by combining convolutional neural networks, which produce vector representations of images, with recurrent neural networks that decode those representations into natural language sentences
Highlights
  • Automatically generating captions of an image is a task very close to the heart of scene understanding — one of the primary goals of computer vision
  • Not only must caption generation models be powerful enough to solve the computer vision challenges of determining which objects are in an image, but they must be capable of capturing and expressing their relationships in a natural language
  • We describe approaches to caption generation that attempt to incorporate a form of attention with two variants: a "hard" attention mechanism and a "soft" attention mechanism
  • Encouraged by recent advances in caption generation and inspired by recent success in employing attention in machine translation (Bahdanau et al, 2014) and object recognition (Ba et al, 2014; Mnih et al, 2014), we investigate models that can attend to salient parts of an image while generating its caption (a minimal sketch of the soft-attention step follows this list)
  • We report results with the frequently used BLEU metric, which is the standard in the caption generation literature
  • We propose an attention-based approach that gives state-of-the-art performance on three benchmark datasets using the BLEU and METEOR metrics
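The core of the model is the attention step: at each word of the output sequence, the decoder scores a grid of convolutional annotation vectors, normalizes the scores into weights, and conditions on the resulting weighted-average context vector. Below is a minimal sketch of that soft (deterministic) attention step in PyTorch; the layer names and dimensions are illustrative assumptions, not the authors' original Theano implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """One soft-attention step: score each of the L annotation vectors
    against the decoder's hidden state, normalize with a softmax, and
    return the expected (weighted-average) context vector.
    Illustrative sketch; dimensions and layer names are assumptions."""

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)      # projects annotation vectors a_i
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)  # projects previous decoder state
        self.score = nn.Linear(attn_dim, 1)                 # scalar relevance score per location

    def forward(self, features: torch.Tensor, hidden: torch.Tensor):
        # features: (batch, L, feat_dim), e.g. L = 14*14 spatial locations
        # hidden:   (batch, hidden_dim), decoder state from the previous step
        e = self.score(torch.tanh(self.feat_proj(features) +
                                  self.hidden_proj(hidden).unsqueeze(1))).squeeze(-1)  # (batch, L)
        alpha = F.softmax(e, dim=1)                          # attention weights over locations
        context = (alpha.unsqueeze(-1) * features).sum(1)    # expected context vector (batch, feat_dim)
        return context, alpha

# toy usage: 196 locations of 512-d CNN features, 1024-d decoder state
attn = SoftAttention(feat_dim=512, hidden_dim=1024)
ctx, alpha = attn(torch.randn(2, 196, 512), torch.randn(2, 1024))
print(ctx.shape, alpha.shape)  # torch.Size([2, 512]) torch.Size([2, 196])
```

In this soft variant the context vector is an expectation over locations, so the step is differentiable end to end and can be trained with standard backpropagation; the paper's hard variant instead samples a single location and is trained by maximizing a variational lower bound.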
Methods
  • The authors describe the experimental methodology and quantitative results that validate the effectiveness of the model for caption generation.
  • The authors report results on the popular Flickr8k and Flickr30k datasets, which contain 8,000 and 30,000 images respectively, as well as the more challenging Microsoft COCO dataset, which has 82,783 images.
  • The Flickr8k and Flickr30k datasets both come with 5 reference sentences per image; for MS COCO, some images have more than 5 references, which the authors discard for consistency across the datasets.
  • The authors used a fixed vocabulary size of 10,000 words; a sketch of building such a capped vocabulary follows this list.
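The sketch below shows one plausible way to build such a fixed 10,000-token vocabulary from tokenized captions. The special symbols and helper names are illustrative assumptions, not the authors' exact preprocessing.

```python
from collections import Counter

def build_vocab(tokenized_captions, max_words=10000):
    """Keep the max_words most frequent tokens; everything else maps to <unk>.
    The special symbols for padding, sequence start/end, and unknown words are
    illustrative choices, not necessarily the authors' exact preprocessing."""
    counts = Counter(tok for caption in tokenized_captions for tok in caption)
    specials = ["<pad>", "<start>", "<end>", "<unk>"]
    most_common = [w for w, _ in counts.most_common(max_words - len(specials))]
    return {w: i for i, w in enumerate(specials + most_common)}

def encode(caption, word2id):
    """Map a tokenized caption to ids, wrapping it in start/end markers."""
    unk = word2id["<unk>"]
    return ([word2id["<start>"]] +
            [word2id.get(tok, unk) for tok in caption] +
            [word2id["<end>"]])

# toy usage
captions = [["a", "dog", "runs"], ["a", "cat", "sits"]]
vocab = build_vocab(captions, max_words=10000)
print(encode(["a", "dog", "flies"], vocab))  # "flies" falls back to <unk>
```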
Results
  • Evaluation Procedures

    A few challenges complicate comparison with prior work, which the authors explain here. The first is the choice of convolutional feature extractor.
  • For identical decoder architectures, using more recent architectures such as GoogLeNet (Szegedy et al, 2014) or Oxford VGG (Simonyan & Zisserman, 2014) can give a boost in performance over AlexNet (Krizhevsky et al, 2012); a feature-extraction sketch follows this list.
  • The authors compare directly only with results that use comparable GoogLeNet/Oxford VGG features, but for the METEOR comparison they also note some results that use AlexNet. The second challenge is single-model versus ensemble comparison.
  • While other methods have reported performance boosts from ensembling, the authors report single-model performance in their results.
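As a concrete illustration of the feature-extractor choice discussed above, the sketch below pulls a 14x14x512 convolutional feature map (196 annotation vectors) from a VGG-19 trunk using torchvision. This is a modern stand-in, not the authors' original Theano/Oxford VGG pipeline, and the exact layer slice and preprocessing are assumptions.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# VGG-19 convolutional trunk; dropping the final max-pool leaves a
# 14x14x512 feature map for a 224x224 input, i.e. 196 annotation vectors.
# (torchvision stand-in for Oxford VGG features, not the paper's pipeline.)
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
trunk = torch.nn.Sequential(*list(vgg.features.children())[:-1]).eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def annotation_vectors(path: str) -> torch.Tensor:
    """Return a (196, 512) matrix of spatial features for one image."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        fmap = trunk(img)                  # (1, 512, 14, 14)
    return fmap.flatten(2).squeeze(0).t()  # (196, 512)
```

Features from such a lower convolutional layer preserve spatial layout, which is what allows an attention mechanism to select among image locations rather than receiving a single global vector.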
Conclusion
  • The authors propose an attention-based approach that gives state-of-the-art performance on three benchmark datasets using the BLEU and METEOR metrics.
  • The authors show how the learned attention can be exploited to give more interpretability into the model's generation process, and demonstrate that the learned alignments correspond very well to human intuition.
  • The authors hope that the results of this paper will encourage future work in using visual attention.
  • The authors expect the modularity of the encoder-decoder approach, combined with attention, to have useful applications in other domains.
Tables
  • Table 1: BLEU-1,2,3,4/METEOR metrics compared to other methods. † indicates a different split; (—) indicates an unknown metric; ◦ indicates missing metrics kindly provided by the authors via personal communication; Σ indicates an ensemble; a indicates use of AlexNet
Related work
  • In this section we provide relevant background on previous work on image caption generation and attention. Recently, several methods have been proposed for generating image descriptions. Many of these methods are based on recurrent neural networks and inspired by the successful use of sequence-to-sequence training with neural networks for machine translation (Cho et al, 2014; Bahdanau et al, 2014; Sutskever et al, 2014). One major reason image caption generation is well suited to the encoder-decoder framework (Cho et al, 2014) of machine translation is that it is analogous to “translating” an image to a sentence.

    The first approach to use neural networks for caption generation was Kiros et al (2014a), who proposed a multimodal log-bilinear model that was biased by features from the image. This work was later followed by Kiros et al (2014b), whose method was designed to explicitly allow a natural way of doing both ranking and generation. Mao et al (2014) took a similar approach to generation but replaced a feed-forward neural language model with a recurrent one. Both Vinyals et al (2014) and Donahue et al (2014) use LSTM RNNs for their models. Unlike Kiros et al (2014a) and Mao et al (2014), whose models see the image at each time step of the output word sequence, Vinyals et al (2014) only show the image to the RNN at the beginning. Along with images, Donahue et al (2014) also apply LSTMs to videos, allowing their model to generate video descriptions.
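The distinction above, between decoders that see the image at every time step and decoders that see it only once, can be made concrete with a small sketch. The module below feeds the image embedding to an LSTM only as the first input, in the spirit of Vinyals et al (2014); it is an illustrative toy, not the exact architecture of any cited model.

```python
import torch
import torch.nn as nn

class ImageFirstDecoder(nn.Module):
    """Decoder sketch in which the image embedding is fed to the LSTM only
    as the first input; afterwards the LSTM sees only word embeddings.
    (Illustrative toy, not the exact architecture of any cited model.)"""

    def __init__(self, vocab_size, img_dim, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.img_embed = nn.Linear(img_dim, embed_dim)   # image -> first "word"
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat, captions):
        # img_feat: (batch, img_dim); captions: (batch, T) token ids
        img_tok = self.img_embed(img_feat).unsqueeze(1)  # (batch, 1, embed_dim)
        words = self.word_embed(captions)                # (batch, T, embed_dim)
        inputs = torch.cat([img_tok, words], dim=1)      # image seen once, at t=0
        hidden, _ = self.lstm(inputs)
        return self.out(hidden[:, 1:])                   # logits for each caption position

# toy usage
dec = ImageFirstDecoder(vocab_size=10000, img_dim=512)
logits = dec(torch.randn(2, 512), torch.randint(0, 10000, (2, 7)))
print(logits.shape)  # torch.Size([2, 7, 10000])
```

A decoder that sees the image at every time step would instead combine the image embedding with each word embedding before the recurrent update; the attention-based model of this paper goes further and recomputes a location-weighted context vector at every step.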
Funding
  • We acknowledge the support of the following organizations for research funding and computing support: the Nuance Foundation, NSERC, Samsung, Calcul Quebec, Compute Canada, the Canada Research Chairs and CIFAR
References
  • Ba, Jimmy Lei, Mnih, Volodymyr, and Kavukcuoglu, Koray. Multiple object recognition with visual attention. arXiv:1412.7755, December 2014.
  • Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473, September 2014.
  • Baldi, Pierre and Sadowski, Peter. The dropout learning algorithm. Artificial Intelligence, 210:78–122, 2014.
  • Bastien, Frederic, Lamblin, Pascal, Pascanu, Razvan, Bergstra, James, Goodfellow, Ian, Bergeron, Arnaud, Bouchard, Nicolas, Warde-Farley, David, and Bengio, Yoshua. Theano: new features and speed improvements. Submitted to the Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.
  • Bergstra, James, Breuleux, Olivier, Bastien, Frederic, Lamblin, Pascal, Pascanu, Razvan, Desjardins, Guillaume, Turian, Joseph, Warde-Farley, David, and Bengio, Yoshua. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), 2010.
  • Chen, Xinlei and Zitnick, C Lawrence. Learning a recurrent visual representation for image caption generation. arXiv:1411.5654, 2014.
  • Cho, Kyunghyun, van Merrienboer, Bart, Gulcehre, Caglar, Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, October 2014.
  • Corbetta, Maurizio and Shulman, Gordon L. Control of goal-directed and stimulus-driven attention in the brain. Nature Reviews Neuroscience, 3(3):201–215, 2002.
  • Denil, Misha, Bazzani, Loris, Larochelle, Hugo, and de Freitas, Nando. Learning where to attend with deep architectures for image tracking. Neural Computation, 2012.
  • Denkowski, Michael and Lavie, Alon. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation, 2014.
  • Donahue, Jeff, Hendricks, Lisa Anne, Guadarrama, Sergio, Rohrbach, Marcus, Venugopalan, Subhashini, Saenko, Kate, and Darrell, Trevor. Long-term recurrent convolutional networks for visual recognition and description. arXiv:1411.4389v2, November 2014.
  • Elliott, Desmond and Keller, Frank. Image description using visual dependency representations. In EMNLP, 2013.
  • Fang, Hao, Gupta, Saurabh, Iandola, Forrest, Srivastava, Rupesh, Deng, Li, Dollar, Piotr, Gao, Jianfeng, He, Xiaodong, Mitchell, Margaret, Platt, John, et al. From captions to visual concepts and back. arXiv:1411.4952, November 2014.
  • Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • Hodosh, Micah, Young, Peter, and Hockenmaier, Julia. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, pp. 853–899, 2013.
  • Karpathy, Andrej and Li, Fei-Fei. Deep visual-semantic alignments for generating image descriptions. arXiv:1412.2306, December 2014.
  • Kingma, Diederik P. and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv:1412.6980, December 2014.
  • Kiros, Ryan, Salakhutdinov, Ruslan, and Zemel, Richard. Multimodal neural language models. In International Conference on Machine Learning, pp. 595–603, 2014a.
  • Kiros, Ryan, Salakhutdinov, Ruslan, and Zemel, Richard. Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539, November 2014b.
  • Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • Kulkarni, Girish, Premraj, Visruth, Ordonez, Vicente, Dhar, Sagnik, Li, Siming, Choi, Yejin, Berg, Alexander C, and Berg, Tamara L. Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on PAMI, 35(12):2891–2903, 2013.
  • Kuznetsova, Polina, Ordonez, Vicente, Berg, Alexander C, Berg, Tamara L, and Choi, Yejin. Collective generation of natural image descriptions. In Association for Computational Linguistics. ACL, 2012.
  • Kuznetsova, Polina, Ordonez, Vicente, Berg, Tamara L, and Choi, Yejin. Treetalk: Composition and compression of trees for image descriptions. TACL, 2(10):351–362, 2014.
  • Larochelle, Hugo and Hinton, Geoffrey E. Learning to combine foveal glimpses with a third-order Boltzmann machine. In NIPS, pp. 1243–1251, 2010.
  • Li, Siming, Kulkarni, Girish, Berg, Tamara L, Berg, Alexander C, and Choi, Yejin. Composing simple image descriptions using web-scale n-grams. In Computational Natural Language Learning. ACL, 2011.
  • Lin, Tsung-Yi, Maire, Michael, Belongie, Serge, Hays, James, Perona, Pietro, Ramanan, Deva, Dollar, Piotr, and Zitnick, C Lawrence. Microsoft COCO: Common objects in context. In ECCV, pp. 740–755, 2014.
  • Mao, Junhua, Xu, Wei, Yang, Yi, Wang, Jiang, and Yuille, Alan. Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv:1412.6632, December 2014.
  • Mitchell, Margaret, Han, Xufeng, Dodge, Jesse, Mensch, Alyssa, Goyal, Amit, Berg, Alex, Yamaguchi, Kota, Berg, Tamara, Stratos, Karl, and Daume III, Hal. Midge: Generating image descriptions from computer vision detections. In European Chapter of the Association for Computational Linguistics, pp. 747–756. ACL, 2012.
  • Mnih, Volodymyr, Heess, Nicolas, Graves, Alex, and Kavukcuoglu, Koray. Recurrent models of visual attention. In NIPS, 2014.
  • Pascanu, Razvan, Gulcehre, Caglar, Cho, Kyunghyun, and Bengio, Yoshua. How to construct deep recurrent neural networks. In ICLR, 2014.
  • Rensink, Ronald A. The dynamic representation of scenes. Visual Cognition, 7(1-3):17–42, 2000.
  • Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., and Fei-Fei, Li. ImageNet Large Scale Visual Recognition Challenge, 2014.
  • Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
  • Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P. Practical Bayesian optimization of machine learning algorithms. In NIPS, pp. 2951–2959, 2012.
  • Snoek, Jasper, Swersky, Kevin, Zemel, Richard S, and Adams, Ryan P. Input warping for Bayesian optimization of non-stationary functions. arXiv:1402.0929, 2014.
  • Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15, 2014.
  • Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence learning with neural networks. In NIPS, pp. 3104–3112, 2014.
  • Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. arXiv:1409.4842, 2014.
  • Tang, Yichuan, Srivastava, Nitish, and Salakhutdinov, Ruslan R. Learning generative models with visual attention. In NIPS, pp. 1808–1816, 2014.
  • Tieleman, Tijmen and Hinton, Geoffrey. Lecture 6.5 - RMSProp. Technical report, 2012.
  • Vinyals, Oriol, Toshev, Alexander, Bengio, Samy, and Erhan, Dumitru. Show and tell: A neural image caption generator. arXiv:1411.4555, November 2014.
  • Weaver, Lex and Tao, Nigel. The optimal reward baseline for gradient-based reinforcement learning. In Proc. UAI 2001, pp. 538–545, 2001.
  • Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
  • Yang, Yezhou, Teo, Ching Lik, Daume III, Hal, and Aloimonos, Yiannis. Corpus-guided sentence generation of natural images. In EMNLP, pp. 444–454. ACL, 2011.
  • Young, Peter, Lai, Alice, Hodosh, Micah, and Hockenmaier, Julia. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2:67–78, 2014.
  • Zaremba, Wojciech, Sutskever, Ilya, and Vinyals, Oriol. Recurrent neural network regularization. arXiv:1409.2329, September 2014.