Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
CoRR, 2014.
Abstract:
Inspired by recent advances in multimodal learning and machine translation, we introduce an encoder-decoder pipeline that learns (a): a multimodal joint embedding space with images and text and (b): a novel language model for decoding distributed representations from our space. Our pipeline effectively unifies joint image-text embedding…
Introduction
- Generating descriptions for images has long been regarded as a challenging perception task integrating vision, learning and language understanding.
- One needs to correctly recognize what appears in images and incorporate knowledge of spatial relationships and interactions between objects.
- Even with this information, one needs to generate a description that is relevant and grammatically correct.
- Systems that can describe images well could, in principle, be fine-tuned to answer questions about images
Highlights
- Generating descriptions for images has long been regarded as a challenging perception task integrating vision, learning and language understanding
- With the recent advances made in deep neural networks, significant breakthroughs have been achieved on tasks such as object recognition and detection in only a short time
- We show that, using a linear sentence encoder, linguistic regularities [12] carry over to multimodal vector spaces (a toy sketch of this vector arithmetic follows this list)
- We review log-bilinear neural language models [29], multiplicative neural language models [30] and introduce our structure-content neural language model
- We describe the structure-content neural language model
- Integrating object detections into our framework should almost surely improve performance as well as allow for interpretable retrievals, as in the case of DeFrag
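A minimal sketch of the multimodal vector arithmetic referred to in the highlights, assuming images and words have already been embedded into a shared, unit-normalized space; the arrays, the dimensionality, and the blue/red car example are hypothetical stand-ins rather than the paper's data or code.

```python
import numpy as np

def nearest_images(query_vec, image_embs, k=5):
    """Indices of the k images whose unit-norm embeddings are closest
    to query_vec by cosine similarity."""
    query_vec = query_vec / np.linalg.norm(query_vec)
    sims = image_embs @ query_vec            # one matrix-vector product
    return np.argsort(-sims)[:k]

# Hypothetical pre-computed embeddings in the joint space (unit-normalized rows).
image_embs = np.random.randn(1000, 300)
image_embs /= np.linalg.norm(image_embs, axis=1, keepdims=True)
word_embs = {"blue": np.random.randn(300), "red": np.random.randn(300)}

# "image of a blue car" - "blue" + "red" should land near images of red cars.
query = image_embs[42] - word_embs["blue"] + word_embs["red"]
print(nearest_images(query, image_embs))
```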
Methods
- 3.1 Image-sentence ranking
The authors' main quantitative result is to establish the effectiveness of using an LSTM sentence encoder for ranking images and descriptions.
- The deep visual-semantic embedding model [5] was proposed as a way of performing zero-shot object recognition and was used as a baseline by [15].
- In this model, sentences are represented as the mean of their word embeddings, and the objective function optimized matches the authors' pairwise ranking objective (sketched below)
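A minimal numpy sketch of a pairwise ranking objective of this general form, using in-batch contrastive examples and a margin; the margin value and the random embeddings are illustrative assumptions, not the paper's hyperparameters.

```python
import numpy as np

def ranking_loss(im, sent, margin=0.2):
    """Bidirectional hinge ranking loss for a batch of matched image/sentence
    embeddings (rows unit-normalized, matched by index)."""
    scores = im @ sent.T                     # cosine similarities, batch x batch
    diag = np.diag(scores)                   # scores of the true pairs
    # Penalize contrastive sentences ranked above the true one, and vice versa.
    cost_s = np.maximum(0, margin - diag[:, None] + scores)   # image -> sentence
    cost_im = np.maximum(0, margin - diag[None, :] + scores)  # sentence -> image
    np.fill_diagonal(cost_s, 0)
    np.fill_diagonal(cost_im, 0)
    return cost_s.sum() + cost_im.sum()

# Toy usage with random unit-norm embeddings standing in for encoder outputs.
im = np.random.randn(8, 300); im /= np.linalg.norm(im, axis=1, keepdims=True)
sent = np.random.randn(8, 300); sent /= np.linalg.norm(sent, axis=1, keepdims=True)
print(ranking_loss(im, sent))
```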
Results
- For some metrics the authors outperform or match existing results, while on others the m-RNN outperforms their model.
- The m-RNN does not learn an explicit embedding between images and sentences and relies on perplexity as a means of retrieval.
- The authors' model (with OxfordNet features) learns an explicit embedding space, which gives a significant speed advantage over perplexity-based retrieval methods, since retrieval reduces to a single matrix multiplication between the stored embedding vectors of the dataset and the query vector (a minimal sketch follows this list).
- Explicit embedding methods are much better suited for scaling to large datasets
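A small sketch of what that single-matrix-multiply retrieval looks like, assuming the dataset's description embeddings have been pre-computed and unit-normalized; the sizes and the stored array below are hypothetical.

```python
import numpy as np

# Hypothetical pre-computed unit-norm embeddings for every description in the dataset.
stored = np.random.randn(5000, 300)
stored /= np.linalg.norm(stored, axis=1, keepdims=True)

def retrieve(query_emb, k=10):
    """Rank all stored descriptions against one image query with a single
    matrix-vector product; no per-candidate decoding or perplexity is needed."""
    q = query_emb / np.linalg.norm(query_emb)
    sims = stored @ q
    return np.argsort(-sims)[:k]

print(retrieve(np.random.randn(300)))
```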
Conclusion
- When generating a description, it is often the case that only a small region of the image is relevant at any given time.
- The authors plan on experimenting with LSTM decoders as well as deep and bidirectional LSTM encoders
Summary
Introduction:
Generating descriptions for images has long been regarded as a challenging perception task integrating vision, learning and language understanding.
- One needs to correctly recognize what appears in images and incorporate knowledge of spatial relationships and interactions between objects.
- Even with this information, one needs to generate a description that is relevant and grammatically correct.
- Systems that can describe images well could, in principle, be fine-tuned to answer questions about images
Objectives:
The authors' goal is to translate an image into a description. Given an embedding u, the goal is to model the distribution P(wn | w1:n−1, tn:n+k) of the next word from the previous word context w1:n−1 and the forward structure context tn:n+k, where k is the forward context size.
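A toy sketch of a log-bilinear next-word distribution conditioned on both contexts, in the spirit of the model described here; the parameterization below (per-position context matrices, a softmax over inner products) is a generic assumption for illustration, not the paper's exact structure-content formulation.

```python
import numpy as np

def next_word_dist(word_ctx_ids, struct_ctx_ids, R, T, C, D, b):
    """P(wn | w1:n-1, tn:n+k) under a simple log-bilinear model: the predicted
    representation is a position-weighted sum of context word vectors and
    forward structure vectors; scores are inner products with each vocabulary
    vector plus a bias, turned into probabilities with a softmax."""
    r_hat = sum(C[i] @ R[w] for i, w in enumerate(word_ctx_ids))
    r_hat = r_hat + sum(D[j] @ T[t] for j, t in enumerate(struct_ctx_ids))
    scores = R @ r_hat + b
    scores -= scores.max()                   # numerical stability
    p = np.exp(scores)
    return p / p.sum()

# Toy shapes: 50-word vocabulary, 10 structure tags, 16-d embeddings.
V, S, d = 50, 10, 16
R, T, b = np.random.randn(V, d), np.random.randn(S, d), np.zeros(V)
C, D = np.random.randn(3, d, d), np.random.randn(2, d, d)
p = next_word_dist([4, 7, 9], [1, 3], R, T, C, D, b)
print(p.shape, p.sum())                      # (50,) 1.0
```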
Methods:
3.1 Image-sentence ranking
The authors' main quantitative result is to establish the effectiveness of using an LSTM sentence encoder for ranking images and descriptions.
- The deep visual-semantic embedding model [5] was proposed as a way of performing zero-shot object recognition and was used as a baseline by [15].
- In this model, sentences are represented as the mean of their word embeddings, and the objective function optimized matches the authors'
Results:
For some metrics the authors outperform or match existing results, while on others the m-RNN outperforms their model.
- The m-RNN does not learn an explicit embedding between images and sentences and relies on perplexity as a means of retrieval.
- The authors' model (with OxfordNet features) learns an explicit embedding space, which gives a significant speed advantage over perplexity-based retrieval methods, since retrieval reduces to a single matrix multiplication between the stored embedding vectors of the dataset and the query vector.
- Explicit embedding methods are much better suited for scaling to large datasets
Conclusion:
When generating a description, it is often the case that only a small region of the image is relevant at any given time.
- The authors plan on experimenting with LSTM decoders as well as deep and bidirectional LSTM encoders
Tables
- Table 1: Flickr8K experiments. R@K is Recall@K (high is good). Med r is the median rank (low is good). Best results overall are bold, while best results without OxfordNet features are underlined. A † in front of a method indicates that object detections were used along with single-frame features
- Table 2: Flickr30K experiments. R@K is Recall@K (high is good). Med r is the median rank (low is good). Best results overall are bold, while best results without OxfordNet features are underlined. A † in front of a method indicates that object detections were used along with single-frame features
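The R@K and Med r numbers reported in these tables are computed from the rank of the ground-truth item for each query; a minimal sketch with made-up ranks is below.

```python
import numpy as np

def recall_at_k_and_medr(ranks, ks=(1, 5, 10)):
    """Given the 1-based rank of the correct item for each query, return
    Recall@K (fraction of queries whose correct item is ranked in the top K;
    higher is better) and the median rank (lower is better)."""
    ranks = np.asarray(ranks)
    recalls = {k: float(np.mean(ranks <= k)) for k in ks}
    return recalls, float(np.median(ranks))

# Example with hypothetical ranks of the ground-truth caption for five queries.
print(recall_at_k_and_medr([1, 3, 12, 2, 40]))
```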
References
- Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 1997.
- Ryan Kiros, Richard S Zemel, and Ruslan Salakhutdinov. Multimodal neural language models. ICML, 2014.
- Micah Hodosh, Peter Young, and Julia Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR, 2013.
- Jason Weston, Samy Bengio, and Nicolas Usunier. Large scale image annotation: learning to rank with joint word-image embeddings. Machine learning, 2010.
- Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeffrey Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. DeViSE: A deep visual-semantic embedding model. NIPS, 2013.
- Richard Socher, Q Le, C Manning, and A Ng. Grounded compositional semantics for finding and describing images with sentences. In TACL, 2014.
- Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan L Yuille. Explain images with multimodal recurrent neural networks. arXiv preprint arXiv:1410.1090, 2014.
- Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. In EMNLP, 2013.
- Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. EMNLP, 2014.
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
- Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. NIPS, 2014.
- Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In NAACL-HLT, 2013.
- Karl Moritz Hermann and Phil Blunsom. Multilingual distributed representations without word alignment. ICLR, 2014.
- Karl Moritz Hermann and Phil Blunsom. Multilingual models for compositional distributional semantics. In ACL, 2014.
- Andrej Karpathy, Armand Joulin, and Li Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. NIPS, 2014.
- Nitish Srivastava and Ruslan Salakhutdinov. Multimodal learning with deep boltzmann machines. In NIPS, 2012.
- Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Ng. Multimodal deep learning. In ICML, 2011.
- Yangqing Jia, Mathieu Salzmann, and Trevor Darrell. Learning cross-modality similarity for multinomial data. In ICCV, 2011.
- Yunchao Gong, Liwei Wang, Micah Hodosh, Julia Hockenmaier, and Svetlana Lazebnik. Improving image-sentence embeddings using large weakly annotated photo collections. In ECCV. 2014.
- Phil Blunsom, Nando de Freitas, Edward Grefenstette, Karl Moritz Hermann, et al. A deep architecture for semantic parsing. In ACL 2014 Workshop on Semantic Parsing, 2014.
- Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C Berg, and Tamara L Berg. Baby talk: Understanding and generating simple image descriptions. In CVPR, 2011.
- Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. Every picture tells a story: Generating sentences from images. In ECCV. 2010.
- Siming Li, Girish Kulkarni, Tamara L Berg, Alexander C Berg, and Yejin Choi. Composing simple image descriptions using web-scale n-grams. In CONLL, 2011.
- Yezhou Yang, Ching Lik Teo, Hal Daumé III, and Yiannis Aloimonos. Corpus-guided sentence generation of natural images. In EMNLP, 2011.
- Margaret Mitchell, Xufeng Han, Jesse Dodge, Alyssa Mensch, Amit Goyal, Alex Berg, Kota Yamaguchi, Tamara Berg, Karl Stratos, and Hal Daumé III. Midge: Generating image descriptions from computer vision detections. In EACL, 2012.
- Polina Kuznetsova, Vicente Ordonez, Alexander C Berg, Tamara L Berg, and Yejin Choi. Collective generation of natural image descriptions. ACL, 2012.
- Polina Kuznetsova, Vicente Ordonez, Tamara L. Berg, and Yejin Choi. Treetalk: Composition and compression of trees for image descriptions. TACL, 2014.
- Marcus Rohrbach, Wei Qiu, Ivan Titov, Stefan Thater, Manfred Pinkal, and Bernt Schiele. Translating video content to natural language descriptions. In ICCV, 2013.
- Andriy Mnih and Geoffrey Hinton. Three new graphical models for statistical language modelling. In ICML, pages 641–648, 2007.
- Ryan Kiros, Richard S Zemel, and Ruslan Salakhutdinov. A multiplicative model for learning distributed text-based attribute representations. NIPS, 2014.
- Alex Graves, Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst Bunke, and Jürgen Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. TPAMI, 2009.
- Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
- Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. Hybrid speech recognition with deep bidirectional lstm. In IEEE Workshop on ASRU, 2013.
- Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.
- Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
- Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
- Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
- Roland Memisevic and Geoffrey Hinton. Unsupervised learning of image transformations. In CVPR, pages 1–8, 2007.
- Alex Krizhevsky, Geoffrey E Hinton, et al. Factored 3-way restricted boltzmann machines for modeling natural images. In AISTATS, pages 621–628, 2010.
- Vicente Ordonez, Girish Kulkarni, and Tamara L Berg. Im2text: Describing images using 1 million captioned photographs. In NIPS, 2011.
- Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. Fast and robust neural network joint models for statistical machine translation. ACL, 2014.
- Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2014.
- Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR, 2014.
- Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. JMLR, 2003.
- Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. arXiv preprint arXiv:1405.0312, 2014.