Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models

CoRR, 2014.


Abstract:

Inspired by recent advances in multimodal learning and machine translation, we introduce an encoder-decoder pipeline that learns (a): a multimodal joint embedding space with images and text and (b): a novel language model for decoding distributed representations from our space. Our pipeline effectively unifies joint image-text embedding models with multimodal neural language models.

Introduction
  • Generating descriptions for images has long been regarded as a challenging perception task integrating vision, learning and language understanding.
  • One needs to correctly recognize what appears in images and incorporate knowledge of spatial relationships and interactions between objects.
  • Even with this information, one needs to generate a description that is relevant and grammatically correct.
  • Systems that can describe images well could, in principle, be fine-tuned to answer questions about images.
Highlights
  • Generating descriptions for images has long been regarded as a challenging perception task integrating vision, learning and language understanding
  • With recent advances in deep neural networks, significant breakthroughs have been made on tasks such as object recognition and detection in only a short time
  • We show that, with a linear sentence encoder, linguistic regularities [12] carry over to multimodal vector spaces (see the sketch after this list)
  • We review log-bilinear neural language models [29], multiplicative neural language models [30] and introduce our structure-content neural language model
  • We describe the structure-content neural language model
  • Integrating object detections into our framework should almost surely improve performance as well as allow for interpretable retrievals, as in the case of DeFrag
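
A minimal sketch of the multimodal vector-space arithmetic mentioned above, in NumPy. All names and embeddings here are hypothetical placeholders (random vectors, not the authors' code); the point is only to show the operation that, in a trained joint space, would move an image embedding toward a semantically edited image.

    import numpy as np

    # Hypothetical placeholders: word vectors and image embeddings assumed to
    # live in the same learned joint space (random here purely for illustration).
    dim = 1024
    word_vec = {w: np.random.randn(dim) for w in ["blue", "red"]}
    blue_car_image = np.random.randn(dim)   # embedding of an image of a blue car
    red_car_image = np.random.randn(dim)    # embedding of an image of a red car

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # In a trained space, "blue car image" - "blue" + "red" should score higher
    # against images of red cars than the original blue-car embedding does.
    analogy = blue_car_image - word_vec["blue"] + word_vec["red"]
    print(cosine(analogy, red_car_image), cosine(blue_car_image, red_car_image))
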
Methods
  • 3.1 Image-sentence ranking

    The authors' main quantitative result is to establish the effectiveness of using an LSTM sentence encoder for ranking images and descriptions.
  • The deep visual-semantic embedding model [5] was proposed as a way of performing zero-shot object recognition and was used as a baseline by [15].
  • In this model, sentences are represented as the mean of their word embeddings, and the objective function being optimized matches ours (a pairwise ranking objective; a minimal sketch follows this list)
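
A rough NumPy illustration of such a pairwise ranking objective. The cosine scoring function, the margin value, and all variable names are assumptions for illustration, not taken from the paper: matching image-sentence pairs are pushed to score higher than contrastive pairs by at least the margin.

    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def pairwise_ranking_loss(img, sent, neg_sents, neg_imgs, margin=0.2):
        """Margin-based ranking loss for one matching image-sentence pair.

        img, sent: embeddings of a matching image and sentence.
        neg_sents, neg_imgs: contrastive (non-matching) embeddings.
        """
        pos = cosine(img, sent)
        loss = 0.0
        for ns in neg_sents:                 # rank the true sentence above others
            loss += max(0.0, margin - pos + cosine(img, ns))
        for ni in neg_imgs:                  # rank the true image above others
            loss += max(0.0, margin - pos + cosine(sent, ni))
        return loss

    # Example with random placeholder embeddings.
    d = 1024
    loss = pairwise_ranking_loss(np.random.randn(d), np.random.randn(d),
                                 [np.random.randn(d) for _ in range(5)],
                                 [np.random.randn(d) for _ in range(5)])
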
Results
  • On some metrics the authors outperform or match existing results, while on others the m-RNN outperforms their model.
  • The m-RNN does not learn an explicit embedding between images and sentences and relies on perplexity as a means of retrieval.
  • The authors' model (OxfordNet) learns explicit embedding spaces, which gives it a significant speed advantage over perplexity-based retrieval methods, since retrieval is done with a single matrix multiplication of the stored embedding vectors from the dataset with the query vector (see the sketch after this list).
  • Explicit embedding methods are much better suited for scaling to large datasets
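
A minimal sketch of that retrieval step in NumPy (the embedding matrix and all names are hypothetical placeholders, not the authors' code): with every dataset embedding L2-normalized and stacked into one matrix, ranking candidates for a query reduces to a single matrix-vector product of cosine similarities followed by a sort.

    import numpy as np

    # Hypothetical pre-computed image embeddings, one row per dataset image,
    # already projected into the joint space and L2-normalized.
    image_embeddings = np.random.randn(10000, 1024)
    image_embeddings /= np.linalg.norm(image_embeddings, axis=1, keepdims=True)

    def rank_images(query_embedding, k=10):
        """Return indices of the k best-matching images for a sentence query."""
        q = query_embedding / np.linalg.norm(query_embedding)
        scores = image_embeddings @ q        # the single matrix multiply
        return np.argsort(-scores)[:k]       # highest cosine similarity first

    # Example: rank images for an (assumed) LSTM-encoded sentence vector.
    top_k = rank_images(np.random.randn(1024))
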
Conclusion
  • When generating a description, it is often the case that only a small region of the image is relevant at any given time.
  • The authors plan on experimenting with LSTM decoders as well as deep and bidirectional LSTM encoders
Summary
  • Objectives:

    The authors' goal is to translate an image into a description. Given an embedding u, the goal is to model the distribution P(wn | w1:n−1, tn:n+k, u) of the next word wn given the previous word context w1:n−1 and the forward structure context tn:n+k, where k is the forward context size (a reconstruction of this factorization is sketched below).
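
A hedged reconstruction of this objective as a worked equation (inferred from the description above rather than copied from the paper): the description is generated word by word, with each word conditioned on the preceding words, the forward structure context, and the image-derived embedding u.

    P(w_{1:N} \mid t_{1:N}, u) = \prod_{n=1}^{N} P(w_n \mid w_{1:n-1}, \, t_{n:n+k}, \, u)

Here t denotes the sentence-structure variables and k is the number of future structure tokens that condition each prediction.
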
Tables
  • Table 1: Flickr8K experiments. R@K is Recall@K (higher is better). Med r is the median rank (lower is better). Best results overall are bold, while best results without OxfordNet features are underlined. A † in front of the method indicates that object detections were used along with single-frame features.
  • Table 2: Flickr30K experiments. R@K is Recall@K (higher is better). Med r is the median rank (lower is better). Best results overall are bold, while best results without OxfordNet features are underlined. A † in front of the method indicates that object detections were used along with single-frame features.
References
  • Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997.
  • Ryan Kiros, Richard S. Zemel, and Ruslan Salakhutdinov. Multimodal neural language models. ICML, 2014.
  • Micah Hodosh, Peter Young, and Julia Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR, 2013.
  • Jason Weston, Samy Bengio, and Nicolas Usunier. Large scale image annotation: Learning to rank with joint word-image embeddings. Machine Learning, 2010.
  • Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeffrey Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. DeViSE: A deep visual-semantic embedding model. NIPS, 2013.
  • Richard Socher, Q. Le, C. Manning, and A. Ng. Grounded compositional semantics for finding and describing images with sentences. TACL, 2014.
  • Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan L. Yuille. Explain images with multimodal recurrent neural networks. arXiv preprint arXiv:1410.1090, 2014.
  • Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. EMNLP, 2013.
  • Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. EMNLP, 2014.
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. NIPS, 2014.
  • Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. NAACL-HLT, 2013.
  • Karl Moritz Hermann and Phil Blunsom. Multilingual distributed representations without word alignment. ICLR, 2014.
  • Karl Moritz Hermann and Phil Blunsom. Multilingual models for compositional distributional semantics. ACL, 2014.
  • Andrej Karpathy, Armand Joulin, and Li Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. NIPS, 2014.
  • Nitish Srivastava and Ruslan Salakhutdinov. Multimodal learning with deep Boltzmann machines. NIPS, 2012.
  • Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Ng. Multimodal deep learning. ICML, 2011.
  • Yangqing Jia, Mathieu Salzmann, and Trevor Darrell. Learning cross-modality similarity for multinomial data. ICCV, 2011.
  • Yunchao Gong, Liwei Wang, Micah Hodosh, Julia Hockenmaier, and Svetlana Lazebnik. Improving image-sentence embeddings using large weakly annotated photo collections. ECCV, 2014.
  • Phil Blunsom, Nando de Freitas, Edward Grefenstette, Karl Moritz Hermann, et al. A deep architecture for semantic parsing. ACL 2014 Workshop on Semantic Parsing, 2014.
  • Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. Baby talk: Understanding and generating simple image descriptions. CVPR, 2011.
  • Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. Every picture tells a story: Generating sentences from images. ECCV, 2010.
  • Siming Li, Girish Kulkarni, Tamara L. Berg, Alexander C. Berg, and Yejin Choi. Composing simple image descriptions using web-scale n-grams. CoNLL, 2011.
  • Yezhou Yang, Ching Lik Teo, Hal Daumé III, and Yiannis Aloimonos. Corpus-guided sentence generation of natural images. EMNLP, 2011.
  • Margaret Mitchell, Xufeng Han, Jesse Dodge, Alyssa Mensch, Amit Goyal, Alex Berg, Kota Yamaguchi, Tamara Berg, Karl Stratos, and Hal Daumé III. Midge: Generating image descriptions from computer vision detections. EACL, 2012.
  • Polina Kuznetsova, Vicente Ordonez, Alexander C. Berg, Tamara L. Berg, and Yejin Choi. Collective generation of natural image descriptions. ACL, 2012.
  • Polina Kuznetsova, Vicente Ordonez, Tamara L. Berg, and Yejin Choi. TreeTalk: Composition and compression of trees for image descriptions. TACL, 2014.
  • Marcus Rohrbach, Wei Qiu, Ivan Titov, Stefan Thater, Manfred Pinkal, and Bernt Schiele. Translating video content to natural language descriptions. ICCV, 2013.
  • Andriy Mnih and Geoffrey Hinton. Three new graphical models for statistical language modelling. ICML, pages 641–648, 2007.
  • Ryan Kiros, Richard S. Zemel, and Ruslan Salakhutdinov. A multiplicative model for learning distributed text-based attribute representations. NIPS, 2014.
  • Alex Graves, Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst Bunke, and Jürgen Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. TPAMI, 2009.
  • Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
  • Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. Hybrid speech recognition with deep bidirectional LSTM. IEEE Workshop on ASRU, 2013.
  • Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.
  • Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
  • Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. ImageNet classification with deep convolutional neural networks. NIPS, 2012.
  • Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
  • Roland Memisevic and Geoffrey Hinton. Unsupervised learning of image transformations. CVPR, pages 1–8, 2007.
  • Alex Krizhevsky, Geoffrey E. Hinton, et al. Factored 3-way restricted Boltzmann machines for modeling natural images. AISTATS, pages 621–628, 2010.
  • Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. Im2Text: Describing images using 1 million captioned photographs. NIPS, 2011.
  • Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. Fast and robust neural network joint models for statistical machine translation. ACL, 2014.
  • Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2014.
  • Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR, 2014.
  • Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. JMLR, 2003.
  • Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. arXiv preprint arXiv:1405.0312, 2014.