Order-Embeddings of Images and Language

International Conference on Learning Representations (ICLR), 2016.


Abstract:

Hypernymy, textual entailment, and image captioning can be seen as special cases of a single visual-semantic hierarchy over words, sentences, and images. In this paper we advocate for explicitly modeling the partial order structure of this hierarchy. Towards this goal, we introduce a general method for learning ordered representations, ...

Introduction
  • Computer vision and natural language processing are becoming increasingly intertwined.
  • Recent methods for natural language processing, such as that of Young et al (2014), learn the semantics of language by grounding it in the visual world.
  • The relationship between a caption and the image it describes is akin to the hypernym relation between words and to textual entailment among phrases: captions are abstractions of images.
  • All three relations can be seen as special cases of a partial order over images and language, illustrated in Figure 1, which the authors refer to as the visual-semantic hierarchy.
  • The authors' goal in this work is to learn representations that respect this partial order structure
Highlights
  • Computer vision and natural language processing are becoming increasingly intertwined
  • All three relations can be seen as special cases of a partial order over images and language, illustrated in Figure 1, which we refer to as the visual-semantic hierarchy
  • We introduced a simple method to encode order into learned distributed representations, which allows us to explicitly model the partial order structure of the visual-semantic hierarchy
  • Our method can be integrated into existing relational learning methods, as we demonstrated on three challenging tasks involving computer vision and natural language processing
  • Previous approaches, including Frome et al (2013) and Norouzi et al (2014), have embedded words and images into a shared semantic space with symmetric similarity, which our experiments suggest is a poor fit for the partial order structure of WordNet
  • We show that the resulting representations improve performance over current approaches for hypernym prediction and image-caption retrieval
  • Order-embeddings may enable learning the entire semantic hierarchy in a single model which jointly reasons about hypernymy, entailment, and the relationship between perception and language, unifying what have been until now almost independent lines of work
Methods
  • The caption-image retrieval task has become a standard evaluation of joint models of vision and language (Hodosh et al, 2013; Lin et al, 2014a).
  • While Karpathy & Li (2015) and Plummer et al (2015) model a finer-grained alignment between regions in the image and segments of the caption, the similarity they use is still symmetric.
  • An alternative is to learn an unconstrained binary relation, either with a neural language model conditioned on the image (Vinyals et al, 2015; Mao et al, 2015) or using a multimodal CNN (Ma et al, 2015)
Results
  • Since the setup is novel, there are no published numbers to compare to. The authors compare three variants of the model to two baselines, with results shown in Table 1.

    The transitive closure baseline involves no learning; it classifies hypernym pairs as positive if they are in the transitive closure of the union of edges in the training and validation sets (a minimal sketch of this baseline appears after this list).
  • The word2gauss baseline evaluates the approach of Vilnis & McCallum (2015), which represents words as Gaussian densities rather than points in the embedding space.
  • This allows a natural representation of hierarchies using the KL divergence.
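    The transitive closure baseline referenced above needs no learning at all. Below is a minimal Python sketch of the idea; the toy edge list and the graph representation are illustrative, and only the reachability criterion itself comes from the text.

```python
from collections import defaultdict, deque

def in_transitive_closure(edges, query):
    """Predict a pair (u, v) as a positive hypernym pair iff v is reachable
    from u along the known (training + validation) hypernym edges."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
    start, target = query
    seen, frontier = {start}, deque([start])
    while frontier:
        node = frontier.popleft()
        for nxt in graph[node]:
            if nxt == target:
                return True
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False

# Hypothetical mini-graph: dog -> canine -> animal
edges = [("dog", "canine"), ("canine", "animal")]
print(in_transitive_closure(edges, ("dog", "animal")))   # True: implied transitively
print(in_transitive_closure(edges, ("animal", "dog")))   # False
```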
Conclusion
  • The authors introduced a simple method to encode order into learned distributed representations, which allows them to explicitly model the partial order structure of the visual-semantic hierarchy.
  • The authors' method can be integrated into existing relational learning methods, as the authors demonstrated on three challenging tasks involving computer vision and natural language processing.
  • On two of these tasks, hypernym prediction and caption-image retrieval, the methods outperform all previous work.
  • Order-embeddings may enable learning the entire semantic hierarchy in a single model which jointly reasons about hypernymy, entailment, and the relationship between perception and language, unifying what have been until now almost independent lines of work
Summary
  • Objectives: The goal is to predict whether an unseen pair (u, v) is ordered. The authors aim to find an approximate order-embedding: a mapping that violates the order-embedding condition, treated as a soft constraint, as little as possible (see the sketch below).
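    The order-violation penalty behind this objective fits in a few lines. Below is a minimal numpy sketch following the paper's max-margin formulation, where positive pairs are pushed toward zero violation and corrupted pairs above a margin; the toy vectors are illustrative, not the authors' code.

```python
import numpy as np

def order_violation(u, v):
    """Penalty for an ordered pair u <= v under the reversed product order on the
    nonnegative orthant: zero iff every coordinate of v is <= the matching coordinate of u."""
    return np.sum(np.maximum(0.0, v - u) ** 2)

def max_margin_loss(pos_pairs, neg_pairs, margin=1.0):
    """Soft constraint: true pairs should have zero violation, corrupted pairs
    should violate the order by at least the margin."""
    loss = sum(order_violation(u, v) for u, v in pos_pairs)
    loss += sum(max(0.0, margin - order_violation(u, v)) for u, v in neg_pairs)
    return loss

# Toy 2-d example: "dog" sits below "animal" in the hierarchy, but not below "car".
dog, animal, car = np.array([2.0, 1.5]), np.array([1.0, 0.5]), np.array([0.2, 3.0])
print(order_violation(dog, animal))  # 0.0  -> the pair satisfies the order
print(order_violation(dog, car))     # 2.25 -> the pair violates the order
print(max_margin_loss([(dog, animal)], [(dog, car)]))  # 0.0: the margin is already met
```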
Tables
  • Table1: Binary classification accuracy on 4000 withheld edges from WordNet
  • Table2: Results of caption-image retrieval evaluation on COCO. R@K is Recall@K, in %. Med r is median rank. Metrics for our models on 1k test images are averages over five 1000-image splits of the 5000-image test set, as in (Klein et al, 2015). Best results overall are in bold; best results using 1-crop VGG features are underlined
  • Table3: Test accuracy (%) on SNLI
Funding
  • The work was supported in part by an NSERC Graduate Scholarship
Study subjects and analysis
false hypernym pairs: 500
We learn a 50-dimensional nonnegative vector for each concept in WordNet using the max-margin objective (4) with margin α = 1, sampling 500 true and 500 false hypernym pairs in each batch. We train for 30-50 epochs using the Adam optimizer (Kingma & Ba, 2015) with learning rate 0.01 and early stopping on the validation set
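A hedged PyTorch sketch of a training loop matching this configuration (50-d nonnegative embeddings, margin 1, 500 true and 500 false pairs per batch, Adam at learning rate 0.01). The vocabulary size, the random pair sampling, and the use of absolute values to keep coordinates nonnegative are stand-ins for the authors' actual data pipeline and parameterization.

```python
import torch

num_concepts, dim, margin = 50000, 50, 1.0   # vocabulary size is a placeholder
emb = torch.nn.Embedding(num_concepts, dim)
optimizer = torch.optim.Adam(emb.parameters(), lr=0.01)

def violation(u, v):
    # E(u, v) = || max(0, v - u) ||^2, computed per pair in the batch
    return torch.clamp(v - u, min=0).pow(2).sum(dim=1)

def lookup(idx):
    # taking absolute values is one simple way to keep coordinates nonnegative
    return emb(idx).abs()

for step in range(100):                               # stand-in for 30-50 epochs of real batches
    pos = torch.randint(0, num_concepts, (500, 2))    # 500 "true" pairs (random placeholders)
    neg = torch.randint(0, num_concepts, (500, 2))    # 500 corrupted pairs
    loss = violation(lookup(pos[:, 0]), lookup(pos[:, 1])).sum() \
         + torch.clamp(margin - violation(lookup(neg[:, 0]), lookup(neg[:, 1])), min=0).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```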

random image-caption pairs: 128
To train the model, we use the standard pairwise ranking objective from Eq (5). We sample minibatches of 128 random image-caption pairs and draw all contrastive terms from the minibatch, giving 127 contrastive images for each caption and 127 contrastive captions for each image. We train for 15-30 epochs using the Adam optimizer with learning rate 0.001 and early stopping on the validation set
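A hedged sketch of such an in-batch pairwise ranking loss, assuming a batch of 128 caption and image embeddings. The dot-product score and the margin value here are placeholders; the paper scores pairs with the negative order-violation penalty instead.

```python
import torch

def in_batch_ranking_loss(cap, img, margin=0.05):
    """Pairwise ranking loss where every non-matching item in the minibatch acts as
    a contrastive example: 127 contrastive images per caption and vice versa."""
    scores = cap @ img.t()                     # (B, B); the diagonal holds matching pairs
    pos = scores.diag().view(-1, 1)
    cost_img = torch.clamp(margin + scores - pos, min=0)       # contrastive images per caption
    cost_cap = torch.clamp(margin + scores - pos.t(), min=0)   # contrastive captions per image
    off_diag = ~torch.eye(scores.size(0), dtype=torch.bool)
    return cost_img[off_diag].sum() + cost_cap[off_diag].sum()

cap = torch.randn(128, 1024)   # placeholder caption embeddings
img = torch.randn(128, 1024)   # placeholder image embeddings
print(in_batch_ranking_loss(cap, img))
```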

pairs with the biggest length difference: 100
With a symmetric similarity, a less detailed caption must still lie close to its image; order-embeddings don't have this problem: the less detailed caption can be embedded very far away from the image while remaining above it in the partial order. To evaluate this intuition, we use caption length as a proxy for level of detail and select, among pairs of co-referring captions in our validation set, the 100 pairs with the biggest length difference. For image retrieval with 1000 target images, the mean rank over captions in this set is 6.4 for order-embeddings and 9.7 for cosine similarity, a much bigger difference than over the entire dataset
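A small sketch of how such an evaluation subset could be selected, using word count as the length proxy. The mapping captions_by_image from each image to its co-referring captions is hypothetical, not the actual validation data.

```python
from itertools import combinations

def biggest_length_gaps(captions_by_image, k=100):
    """Among co-referring captions, return the k pairs with the largest word-count gap."""
    pairs = []
    for caps in captions_by_image.values():
        for a, b in combinations(caps, 2):
            pairs.append((abs(len(a.split()) - len(b.split())), a, b))
    pairs.sort(reverse=True)
    return [(a, b) for _, a, b in pairs[:k]]

demo = {"img_0": ["a dog", "a small brown dog runs across a wet park lawn"]}
print(biggest_length_gaps(demo, k=1))
```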

pairs: 570000
To evaluate order-embeddings on the natural language inference task, we use the recently proposed SNLI corpus (Bowman et al, 2015), which contains 570,000 pairs of sentences, each labeled with “entailment” if the inference is valid, “contradiction” if the two sentences contradict, or “neutral” if the inference is invalid but there is no contradiction. Our method only allows us to discriminate between entailment and non-entailment, so we merge the “contradiction” and “neutral” classes together to serve as our negative examples
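A minimal sketch of that label merge. The field names (gold_label, sentence1, sentence2) follow the public SNLI JSONL distribution, but the example record is made up.

```python
def to_binary_entailment(examples):
    """Map SNLI's three labels to entailment (1) vs. non-entailment (0)."""
    merged = []
    for ex in examples:
        if ex["gold_label"] == "-":          # skip items with no consensus label
            continue
        label = 1 if ex["gold_label"] == "entailment" else 0   # contradiction/neutral -> 0
        merged.append((ex["sentence1"], ex["sentence2"], label))
    return merged

demo = [{"gold_label": "neutral", "sentence1": "A man eats.", "sentence2": "A man eats pizza."}]
print(to_binary_entailment(demo))   # [('A man eats.', 'A man eats pizza.', 0)]
```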

sentence pairs: 128
Just as for caption-image ranking, we set the dimensions of the embedding space and GRU hidden state to be 1024, the dimension of the word embeddings to be 300, and constrain the embeddings to have unit L2 norm. We train for 10 epochs with batches of 128 sentence pairs. We use the Adam optimizer with learning rate 0.001 and early stopping on the validation set
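A hedged PyTorch sketch of a sentence encoder with this configuration (300-d word embeddings, a 1024-d GRU, unit-L2-norm outputs). The vocabulary size and the input tokens are placeholders, and this is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

class SentenceEncoder(torch.nn.Module):
    def __init__(self, vocab_size=20000, word_dim=300, hidden_dim=1024):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, word_dim)
        self.gru = torch.nn.GRU(word_dim, hidden_dim, batch_first=True)

    def forward(self, tokens):
        _, h = self.gru(self.embed(tokens))           # h: (1, batch, hidden_dim)
        return F.normalize(h.squeeze(0), p=2, dim=1)  # constrain outputs to unit L2 norm

enc = SentenceEncoder()
tokens = torch.randint(0, 20000, (128, 12))           # batch of 128 sentences of length 12
print(enc(tokens).shape)                              # torch.Size([128, 1024])
```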

References
  • Baroni, Marco, Bernardi, Raffaella, Do, Ngoc-Quynh, and Shan, Chung-chieh. Entailment above the word level in distributional semantics. In EACL, 2012.
  • Bordes, Antoine, Weston, Jason, Collobert, Ronan, and Bengio, Yoshua. Learning structured embeddings of knowledge bases. In AAAI, 2011.
  • Bowman, Samuel R., Angeli, Gabor, Potts, Christopher, and Manning, Christopher D. A large annotated corpus for learning natural language inference. In EMNLP, 2015.
  • Cho, Kyunghyun, Van Merrienboer, Bart, Gulcehre, Caglar, Bahdanau, Dzmitry, Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
  • Chopra, Sumit, Hadsell, Raia, and LeCun, Yann. Learning a similarity metric discriminatively, with application to face verification. In CVPR, 2005.
  • Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, and Fei-Fei, Li. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  • Frome, Andrea, Corrado, Greg S., Shlens, Jon, Bengio, Samy, Dean, Jeff, Mikolov, Tomas, et al. DeViSE: A deep visual-semantic embedding model. In NIPS, 2013.
  • He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
  • Hodosh, Micah, Young, Peter, and Hockenmaier, Julia. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR, 2013.
  • Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • Karpathy, Andrej and Li, Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
  • Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. In ICLR, 2015.
  • Kiros, Ryan, Salakhutdinov, Ruslan, and Zemel, Richard S. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
  • Kiros, Ryan, Zhu, Yukun, Salakhutdinov, Ruslan, Zemel, Richard S., Torralba, Antonio, Urtasun, Raquel, and Fidler, Sanja. Skip-thought vectors. In NIPS, 2015.
  • Klein, Benjamin, Lev, Guy, Sadeh, Gil, and Wolf, Lior. Associating neural word embeddings with deep image representations using Fisher vectors. In CVPR, 2015.
  • Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • Lin, Dahua, Fidler, Sanja, Kong, Chen, and Urtasun, Raquel. Visual semantic search: Retrieving videos via complex textual queries. In CVPR, 2014a.
  • Lin, Tsung-Yi, Maire, Michael, Belongie, Serge, Hays, James, Perona, Pietro, Ramanan, Deva, Dollar, Piotr, and Zitnick, C. Lawrence. Microsoft COCO: Common objects in context. In ECCV, 2014b.
  • Ma, Lin, Lu, Zhengdong, Shang, Lifeng, and Li, Hang. Multimodal convolutional neural networks for matching image and sentence. In ICCV, 2015.
  • Mao, Junhua, Xu, Wei, Yang, Yi, Wang, Jiang, and Yuille, Alan. Deep captioning with multimodal recurrent neural networks (m-RNN). In ICLR, 2015.
  • Mikolov, Tomas, Yih, Wen-tau, and Zweig, Geoffrey. Linguistic regularities in continuous space word representations. In HLT-NAACL, pp. 746–751, 2013.
  • Miller, George A. WordNet: A lexical database for English. Communications of the ACM, 1995.
  • Norouzi, Mohammad, Mikolov, Tomas, Bengio, Samy, Singer, Yoram, Shlens, Jonathon, Frome, Andrea, Corrado, Greg S., and Dean, Jeffrey. Zero-shot learning by convex combination of semantic embeddings. In ICLR, 2014.
  • Plummer, Bryan, Wang, Liwei, Cervantes, Chris, Caicedo, Juan, Hockenmaier, Julia, and Lazebnik, Svetlana. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. arXiv preprint arXiv:1505.04870, 2015.
  • Rocktaschel, Tim, Grefenstette, Edward, Hermann, Karl Moritz, Kocisky, Tomas, and Blunsom, Phil. Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664, 2015.
  • Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • Socher, Richard, Chen, Danqi, Manning, Christopher D., and Ng, Andrew. Reasoning with neural tensor networks for knowledge base completion. In NIPS, 2013.
  • Socher, Richard, Karpathy, Andrej, Le, Quoc V., Manning, Christopher D., and Ng, Andrew Y. Grounded compositional semantics for finding and describing images with sentences. TACL, 2014.
  • Van der Maaten, Laurens and Hinton, Geoffrey. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
  • Vilnis, Luke and McCallum, Andrew. Word representations via Gaussian embedding. In ICLR, 2015.
  • Vinyals, Oriol, Toshev, Alexander, Bengio, Samy, and Erhan, Dumitru. Show and tell: A neural image caption generator. In CVPR, 2015.
  • Young, Peter, Lai, Alice, Hodosh, Micah, and Hockenmaier, Julia. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2014.