Learning Two-Branch Neural Networks for Image-Text Matching Tasks
IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 394-407, 2018.
Abstract:
Image-language matching tasks have recently attracted a lot of attention in the computer vision field. These tasks include image-sentence matching, i.e., given an image query, retrieving relevant sentences and vice versa, and region-phrase matching or visual grounding, i.e., matching a phrase to relevant regions. This paper investigates two-branch neural networks for learning the similarity between these two data modalities.
Introduction
- Computer vision is moving from predicting discrete, categorical labels to generating rich descriptions of visual data, in particular, in the form of natural language.
- Visual grounding tasks like referring expression understanding [10], [11] and phrase localization [12] find image regions indicated by questions, phrases, or sentences
- To support these tasks, a number of large-scale datasets and benchmarks have recently been proposed, including MSCOCO [13] and Flickr30K [14] datasets for image captioning, Flickr30K Entities [12] for phrase localization, the Visual Genome dataset [15] for localized textual description of images, and the VQA dataset [7] for question answering
Highlights
- Computer vision is moving from predicting discrete, categorical labels to generating rich descriptions of visual data, in particular, in the form of natural language
- Visual grounding tasks like referring expression understanding [10], [11] and phrase localization [12] find image regions indicated by questions, phrases, or sentences
- We propose state-of-the-art embedding and similarity networks for learning the correspondence between image and text data for two tasks: phrase localization and bi-directional image-sentence search
- We focus on two image-text tasks: phrase localization and image-sentence retrieval
- The similarity network merges the output of the two branches using an element-wise product, followed by a series of fully connected and rectified linear unit (ReLU) layers (see the sketch after this list)
- While we argued in Section 5.1 that the similarity network is poorly suited for this task, it is remarkable just how low its numbers are, especially given its competitive accuracy on the phrase localization task
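The similarity-network design described above can be made concrete with a short sketch. The following PyTorch code is illustrative only: the layer widths, input feature dimensions, and number of fusion layers are assumptions chosen for readability rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SimilarityNetwork(nn.Module):
    """Two-branch similarity network sketch: each branch maps its modality
    into a common space, the branch outputs are fused by an element-wise
    product, and a small FC/ReLU stack predicts a matching score."""

    def __init__(self, img_dim=4096, txt_dim=6000, embed_dim=512):
        super().__init__()
        # Image branch: fully connected layers with a ReLU nonlinearity.
        self.img_branch = nn.Sequential(
            nn.Linear(img_dim, 2048), nn.ReLU(inplace=True),
            nn.Linear(2048, embed_dim),
        )
        # Text branch: same structure over precomputed sentence/phrase features.
        self.txt_branch = nn.Sequential(
            nn.Linear(txt_dim, 2048), nn.ReLU(inplace=True),
            nn.Linear(2048, embed_dim),
        )
        # Fusion head: FC + ReLU layers applied after the element-wise product.
        self.fusion = nn.Sequential(
            nn.Linear(embed_dim, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 1),
        )

    def forward(self, img_feat, txt_feat):
        x = self.img_branch(img_feat)          # (B, embed_dim)
        y = self.txt_branch(txt_feat)          # (B, embed_dim)
        return self.fusion(x * y).squeeze(-1)  # one score per image-text pair
```

Each forward pass scores one image-text pair, so positive and negative pairs can be supervised directly with a classification or regression loss on the output score.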
Methods
- The baseline methods compared in the tables include Mean vector [2], CCA (FV HGLMM) [2], CCA (FV GMM+HGLMM) [2], DVSA [4], m-RNN-vgg [33], mCNN [63], LayerNorm [65], OrderEmbedding [66], and Two-way Nets [18].
- It can be seen that adding neighborhood constraints on top of neighborhood sampling provides a more convincing gain for within-view retrieval than for cross-view retrieval (a sketch of the underlying bi-directional ranking loss follows this list)
- This behavior can be useful for practical multi-media systems where both tasks are required at the same time
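For reference, here is a minimal sketch of a bi-directional margin-based ranking loss of the kind the embedding network is trained with. The margin, the direction weight, and the use of all in-batch negatives are illustrative assumptions; the paper's full objective additionally includes neighborhood-preserving (within-view) terms that are omitted here.

```python
import torch
import torch.nn.functional as F

def bidirectional_ranking_loss(img_emb, txt_emb, margin=0.1, lam=1.0):
    """Bi-directional triplet-style loss over a mini-batch in which img_emb[i]
    and txt_emb[i] form a positive pair and all other in-batch pairings are
    treated as negatives (margin and lam are placeholder values)."""
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    sim = img_emb @ txt_emb.t()            # (B, B) cosine similarity matrix
    pos = sim.diag().unsqueeze(1)          # similarity of each matched pair

    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    # Image-to-sentence: the matched sentence should outscore all others by a margin.
    i2s = F.relu(margin + sim - pos)[off_diag].mean()
    # Sentence-to-image: the matched image should outscore all others by a margin.
    s2i = F.relu(margin + sim.t() - pos)[off_diag].mean()
    return i2s + lam * s2i
```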
Results
- Results on the Flickr30K and MSCOCO datasets are given in Tables 2 and 3, respectively.
- The most relevant baseline for the embedding network is CCA (HGLMM) [2], [12], since it uses the same underlying feature representations for images and sentences.
- Parts (b) of the tables give results for the embedding networks, and the trends are largely similar to those of Table 1.
- In Table 3(b), adding neighborhood constraints improves R@1 in both directions but shows a small drop for R@10 (Recall@K for bi-directional retrieval is illustrated by the sketch after this list).
- The authors will show in Section 5.5 that adding neighborhood constraints can consistently improve within-view retrieval
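The R@K numbers quoted above measure the fraction of queries whose ground-truth match appears among the top K retrieved items. A small NumPy sketch of this evaluation, assuming for simplicity one ground-truth caption per image, is:

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Recall@K for bi-directional retrieval from a similarity matrix `sim`,
    where sim[i, j] scores image i against sentence j and sentence i is the
    single ground-truth match for image i (real benchmarks use 5 captions)."""
    n = sim.shape[0]
    # Rank (0-based) of the ground-truth sentence for every image query.
    i2s_rank = (np.argsort(-sim, axis=1) == np.arange(n)[:, None]).argmax(axis=1)
    # Rank of the ground-truth image for every sentence query.
    s2i_rank = (np.argsort(-sim.T, axis=1) == np.arange(n)[:, None]).argmax(axis=1)
    scores = {}
    for k in ks:
        scores[f"image-to-sentence R@{k}"] = 100.0 * np.mean(i2s_rank < k)
        scores[f"sentence-to-image R@{k}"] = 100.0 * np.mean(s2i_rank < k)
    return scores
```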
Conclusion
- The authors' embedding and similarity networks are comparable in terms of recall, but they have different advantages and disadvantages.
- The embedding network works by explicitly learning a non-linear mapping from input image and text features into a joint latent space in which corresponding image and text features have high similarity (a minimal sketch of this design follows the list).
- This network works well for both image-sentence and region-phrase tasks, though its objective consists of multiple terms and relies on somewhat costly and intricate triplet sampling.
- The authors' preliminary unsuccessful experiments on combining image-sentence and region-phrase models indicate an important direction for future research
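To make the contrast with the similarity network concrete, the sketch below illustrates the embedding-network idea: each branch applies fully connected and ReLU layers followed by L2 normalization, so that corresponding image and text features end up with high cosine similarity in the joint space. Dimensions and layer counts are again illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingNetwork(nn.Module):
    """Embedding-network sketch: two independent branches project image and
    text features into a shared space; L2 normalization makes the dot product
    a cosine similarity that is used directly for retrieval."""

    def __init__(self, img_dim=4096, txt_dim=6000, embed_dim=512):
        super().__init__()
        self.img_branch = nn.Sequential(
            nn.Linear(img_dim, 2048), nn.ReLU(inplace=True),
            nn.Linear(2048, embed_dim),
        )
        self.txt_branch = nn.Sequential(
            nn.Linear(txt_dim, 2048), nn.ReLU(inplace=True),
            nn.Linear(2048, embed_dim),
        )

    def forward(self, img_feat, txt_feat):
        x = F.normalize(self.img_branch(img_feat), dim=1)
        y = F.normalize(self.txt_branch(txt_feat), dim=1)
        return x, y  # rank candidates by x @ y.t() at retrieval time
```

Unlike the similarity network, the embeddings for a gallery can be precomputed once, which keeps both cross-view and within-view retrieval cheap at test time.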
Tables
- Table1: Phrase localization results on Flickr30K Entities. We use 200 EdgeBox proposals, for which the recall upper bound is R@200 = 84.58. (b) gives the ablation study results for the embedding network. Using a nonlinearity within each branch improves performance, and using a bi-directional instead of a single-directional loss function makes little difference, since phrase localization emphasizes the single phrase-to-region direction. Therefore, given the limited space, we only list combinations with the bi-directional loss in Table 1(b). Adding positive region augmentation improves R@1 by almost 5%. Neighborhood sampling, which uses at least two positive regions for each query phrase in the mini-batch, gives a further increase of about 2% over standard sampling, which uses only one positive region. (c) reports the accuracy of the similarity network with and without nonlinearity in each branch, and with and without positive region augmentation. Consistent with the embedding network results, the nonlinear models improve R@1 over their linear versions by about 2%, but positive region augmentation gives an even bigger improvement of about 5%. The highest R@1 achieved by the similarity network is 51.05, which is almost identical to the 50.69 or 51.03 achieved by our best embedding networks. We also checked the performance of the similarity network with a different number of FC layers after the element-wise product, though we do not list the complete numbers in Table 1 to avoid clutter. With a single FC layer we get a significantly lower R@1 of 36.61, and with two FC layers we get 49.39, which is almost on par with three layers.
- Table2: Bi-directional retrieval results. The numbers in (a) come from published papers, and the numbers in (b-d) are results of our embedding and similarity networks. Note that the Deep CCA results in [26] were obtained with AlexNet [64]. The results of our embedding network with AlexNet are still about 3% higher than those of [26] for image-to-sentence retrieval and 1% higher for sentence-to-image retrieval
- Table3: Bi-directional retrieval results on the MSCOCO 1000-image test set
- Table4: Sentence-to-sentence retrieval on Flickr30K and MSCOCO datasets
- Table5: Results on Flickr30K image-sentence retrieval with incorporating region-phrase correspondences (see text)
Related work
- CCA-based methods. One of the most popular baselines for image-text embedding is Canonical Correlation Analysis (CCA), which finds linear projections that maximize the correlation between projected vectors from the two views [21], [22]. Recent works using it include [2], [3], [23]. To obtain a nonlinear embedding, other works have opted for kernel CCA [21], [24], which finds maximally correlated projections in reproducing kernel Hilbert spaces with corresponding kernels. Despite being a classic textbook method, CCA has turned out to be a surprisingly powerful baseline. Klein et al. [2] showed that properly normalized CCA [23] with state-of-the-art image and text features can outperform much more complicated models. The main disadvantage of CCA is its high memory cost, as it requires loading all the data into memory to compute the data covariance matrix.
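As a point of reference for the CCA baseline described above, the following is a toy sketch using scikit-learn's CCA. The feature dimensions and component count are arbitrary stand-ins, and real systems would use CNN image features and Fisher-vector text features rather than random data.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Toy paired data standing in for image and text feature vectors.
rng = np.random.default_rng(0)
img_feats = rng.standard_normal((500, 256))
txt_feats = rng.standard_normal((500, 300))

# Fit linear projections that maximize the correlation between the two views.
# Note: the whole paired dataset must be available at fit time, which is the
# memory bottleneck mentioned above for large-scale data.
cca = CCA(n_components=16, max_iter=500)
cca.fit(img_feats, txt_feats)
img_proj, txt_proj = cca.transform(img_feats, txt_feats)

# Cross-view retrieval then ranks, e.g., sentences for an image query by the
# similarity between the projected rows of img_proj and txt_proj.
```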
Funding
- This material is based upon work supported by the National Science Foundation under Grants CIF-1302438 and IIS-1563727, Xerox UAC, and the Sloan Foundation
Reference
- A. Karpathy, A. Joulin, and L. Fei-Fei, “Deep fragment embeddings for bidirectional image sentence mapping,” in NIPS, 2014.
- B. Klein, G. Lev, G. Sadeh, and L. Wolf, “Fisher vectors derived from hybrid gaussian-laplacian mixture models for image annotation,” CVPR, 2015.
- Y. Gong, L. Wang, M. Hodosh, J. Hockenmaier, and S. Lazebnik, “Improving image-sentence embeddings using large weakly annotated photo collections,” in ECCV, 2014.
- A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in CVPR, 2015.
- J. Johnson, A. Karpathy, and L. Fei-Fei, “Densecap: Fully convolutional localization networks for dense captioning,” CVPR, 2016.
- O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in CVPR, 2015.
- S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh, “Vqa: Visual question answering,” in ICCV, 2015.
- L. Yu, E. Park, A. C. Berg, and T. L. Berg, “Visual madlibs: Fill in the blank image generation and question answering,” ICCV, 2015.
- A. Jabri, A. Joulin, and L. van der Maaten, “Revisiting visual question answering baselines,” in ECCV, 2016.
- S. Kazemzadeh, V. Ordonez, M. Matten, and T. L. Berg, “Referitgame: Referring to objects in photographs of natural scenes.” in EMNLP, 2014.
- L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg, “Modeling context in referring expressions,” in ECCV, 2016.
- B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik, “Flickr30k entities: Collecting region-tophrase correspondences for richer image-to-sentence models,” in ICCV, 2015.
- X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollar, and C. L. Zitnick, “Microsoft coco captions: Data collection and evaluation server,” arXiv preprint arXiv:1504.00325, 2015.
- P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, “From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions,” Transactions of the Association for Computational Linguistics, vol. 2, pp. 67–78, 2014.
- R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei, “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” 2016. [Online]. Available: https://arxiv.org/abs/1602.07332
- T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in NIPS, 2013.
- T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in ECCV, 2014.
- A. Eisenschtat and L. Wolf, “Linking image and text with 2-way nets,” CVPR, 2017.
- L. Wang, Y. Li, and S. Lazebnik, “Learning deep structure-preserving image-text embeddings,” CVPR, 2016.
- L. Yu, H. Tan, M. Bansal, and T. L. Berg, “A joint speaker-listenerreinforcer model for referring expressions,” CVPR, 2017.
- D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor, “Canonical correlation analysis: An overview with application to learning methods,” Neural computation, vol. 16, no. 12, pp. 2639–2664, 2004.
- H. Hotelling, “Relations between two sets of variates,” Biometrika, vol. 28, pp. 321–377, 1936.
- Y. Gong, Q. Ke, M. Isard, and S. Lazebnik, “A multi-view embedding space for modeling internet images, tags, and their semantics,” IJCV, 2014.
- M. Hodosh, P. Young, and J. Hockenmaier, “Framing image description as a ranking task: Data, models and evaluation metrics,” Journal of Artificial Intelligence Research, 2013.
- G. Andrew, R. Arora, J. Bilmes, and K. Livescu, “Deep canonical correlation analysis,” in ICML, 2013.
- F. Yan and K. Mikolajczyk, “Deep correlation for matching images and text,” in CVPR, 2015.
- Z. Ma, Y. Lu, and D. Foster, “Finding linear structure in large datasets with scalable canonical correlation analysis,” ICML, 2015.
- J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in ICML, 2011.
- N. Srivastava and R. R. Salakhutdinov, “Multimodal learning with deep boltzmann machines,” in NIPS, 2012.
- J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” arXiv:1411.4389, 2014.
- R. Kiros, R. Salakhutdinov, and R. S. Zemel, “Multimodal neural language models.” in ICML, 2014.
- R. Kiros, R. Salakhutdinov, and R. Zemel, “Unifying visual-semantic embeddings with multimodal neural language models,” arXiv preprint arXiv:1411.2539, 2014.
- J. Mao, W. Xu, Y. Yang, J. Wang, and A. Yuille, “Deep captioning with multimodal recurrent neural networks (m-rnn),” ICLR, 2015.
- S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko, “Translating videos to natural language using deep recurrent neural networks,” arXiv:1412.4729, 2014.
- J. Weston, S. Bengio, and N. Usunier, “Wsabie: Scaling up to large vocabulary image annotation,” in IJCAI, 2011.
- A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov et al., “Devise: A deep visual-semantic embedding model,” in NIPS, 2013.
- R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng, “Grounded compositional semantics for finding and describing images with sentences,” Transactions of the Association for Computational Linguistics, vol. 2, pp. 207–218, 2014.
- J. Hu, J. Lu, and Y.-P. Tan, “Discriminative deep metric learning for face verification in the wild,” in CVPR, 2014.
- T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka, “Metric learning for large scale image classification: Generalizing to new classes at near-zero cost,” in ECCV, 2012.
- B. Shaw, B. Huang, and T. Jebara, “Learning a distance metric from a network,” in NIPS, 2011.
- B. Shaw and T. Jebara, “Structure preserving embedding,” in ICML, 2009.
- K. Q. Weinberger, J. Blitzer, and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification,” in NIPS, 2005.
- J. Zbontar and Y. LeCun, “Computing the stereo matching cost with a convolutional neural network,” arXiv:1409.4326, 2014.
- J. Bromley, J. W. Bentz, L. Bottou, I. Guyon, Y. LeCun, C. Moore, E. Sackinger, and R. Shah, “Signature verification using a siamese time delay neural network,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 7, no. 04, pp. 669–688, 1993.
- S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” in CVPR, 2005.
- X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg, “Matchnet: Unifying feature and metric learning for patch-based matching,” in CVPR, 2015.
- E. Hoffer and N. Ailon, “Deep metric learning using triplet network,” ICLR, 2015.
- F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” CVPR, 2015.
- J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu, “Learning fine-grained image similarity with deep ranking,” in CVPR, 2014.
- J. Ba, K. Swersky, S. Fidler, and R. Salakhutdinov, “Predicting deep zero-shot convolutional neural networks using textual descriptions,” ICCV, 2015.
- A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach, “Multimodal compact bilinear pooling for visual question answering and visual grounding,” arXiv preprint arXiv:1606.01847, 2016.
- A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele, “Grounding of textual phrases in images by reconstruction,” ECCV, 2016.
- C. L. Zitnick and P. Dollar, “Edge boxes: Locating object proposals from edges,” in ECCV, 2014.
- S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” ICML, 2015.
- T. Joachims, T. Finley, and C.-N. J. Yu, “Cutting-plane training of structural svms,” Machine Learning, vol. 77, no. 1, pp. 27–59, 2009.
- B. Plummer, L. Wang, C. Cervantes, J. Caicedo, J. Hockenmaier, and S. Lazebnik, “Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models,” IJCV, 2016.
- R. Girshick, “Fast r-cnn,” in ICCV, 2015.
- K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556, 2014.
- M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge 2012,” 2011.
- F. Perronnin, J. Sanchez, and T. Mensink, “Improving the Fisher kernel for large-scale image classification,” in ECCV, 2010.
- M. Wang, M. Azab, N. Kojima, R. Mihalcea, and J. Deng, “Structured matching for phrase localization,” in ECCV, 2016.
- D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- L. Ma, Z. Lu, L. Shang, and H. Li, “Multimodal convolutional neural networks for matching image and sentence,” ICCV, 2015.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in NIPS, 2012.
- J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
- I. Vendrov, R. Kiros, S. Fidler, and R. Urtasun, “Order-embeddings of images and language,” ICLR, 2016.
- S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.