Learning Two-Branch Neural Networks for Image-Text Matching Tasks

IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 394-407, 2018.

DOI: https://doi.org/10.1109/TPAMI.2018.2797921

Abstract:

Image-language matching tasks have recently attracted a lot of attention in the computer vision field. These tasks include image-sentence matching, i.e., given an image query, retrieving relevant sentences and vice versa, and region-phrase matching or visual grounding, i.e., matching a phrase to relevant regions. This paper investigates two-branch neural networks for matching image and text data on these tasks.

Introduction
  • Computer vision is moving from predicting discrete, categorical labels to generating rich descriptions of visual data, in particular, in the form of natural language.
  • Visual grounding tasks like referring expression understanding [10], [11] and phrase localization [12] find image regions indicated by questions, phrases, or sentences
  • To support these tasks, a number of large-scale datasets and benchmarks have recently been proposed, including MSCOCO [13] and Flickr30K [14] datasets for image captioning, Flickr30K Entities [12] for phrase localization, the Visual Genome dataset [15] for localized textual description of images, and the VQA dataset [7] for question answering
Highlights
  • Computer vision is moving from predicting discrete, categorical labels to generating rich descriptions of visual data, in particular, in the form of natural language
  • Visual grounding tasks like referring expression understanding [10], [11] and phrase localization [12] find image regions indicated by questions, phrases, or sentences
  • We propose state-of-the-art embedding and similarity networks for learning the correspondence between image and text data for two tasks: phrase localization and bi-directional image-sentence search
  • We focus on two image-text tasks: phrase localization and image-sentence retrieval
  • The similarity network merges the outputs of the two branches using an element-wise product, followed by a series of fully connected (FC) and Rectified Linear Unit (ReLU) layers; a minimal sketch is given after this list
  • While we argued in Section 5.1 that the similarity network is poorly suited for this task, it is remarkable just how low its numbers are, especially given its competitive accuracy on the phrase localization task
Methods
  • The baselines compared against include: mean vector [2], CCA (FV HGLMM) [2], CCA (FV GMM+HGLMM) [2], DVSA [4], m-RNN-vgg [33], mCNN [63], LayerNorm [65], OrderEmbedding [66], and Two-way Nets [18].
  • Adding neighborhood constraints on top of neighborhood sampling provides a more convincing gain for within-view retrieval than for cross-view retrieval; a hedged sketch of neighborhood sampling follows this list
  • This behavior can be useful for practical multi-media systems where both tasks are required at the same time
Results
  • Results on the Flickr30K and MSCOCO datasets are given in Tables 2 and 3, respectively.
  • The most relevant baseline for the embedding network is CCA (HGLMM) [2], [12], since it uses the same underlying feature representations for images and sentences.
  • Parts (b) of the tables give results for the embedding networks, and the trends are largely similar to those of Table 1.
  • In Table 3(b), adding neighborhood constraints improves R@1 in both directions but shows a small drop at R@10 (Recall@K is sketched after this list).
  • The authors will show in Section 5.5 that adding neighborhood constraints can consistently improve within-view retrieval
Conclusion
  • The authors' embedding and similarity networks are comparable in terms of recall, but they have different advantages and disadvantages.
  • The embedding network works by explicitly learning a non-linear mapping from input image and text features into a joint latent space in which corresponding image and text features have high similarity.
  • This network works well for both image-sentence and region-phrase tasks, though its objective consists of multiple terms and relies on somewhat costly and intricate triplet sampling; a hedged sketch of such a ranking loss follows this list.
  • The authors' preliminary, unsuccessful experiments on combining image-sentence and region-phrase models indicate an important direction for future research
Tables
  • Table1: Phrase localization results on Flickr30K Entities. We use 200 EdgeBox proposals, for which the recall upper bound is R@200 = 84.58. Part (b) gives the ablation study results for the embedding network. Using a nonlinearity within each branch improves performance, while using a bi-directional instead of a single-directional loss function makes little difference, since phrase localization emphasizes a single direction (phrase-to-region); given the limited space, we therefore only list combinations with the bi-directional loss in Table 1(b). Adding positive region augmentation improves R@1 by almost 5%. Neighborhood sampling, which uses at least two positive regions for each query phrase in the mini-batch, gives a further increase of about 2% over standard sampling that uses only one positive region. Part (c) reports the accuracy of the similarity network with and without nonlinearity in each branch, and with and without positive region augmentation. Consistent with the embedding network results, the nonlinear models improve R@1 over their linear versions by about 2%, but positive region augmentation gives an even bigger improvement of about 5%. The highest R@1 achieved by the similarity network is 51.05, almost identical to the 50.69 and 51.03 achieved by our best embedding networks. We also checked the performance of the similarity network with different numbers of FC layers after the element-wise product (not listed in Table 1 to avoid clutter): a single FC layer gives a significantly lower R@1 of 36.61, while two FC layers give 49.39, almost on par with three layers.
  • Table2: Bi-directional retrieval results. The numbers in (a) come from published papers, and the numbers in (b-d) are results of our embedding and similarity networks. Note that the Deep CCA results in [26] were obtained with AlexNet [64]. The results of our embedding network with AlexNet are still about 3% higher than those of [26] for image-to-sentence retrieval and 1% higher for sentence-to-image retrieval
  • Table3: Bi-directional retrieval results on the MSCOCO 1000-image test set
  • Table4: Sentence-to-sentence retrieval on Flickr30K and MSCOCO datasets
  • Table5: Results on Flickr30K image-sentence retrieval with incorporating region-phrase correspondences (see text)
Related work
  • CCA-based methods. One of the most popular baselines for image-text embedding is Canonical Correlation Analysis (CCA), which finds linear projections that maximize the correlation between projected vectors from the two views [21], [22]. Recent works using it include [2], [3], [23]. To obtain a nonlinear embedding, other works have opted for kernel CCA [21], [24], which finds maximally correlated projections in reproducing kernel Hilbert spaces with corresponding kernels. Despite being a classic textbook method, CCA has turned out to be a surprisingly powerful baseline: Klein et al. [2] showed that properly normalized CCA [23] with state-of-the-art image and text features can outperform much more complicated models. The main disadvantage of CCA is its high memory cost, as it requires loading all the data into memory to compute the data covariance matrix. A minimal CCA retrieval baseline is sketched below.
Funding
  • This material is based upon work supported by the National Science Foundation under Grants CIF-1302438 and IIS-1563727, Xerox UAC, and the Sloan Foundation
References
  • A. Karpathy, A. Joulin, and F. F. F. Li, “Deep fragment embeddings for bidirectional image sentence mapping,” in NIPS, 2014.
  • B. Klein, G. Lev, G. Sadeh, and L. Wolf, “Fisher vectors derived from hybrid gaussian-laplacian mixture models for image annotation,” CVPR, 2015.
  • Y. Gong, L. Wang, M. Hodosh, J. Hockenmaier, and S. Lazebnik, “Improving image-sentence embeddings using large weakly annotated photo collections,” in ECCV, 2014.
  • A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in CVPR, 2015.
  • J. Johnson, A. Karpathy, and L. Fei-Fei, “Densecap: Fully convolutional localization networks for dense captioning,” CVPR, 2016.
  • O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in CVPR, 2015.
  • S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh, “Vqa: Visual question answering,” in ICCV, 2015.
  • L. Yu, E. Park, A. C. Berg, and T. L. Berg, “Visual madlibs: Fill in the blank image generation and question answering,” ICCV, 2015.
  • A. Jabri, A. Joulin, and L. van der Maaten, “Revisiting visual question answering baselines,” in ECCV, 2016.
  • S. Kazemzadeh, V. Ordonez, M. Matten, and T. L. Berg, “Referitgame: Referring to objects in photographs of natural scenes.” in EMNLP, 2014.
  • L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg, “Modeling context in referring expressions,” in ECCV, 2016.
  • B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik, “Flickr30k entities: Collecting region-tophrase correspondences for richer image-to-sentence models,” in ICCV, 2015.
  • X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollar, and C. L. Zitnick, “Microsoft coco captions: Data collection and evaluation server,” arXiv preprint arXiv:1504.00325, 2015.
  • P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, “From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions,” Transactions of the Association for Computational Linguistics, vol. 2, pp. 67–78, 2014.
  • R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei, “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” 2016. [Online]. Available: https://arxiv.org/abs/1602.07332
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in NIPS, 2013.
  • T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in ECCV, 2014.
  • A. Eisenschtat and L. Wolf, “Linking image and text with 2-way nets,” CVPR, 2017.
  • L. Wang, Y. Li, and S. Lazebnik, “Learning deep structurepreserving image-text embeddings,” CVPR, 2016.
  • L. Yu, H. Tan, M. Bansal, and T. L. Berg, “A joint speaker-listenerreinforcer model for referring expressions,” CVPR, 2017.
  • D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor, “Canonical correlation analysis: An overview with application to learning methods,” Neural computation, vol. 16, no. 12, pp. 2639–2664, 2004.
  • H. Hotelling, “Relations between two sets of variables,” Biometrika, vol. 28, p. 312377, 1936.
  • Y. Gong, Q. Ke, M. Isard, and S. Lazebnik, “A multi-view embedding space for modeling internet images, tags, and their semantics,” IJCV, 2014.
  • M. Hodosh, P. Young, and J. Hockenmaier, “Framing image description as a ranking task: Data, models and evaluation metrics,” Journal of Artificial Intelligence Research, 2013.
  • G. Andrew, R. Arora, J. Bilmes, and K. Livescu, “Deep canonical correlation analysis,” in ICML, 2013.
  • F. Yan and K. Mikolajczyk, “Deep correlation for matching images and text,” in CVPR, 2015.
  • Z. Ma, Y. Lu, and D. Foster, “Finding linear structure in large datasets with scalable canonical correlation analysis,” ICML, 2015.
  • J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in ICML, 2011.
  • N. Srivastava and R. R. Salakhutdinov, “Multimodal learning with deep boltzmann machines,” in NIPS, 2012.
  • J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” arXiv:1411.4389, 2014.
  • R. Kiros, R. Salakhutdinov, and R. S. Zemel, “Multimodal neural language models.” in ICML, 2014.
  • R. Kiros, R. Salakhutdinov, and R. Zemel, “Unifying visualsemantic embeddings with multimodal neural language models,” in arXiv preprint arXiv:1411.2539, 2014.
  • J. Mao, W. Xu, Y. Yang, J. Wang, and A. Yuille, “Deep captioning with multimodal recurrent neural networks (m-rnn),” ICLR, 2015.
  • S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko, “Translating videos to natural language using deep recurrent neural networks,” arXiv:1412.4729, 2014.
  • J. Weston, S. Bengio, and N. Usunier, “Wsabie: Scaling up to large vocabulary image annotation,” in IJCAI, 2011.
  • A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov et al., “Devise: A deep visual-semantic embedding model,” in NIPS, 2013.
  • R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng, “Grounded compositional semantics for finding and describing images with sentences,” Transactions of the Association for Computational Linguistics, vol. 2, pp. 207–218, 2014.
  • J. Hu, J. Lu, and Y.-P. Tan, “Discriminative deep metric learning for face verification in the wild,” in CVPR, 2014.
  • T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka, “Metric learning for large scale image classification: Generalizing to new classes at near-zero cost,” in ECCV, 2012.
  • B. Shaw, B. Huang, and T. Jebara, “Learning a distance metric from a network,” in NIPS, 2011.
  • B. Shaw and T. Jebara, “Structure preserving embedding,” in ICML, 2009.
  • K. Q. Weinberger, J. Blitzer, and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification,” in NIPS, 2005.
  • J. Zbontar and Y. LeCun, “Computing the stereo matching cost with a convolutional neural network,” arXiv:1409.4326, 2014.
  • J. Bromley, J. W. Bentz, L. Bottou, I. Guyon, Y. LeCun, C. Moore, E. Sackinger, and R. Shah, “Signature verification using a siamese time delay neural network,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 7, no. 04, pp. 669–688, 1993.
  • S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” in CVPR, 2005.
  • X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg, “Matchnet: Unifying feature and metric learning for patch-based matching,” in CVPR, 2015.
  • E. Hoffer and N. Ailon, “Deep metric learning using triplet network,” ICLR, 2015.
  • F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” CVPR, 2015.
  • J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu, “Learning fine-grained image similarity with deep ranking,” in CVPR, 2014.
  • J. Ba, K. Swersky, S. Fidler, and R. Salakhutdinov, “Predicting deep zero-shot convolutional neural networks using textual descriptions,” ICCV, 2015.
  • A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach, “Multimodal compact bilinear pooling for visual question answering and visual grounding,” arXiv preprint arXiv:1606.01847, 2016.
  • A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele, “Grounding of textual phrases in images by reconstruction,” ECCV, 2016.
  • C. L. Zitnick and P. Dollar, “Edge boxes: Locating object proposals from edges,” in ECCV, 2014.
  • S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” ICML, 2015.
  • T. Joachims, T. Finley, and C.-N. J. Yu, “Cutting-plane training of structural svms,” Machine Learning, vol. 77, no. 1, pp. 27–59, 2009.
  • B. Plummer, L. Wang, C. Cervantes, J. Caicedo, J. Hockenmaier, and S. Lazebnik, “Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models,” IJCV, 2016.
  • R. Girshick, “Fast r-cnn,” in ICCV, 2015.
  • K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556, 2014.
  • M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge 2012,” 2011.
  • F. Perronnin, J. Sanchez, and T. Mensink, “Improving the Fisher kernel for large-scale image classification,” in ECCV, 2010.
  • M. Wang, M. Azab, N. Kojima, R. Mihalcea, and J. Deng, “Structured matching for phrase localization,” in ECCV, 2016.
  • D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • L. Ma, Z. Lu, L. Shang, and H. Li, “Multimodal convolutional neural networks for matching image and sentence,” ICCV, 2015.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in NIPS, 2012.
  • J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
  • I. Vendrov, R. Kiros, S. Fidler, and R. Urtasun, “Order-embeddings of images and language,” ICLR, 2016.
  • S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.