
MAF: Multimodal Alignment Framework for Weakly Supervised Phrase Grounding

EMNLP 2020


Abstract

Phrase localization is a task that studies the mapping from textual phrases to regions of an image. Given difficulties in annotating phrase-to-object datasets at scale, we develop a Multimodal Alignment Framework (MAF) to leverage more widely-available caption-image datasets, which can then be used as a form of weak supervision. […]

Introduction
  • Language grounding involves mapping language to real objects or data. Among language grounding tasks, phrase localization—which maps phrases to regions of an image—is a fundamental building block for other tasks.
  • Existing work (Rohrbach et al, 2016; Kim et al, 2018; Li et al, 2019; Yu et al, 2018; Liu et al, 2020) mainly focuses on the supervised phrase localization setting.
  • This requires a large-scale annotated dataset of phrase-object pairs for model training.
  • The widely-adopted Flickr30k (Plummer et al, 2015) dataset has only 31k images, while the caption dataset MS COCO (Lin et al, 2014) contains 330k images.
Highlights
  • Language grounding involves mapping language to real objects or data
  • We propose a Multimodal Alignment Framework (MAF), which is illustrated in Figure 3
  • We evaluate MAF on the public phrase localization dataset, Flickr30k Entities (Plummer et al, 2015)
  • Our result significantly outperforms the previous best result by 5.56%, which demonstrates the effectiveness of our visually-aware language representations
  • We present the Multimodal Alignment Framework (MAF), a novel method with fine-grained visual and textual representations for phrase localization; we train it under a weakly-supervised setting, using a contrastive objective to guide the alignment between visual and textual representations (a rough sketch of such an objective follows this list)
  • Our MAF with a ResNet-101-based Faster R-CNN detector pretrained on Visual Genome (VG) (Krishna et al, 2017) achieves an accuracy of 61.43%
  • We evaluate our model on Flickr30k Entities and achieve substantial improvements over the previous state-of-the-art methods with both weakly-supervised and unsupervised training strategies
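The contrastive objective mentioned above aligns visual and textual representations using only caption-image pairing as supervision. The snippet below is a minimal sketch of that idea, assuming dot-product phrase-region similarity, max-pooling over regions, and a softmax contrast over the images in a batch; the exact scoring function and loss used in MAF may differ, and all names and dimensions here are illustrative.

```python
import torch
import torch.nn.functional as F

def caption_image_score(phrase_emb: torch.Tensor, region_emb: torch.Tensor) -> torch.Tensor:
    """Score one caption against one image.

    phrase_emb: (num_phrases, dim) textual features for the caption's phrases.
    region_emb: (num_regions, dim) visual features for the image's boxes.
    Each phrase takes its best-matching region (max over regions), and the
    per-phrase scores are averaged into a single caption-image score.
    """
    sim = phrase_emb @ region_emb.t()        # (num_phrases, num_regions)
    return sim.max(dim=1).values.mean()

def contrastive_alignment_loss(phrases_per_caption, regions_per_image):
    """Contrast each matched caption-image pair against the other images in
    the batch: only the caption-image pairing is known (weak supervision),
    so the loss pushes matched pairs to score higher than mismatched ones.
    """
    scores = torch.stack([
        torch.stack([caption_image_score(p, r) for r in regions_per_image])
        for p in phrases_per_caption
    ])                                        # (batch, batch) score matrix
    targets = torch.arange(scores.size(0))    # diagonal entries are the matched pairs
    return F.cross_entropy(scores, targets)

# Illustrative usage with random 256-d features (sizes are hypothetical).
phrases = [torch.randn(3, 256), torch.randn(5, 256)]   # phrases per caption
regions = [torch.randn(10, 256), torch.randn(8, 256)]  # detected boxes per image
print(contrastive_alignment_loss(phrases, regions).item())
```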
Methods
Results
  • The Flickr30k Entities dataset contains 224k phrases and 31k images in total, where each image is associated with 5 captions and multiple localized bounding boxes.
  • The authors re-implemented Wang and Specia (2019) with a Faster R-CNN model trained on Visual Genome (Krishna et al, 2017).
  • This achieves 49.72% accuracy.
  • The authors can infer that attributes cannot provide much information in localization (24.08% accuracy if used alone), partly because attributes are not frequently used to differentiate objects in Flickr30k captions (a sketch of the accuracy computation follows this list).
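The accuracy figures above follow the usual phrase localization protocol for Flickr30k Entities, where a predicted box counts as correct if it overlaps a ground-truth box with IoU of at least 0.5. The helper below is a generic illustration of that computation, not code from the paper; boxes are assumed to be (x1, y1, x2, y2) tuples.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def localization_accuracy(predictions, ground_truths, threshold=0.5):
    """Fraction of phrases whose predicted box overlaps any ground-truth
    box for that phrase with IoU >= threshold."""
    hits = sum(
        any(iou(pred, gt) >= threshold for gt in gts)
        for pred, gts in zip(predictions, ground_truths)
    )
    return hits / max(len(predictions), 1)

# Hypothetical example: one correct and one incorrect prediction -> 0.5.
preds = [(10, 10, 50, 50), (0, 0, 20, 20)]
gts = [[(12, 12, 48, 52)], [(100, 100, 150, 150)]]
print(localization_accuracy(preds, gts))  # 0.5
```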
Conclusion
  • The authors present a Multimodal Alignment Framework, a novel method with fine-grained visual and textual representations for phrase localization, and the authors train it under a weakly-supervised setting, using a contrastive objective to guide the alignment between visual and textual representations.
  • The authors evaluate the model on Flickr30k Entities and achieve substantial improvements over the previous state-of-the-art methods with both weakly-supervised and unsupervised training strategies.
  • Detailed analysis is provided to help future work investigate other critical feature enrichment and alignment methods for this task.
Summary
  • Methods: The comparison in Table 1 lists supervised baselines with their backbone visual features: GroundeR (Rohrbach et al, 2016) and CCA (Plummer et al, 2015) use VGGdet, while BAN (Kim et al, 2018), visualBERT (Li et al, 2019), DDPN (Yu et al, 2018), and CGN (Liu et al, 2020) use ResNet-101; the weakly-supervised baselines are GroundeR (Rohrbach et al, 2016), Link (Yeh et al, 2018), and KAC (Chen et al, 2018).
Tables
  • Table 1: Weakly-supervised experiment results on Flickr30k Entities. (We abbreviate backbone visual feature model as “Vis. Feature,” and upper bound as “UB.”)
  • Table 2: Unsupervised experiment results on Flickr30k Entities. w2v-max refers to the similarity algorithm proposed in Wang and Specia (2019); Glove-att refers to our unsupervised inference strategy in Section 3.2; CC, OI, and PL stand for detectors trained on MS COCO (Lin et al, 2014), Open Images (Krasin et al, 2017), and Places (Zhou et al, 2017).
  • Table 3: Ablation experiment results of different visual and textual features. TFR and VFR denote textual and visual feature representations, respectively.
  • Table 4: Ablation results of different initializations. (ZR: zero initialization; RD: random initialization; ID+RD: noisy identity initialization; a sketch of these schemes follows this list.)
  • Table 5: Baseline results of unsupervised methods on Flickr30k Entities. Abbreviations are explained above.
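Table 4 ablates how a projection matrix is initialized, abbreviated as ZR, RD, and ID+RD. Which layer receives this initialization is not stated in this summary, so the sketch below simply applies the three schemes to a generic square nn.Linear projection, with an arbitrary noise scale; it is an illustration of the schemes, not the paper's implementation.

```python
import torch
import torch.nn as nn

def init_projection(linear: nn.Linear, scheme: str, noise_std: float = 1e-2):
    """Apply one of the ablated initialization schemes.

    ZR    : all-zero weights
    RD    : small random weights
    ID+RD : identity plus small random noise ("noisy identity")
    """
    d_out, d_in = linear.weight.shape
    with torch.no_grad():
        if scheme == "ZR":
            linear.weight.zero_()
        elif scheme == "RD":
            linear.weight.normal_(mean=0.0, std=noise_std)
        elif scheme == "ID+RD":
            assert d_out == d_in, "identity initialization needs a square matrix"
            linear.weight.copy_(torch.eye(d_in) + noise_std * torch.randn(d_in, d_in))
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        if linear.bias is not None:
            linear.bias.zero_()

proj = nn.Linear(300, 300)   # e.g. a 300-d textual projection (illustrative size)
init_projection(proj, "ID+RD")
```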
Related work
Funding
  • HT is supported by a Bloomberg Data Science Ph.D. Fellowship.
  • The views, opinions, and/or findings contained in this article are those of the authors and should not be interpreted as representing the official views or policies, either expressed or implied, of the funding agency
References
  • Hassan Akbari, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl Vondrick, and Shih-Fu Chang. 2019. Multi-level multimodal common semantic space for image-phrase grounding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12476–12486.
  • Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018a. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086.
  • Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. 2018b. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3674–3683.
  • Howard Chen, Alane Suhr, Dipendra Misra, Noah Snavely, and Yoav Artzi. 2019. Touchdown: Natural language navigation and spatial reasoning in visual street environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12538–12547.
  • Kan Chen, Jiyang Gao, and Ram Nevatia. 2018. Knowledge aided consistency for weakly supervised phrase grounding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4042–4050.
  • Samyak Datta, Karan Sikka, Anirban Roy, Karuna Ahuja, Devi Parikh, and Ajay Divakaran. 2019. Align2Ground: Weakly supervised phrase grounding guided by image-caption alignment. In Proceedings of the IEEE International Conference on Computer Vision, pages 2601–2610.
  • Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. 2010. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.
  • Jack Hessel, Lillian Lee, and David Mimno. 2019. Unsupervised discovery of multimodal links in multi-image, multi-sentence documents. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2034–2045.
  • Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137.
  • Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. 2018. Bilinear attention networks. In Advances in Neural Information Processing Systems, pages 1564–1574.
  • Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Andreas Veit, Serge Belongie, Victor Gomes, Abhinav Gupta, Chen Sun, Gal Chechik, David Cai, Zheyun Feng, Dhyanesh Narayanan, and Kevin Murphy. 2017. OpenImages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages.
  • Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73.
  • Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.
  • Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer.
  • Yongfei Liu, Bo Wan, Xiaodan Zhu, and Xuming He. 2020. Learning cross-modal context graph for visual grounding. In Proceedings of the AAAI Conference on Artificial Intelligence.
  • Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
  • Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision, pages 2641–2649.
  • Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99.
  • Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, and Bernt Schiele. 2016. Grounding of textual phrases in images by reconstruction. In European Conference on Computer Vision, pages 817–834. Springer.
  • Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5103–5114.
  • Jesse Thomason, Michael Murray, Maya Cakmak, and Luke Zettlemoyer. 2020. Vision-and-dialog navigation. In Conference on Robot Learning, pages 394–406.
  • Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. 2015. Sequence to sequence - video to text. In Proceedings of the IEEE International Conference on Computer Vision, pages 4534–4542.
  • Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164.
  • Josiah Wang and Lucia Specia. 2019. Phrase localization without paired training examples. In Proceedings of the IEEE International Conference on Computer Vision, pages 4663–4672.
  • Fanyi Xiao, Leonid Sigal, and Yong Jae Lee. 2017. Weakly-supervised visual grounding of phrases with linguistic structures. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5945–5954.
  • Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057.
  • Raymond A Yeh, Minh N Do, and Alexander G Schwing. 2018. Unsupervised textual grounding: Linking words to image concepts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6125–6134.
  • Zhou Yu, Jun Yu, Chenchao Xiang, Zhou Zhao, Qi Tian, and Dacheng Tao. 2018. Rethinking diversified and discriminative proposal generation for visual grounding. In International Joint Conference on Artificial Intelligence (IJCAI).
  • Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2017. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1452–1464.
  • Implementation details: For GloVe word embeddings, we use vectors with hidden dimension 300. Phrases are split into words by whitespace. We replace all out-of-vocabulary words with the introduced UNK token. For object proposals, we apply an off-the-shelf Faster R-CNN model (Ren et al., 2015) as the object detector for object pseudo-labels. The backbone of the detector is ResNet-101 (He et al., 2016), and it is pre-trained on Visual Genome with mAP = 10.1. We keep all bounding boxes with a confidence score larger than 0.1. For ResNet-based visual features, we use the 2048-dimensional features from Bottom-up attention (Anderson et al., 2018a), which is pretrained with 1600 object labels and 400 attributes.
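Putting the settings above together, the sketch below illustrates how a GloVe-based unsupervised inference strategy (in the spirit of the Glove-att entry in Table 2) might ground a phrase: average 300-d GloVe vectors over whitespace-split tokens with UNK for out-of-vocabulary words, keep detector proposals with confidence above 0.1, and pick the proposal whose pseudo-label is most similar to the phrase. The matching rule and every name here are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

UNK = "UNK"  # token substituted for out-of-vocabulary words, per the setup above

def phrase_embedding(text: str, glove: dict, dim: int = 300) -> np.ndarray:
    """Average 300-d GloVe vectors of a whitespace-split phrase,
    mapping out-of-vocabulary words to the UNK token."""
    tokens = [w if w in glove else UNK for w in text.lower().split()]
    vectors = [glove.get(t, np.zeros(dim)) for t in tokens]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

def keep_confident_boxes(boxes, scores, labels, threshold=0.1):
    """Keep detector proposals whose confidence exceeds the threshold (0.1 above)."""
    return [(b, l) for b, s, l in zip(boxes, scores, labels) if s > threshold]

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def ground_phrase(phrase, proposals, glove):
    """Pick the proposal whose predicted label is closest to the phrase in
    GloVe space (an illustration of GloVe-based unsupervised inference)."""
    p_emb = phrase_embedding(phrase, glove)
    return max(proposals, key=lambda bl: cosine(p_emb, phrase_embedding(bl[1], glove)))

# Hypothetical usage with a tiny stand-in for the GloVe lookup table.
glove = {"dog": np.random.rand(300), "man": np.random.rand(300), UNK: np.zeros(300)}
boxes = [(0, 0, 50, 80), (60, 10, 120, 90)]
proposals = keep_confident_boxes(boxes, scores=[0.9, 0.4], labels=["man", "dog"])
best_box, best_label = ground_phrase("a small dog", proposals, glove)
print(best_label)  # expected: "dog"
```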
Authors
Qinxin Wang
Hao Tan
Sheng Shen