Context-Aware Group Captioning via Self-Attention and Contrastive Features

CVPR, pp. 3437-3447, 2020.

DOI: https://doi.org/10.1109/CVPR42600.2020.00350
Other Links: arxiv.org|dblp.uni-trier.de|academic.microsoft.com

Abstract:

While image captioning has progressed rapidly, existing works focus mainly on describing single images. In this paper, we introduce a new task, context-aware group captioning, which aims to describe a group of target images in the context of another group of related reference images. Context-aware group captioning requires not only summarizing information from both the target and reference image groups, but also contrasting between them.

Introduction
  • Generating natural language descriptions from images, the task commonly known as image captioning, has long been an important problem in computer vision research [3, 16, 32].
  • It requires a high level of understanding from both language and vision.
  • It is often the case that the target image group to be captioned naturally comes with a larger pool of related reference images that provides its context
Highlights
  • Generating natural language descriptions from images, the task commonly known as image captioning, has long been an important problem in computer vision research [3, 16, 32]
  • The objective is to recognize that the user wants "woman with cowboy hat" and to suggest that query. Inspired by such real-world applications, we propose the novel problem of context-aware group captioning: given a group of target images and a group of reference images, our goal is to generate a language description that best describes the target group in the context of the reference group
  • To obtain a feature that effectively summarizes the visual information of an image group, we develop a group-wise feature aggregation module based on self-attention
  • To effectively leverage the contrastive information between the target and reference image groups, we model the context as the aggregated feature of the whole image set and subtract it from each group's feature, explicitly encouraging the resulting feature to capture the properties that differentiate the target group from the reference group (a minimal sketch follows this list)
  • We present the novel context-aware group captioning task, where the objective is to describe a target image group in contrast to a reference image group
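  The two bullets above describe the paper's feature modules in prose. Below is a minimal sketch of how such a pipeline could look, assuming PyTorch; the tensor shapes, hyper-parameters, and module names are illustrative assumptions rather than the authors' released implementation.

    import torch
    import torch.nn as nn

    class GroupFeatureAggregator(nn.Module):
        """Aggregates per-image features into one group feature via self-attention.

        Illustrative sketch only; dimensions and the mean-pooling choice are assumptions.
        """
        def __init__(self, dim=512, num_heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, feats):                         # feats: (batch, n_images, dim)
            attended, _ = self.attn(feats, feats, feats)  # self-attention across the group
            return attended.mean(dim=1)                   # (batch, dim) group representation

    def contrastive_group_features(target_feats, reference_feats, aggregator):
        """Builds group and contrastive representations for both image groups."""
        g_target = aggregator(target_feats)               # summarize the target group
        g_reference = aggregator(reference_feats)         # summarize the reference group

        # Context = aggregated feature of the whole set (target + reference images).
        context = aggregator(torch.cat([target_feats, reference_feats], dim=1))

        # Subtracting the shared context encourages the result to keep only the
        # properties that differentiate one group from the other.
        c_target = g_target - context
        c_reference = g_reference - context
        return g_target, c_target, g_reference, c_reference

  The group and contrastive features would then be fed to a language decoder (e.g., an LSTM) to generate the caption, as suggested by the Table 4 description below.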
Methods
  • The authors' goal is to find the caption that best describes the target group in the context of the reference group
Conclusion
  • In example (b), image 5, where the woman doing yoga is larger and easier to recognize, receives higher attention
  • In both examples, images with more recognizable features receive higher attention weights and contribute more to the aggregated group representation. In this paper, the authors present the novel context-aware group captioning task, where the objective is to describe a target image group in contrast to a reference image group.
  • The authors thoroughly analyze the behavior of the models to provide insights into this new problem
Tables
  • Table1: Statistics of Conceptual Captions and Stock Captions, in terms of original per-image captioning dataset and our group captioning dataset constructed on top of per-image captions
  • Table2: Group captioning performance on the Conceptual Captions and Stock Captions dataset
  • Table3: Performance when varying the number of target and reference images (evaluated on the Stock Captions dataset)
  • Table4: Analysis of contrastive representation. Column Contrastive + Group is the prediction of our full model. Column Group and column Contrastive are the predictions when only the group or only the contrastive representation is fed into the decoder respectively. Blue text denotes the common part while red text denotes the contrastive part
  • Table5: Statistics of each caption type on Conceptual Captions and Stock Captions
  • Table6: Performance change when varying the number of reference images on Stock Captions dataset
Related work
  • Image captioning has emerged as an important research topic with a rich literature in computer vision [3, 16, 32].

    With the advances in deep neural networks, state-of-the-art image captioning approaches [1, 13, 19, 21, 39, 42, 53, 60] are based on the CNN-RNN architecture, a combination of convolutional neural networks [26] and recurrent neural networks [15], where visual features are extracted from the input image by a CNN and then decoded by an RNN to generate the language caption describing the given image. Research in image captioning has progressed rapidly in recent years. Novel network architectures [1, 7, 35, 54], loss functions [8, 31, 33, 36, 42, 44], and advanced joint language-vision modeling techniques [20, 23, 35, 58, 59, 61] have been developed to enable more diverse and discriminative captioning results. Recent works have also proposed to leverage contextual and contrastive information from additional images to help generate more distinctive captions for the target image [2, 6, 9, 17, 51] or comparative descriptions between image pairs [41, 46, 47, 49]. Existing works, however, mostly focus on generating captions for a single image. Our work, on the other hand, focuses on the novel setting of context-based image group captioning, which aims to describe a target image group while leveraging the context of a larger pool of reference images.
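    As a concrete illustration of the CNN-RNN paradigm summarized above, the following sketch (assuming PyTorch/torchvision; the backbone choice, dimensions, and class name are illustrative, not any specific cited model) encodes an image with a CNN and decodes a caption token-by-token with an LSTM:

      import torch
      import torch.nn as nn
      import torchvision.models as models

      class CNNRNNCaptioner(nn.Module):
          """Generic CNN encoder + LSTM decoder for captioning (illustrative only)."""
          def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
              super().__init__()
              backbone = models.resnet50(weights=None)           # CNN feature extractor
              self.encoder = nn.Sequential(*list(backbone.children())[:-1])
              self.img_proj = nn.Linear(2048, embed_dim)         # project image feature to word space
              self.embed = nn.Embedding(vocab_size, embed_dim)
              self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
              self.out = nn.Linear(hidden_dim, vocab_size)

          def forward(self, images, captions):
              feats = self.encoder(images).flatten(1)            # (batch, 2048)
              img_token = self.img_proj(feats).unsqueeze(1)      # image acts as the first "word"
              words = self.embed(captions)                       # (batch, seq_len, embed_dim)
              hidden, _ = self.lstm(torch.cat([img_token, words], dim=1))
              return self.out(hidden)                            # logits over the vocabulary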
Funding
  • Introduces a new task, context-aware group captioning, which aims to describe a group of target images in the context of another group of related reference images
  • Proposes a framework combining a self-attention mechanism with contrastive feature construction to effectively summarize common information from each image group while capturing discriminative information between them
  • Proposes to group the images and generate the group captions from single-image captions using scene graph matching (a toy sketch follows this list)
  • Proposes the novel problem of context-aware group captioning: given a group of target images and a group of reference images, our goal is to generate a language description that best describes the target group in the context of the reference group
  • Develops a learning-based framework for context-aware image group captioning based on self-attention and contrastive feature construction
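  A toy sketch of the scene-graph-based grouping idea referenced in the list above: captions are assumed to be parsed into sets of hashable scene-graph tuples by an external parser (not shown here), and images whose graphs share a common tuple are grouped together. Function names, the tuple format, and the group-size threshold are hypothetical illustrations, not the paper's actual pipeline.

    from collections import defaultdict

    def group_by_shared_subgraph(image_graphs, min_group_size=5):
        """Groups image ids whose parsed scene graphs share a common tuple.

        image_graphs: dict mapping image_id -> set of hashable scene-graph tuples,
        e.g. ("woman", "cowboy hat"), produced by an external caption parser.
        Returns a dict mapping each shared tuple (a candidate group phrase) to the
        list of image ids whose captions contain it.
        """
        groups = defaultdict(list)
        for image_id, graph in image_graphs.items():
            for node in graph:
                groups[node].append(image_id)
        # Keep only tuples shared by enough images to form a meaningful group.
        return {node: ids for node, ids in groups.items() if len(ids) >= min_group_size}

    # Toy usage with hypothetical parsed captions:
    parsed = {
        "img1": {("woman", "cowboy hat"), ("woman", "standing")},
        "img2": {("woman", "cowboy hat")},
        "img3": {("woman", "running")},
    }
    print(group_by_shared_subgraph(parsed, min_group_size=2))
    # -> {('woman', 'cowboy hat'): ['img1', 'img2']}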
Reference
  • Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086, 2018. 1, 2
  • Jacob Andreas and Dan Klein. Reasoning about pragmatics with neural listeners and speakers. arXiv preprint arXiv:1604.00562, 2016. 2
  • Shuang Bai and Shan An. A survey on automatic image caption generation. Neurocomputing, 311:291–304, 2018. 1, 2
  • Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005. 6
  • Yue Cao, Jiarui Xu, Stephen Lin, Fangyun Wei, and Han Hu. Gcnet: Non-local networks meet squeeze-excitation networks and beyond. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 0–0, 2019. 2
  • Fuhai Chen, Rongrong Ji, Xiaoshuai Sun, Yongjian Wu, and Jinsong Su. Groupcap: Group-based image captioning with structured relevance and diversity constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1345–1353, 2018. 2, 3
  • Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, and Tat-Seng Chua. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5659–5667, 2017. 2
  • Bo Dai, Sanja Fidler, Raquel Urtasun, and Dahua Lin. Towards diverse and natural image descriptions via a conditional gan. In Proceedings of the IEEE International Conference on Computer Vision, pages 2970–2979, 2017. 2
  • Bo Dai and Dahua Lin. Contrastive learning for image captioning. In Advances in Neural Information Processing Systems, pages 898–907, 2017. 2
  • Mostafa Dehghani, Sascha Rothe, Enrique Alfonseca, and Pascal Fleury. Learning to attend, copy, and generate for session-based query suggestion. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 1747–1756. ACM, 2017. 2, 3
  • Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. 2
  • Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2625–2634, 2015. 2
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 5
  • Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997. 2, 5
  • MD Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga. A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CSUR), 51(6):118, 2019. 1, 2
  • Ting-Hao Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. Visual storytelling. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1233–1239, 2016. 2
  • Jyun-Yu Jiang and Wei Wang. Rin: Reformulation inference network for context-aware query suggestion. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 197–206. ACM, 2018. 2, 3
  • Wenhao Jiang, Lin Ma, Yu-Gang Jiang, Wei Liu, and Tong Zhang. Recurrent fusion network for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 499–515, 2018. 2
  • Justin Johnson, Andrej Karpathy, and Li Fei-Fei. Densecap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4565–4574, 2016. 2
  • Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137, 2015. 1, 2
  • Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014. 2
  • Dong-Jin Kim, Jinsoo Choi, Tae-Hyun Oh, and In So Kweon. Dense relational captioning: Triple-stream networks for relationship-based captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6271–6280, 2019. 2
  • Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 6
  • Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017. 3
  • Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012. 2
  • Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019. 2
  • Yingwei Li, Xiaojie Jin, Jieru Mei, Xiaochen Lian, Linjie Yang, Cihang Xie, Qihang Yu, Yuyin Zhou, Song Bai, and Alan Yuille. Neural architecture search for lightweight nonlocal networks. In CVPR, 2020. 2
  • Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004. 6
  • Chenxi Liu, Junhua Mao, Fei Sha, and Alan Yuille. Attention correctness in neural image captioning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017. 2
  • Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. Improved image captioning via policy gradient optimization of spider. In Proceedings of the IEEE international conference on computer vision, pages 873–881, 2017. 2
  • Xiaoxiao Liu, Qingyang Xu, and Ning Wang. A survey on deep neural network-based image captioning. The Visual Computer, 35(3):445–470, 2019. 1, 2
  • Xihui Liu, Hongsheng Li, Jing Shao, Dapeng Chen, and Xiaogang Wang. Show, tell and discriminate: Image captioning by self-retrieval with partially labeled data. In Proceedings of the European Conference on Computer Vision (ECCV), pages 338–354, 2018. 2
  • Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265, 2019. 2
  • Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 375–383, 2017. 2
  • Ruotian Luo, Brian Price, Scott Cohen, and Gregory Shakhnarovich. Discriminability objective for training descriptive captions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6964– 6974, 2018. 2
  • Ruotian Luo and Gregory Shakhnarovich. Comprehensionguided referring expressions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7102–7111, 2017. 2
  • Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 11–20, 2016. 2
  • Vicente Ordonez, Girish Kulkarni, and Tamara L Berg. Im2text: Describing images using 1 million captioned photographs. In Advances in neural information processing systems, pages 1143–1151, 2011. 2
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics, 2002. 6
  • Dong Huk Park, Trevor Darrell, and Anna Rohrbach. Robust change captioning. In Proceedings of the IEEE International Conference on Computer Vision, pages 4624–4633, 2019. 2
  • Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7008– 7024, 2017. 1, 2
  • Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018. 2, 3
  • Rakshith Shetty, Marcus Rohrbach, Lisa Anne Hendricks, Mario Fritz, and Bernt Schiele. Speaking the same language: Matching machine to human captions by adversarial training. In Proceedings of the IEEE International Conference on Computer Vision, pages 4135–4144, 2017. 2
  • Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Simonsen, and Jian-Yun Nie. A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pages 553–562. ACM, 2015. 2, 3
  • Alane Suhr, Mike Lewis, James Yeh, and Yoav Artzi. A corpus of natural language for visual reasoning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 217–223, 2017. 2
  • Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. A corpus for reasoning about natural language grounded in photographs. arXiv preprint arXiv:1811.00491, 2018. 2
  • Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. Videobert: A joint model for video and language representation learning. arXiv preprint arXiv:1904.01766, 2019. 2
  • Hao Tan, Franck Dernoncourt, Zhe Lin, Trung Bui, and Mohit Bansal. Expressing visual relationships via language. arXiv preprint arXiv:1906.07689, 2019. 2
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017. 2, 5, 6
  • Ramakrishna Vedantam, Samy Bengio, Kevin Murphy, Devi Parikh, and Gal Chechik. Context-aware captions from context-agnostic supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 251–260, 2017. 2
  • Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015. 6
  • Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2015. 2
  • Cheng Wang, Haojin Yang, Christian Bartz, and Christoph Meinel. Image captioning with deep bidirectional lstms. In Proceedings of the 24th ACM international conference on Multimedia, pages 988–997. ACM, 2016. 2
  • Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7794–7803, 2018. 2
  • Yu-Siang Wang, Chenxi Liu, Xiaohui Zeng, and Alan Yuille. Scene graph parsing as dependency parsing. arXiv preprint arXiv:1803.09189, 2018. 3
  • Bin Wu, Chenyan Xiong, Maosong Sun, and Zhiyuan Liu. Query suggestion with feedback memory network. In Proceedings of the 2018 World Wide Web Conference, pages 1563–1571. International World Wide Web Conferences Steering Committee, 2018. 2, 3
  • Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057, 2015. 1, 2, 13
  • Linjie Yang, Kevin Tang, Jianchao Yang, and Li-Jia Li. Dense captioning with joint inference and visual context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2193–2202, 2017. 2
  • Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. Exploring visual relationship for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 684–699, 2018. 2
  • Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. Image captioning with semantic attention. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4651–4659, 2016. 2
  • Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. In European Conference on Computer Vision, pages 69–85. Springer, 2016. 2
  • Licheng Yu, Hao Tan, Mohit Bansal, and Tamara L Berg. A joint speaker-listener-reinforcer model for referring expressions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7282–7290, 2017. 2
  • Kaiyu Yue, Ming Sun, Yuchen Yuan, Feng Zhou, Errui Ding, and Fuxin Xu. Compact generalized non-local network. In Advances in Neural Information Processing Systems, pages 6510–6519, 2018. 2
  • Zheng-Jun Zha, Linjun Yang, Tao Mei, Meng Wang, and Zengfu Wang. Visual query suggestion. In Proceedings of the 17th ACM international conference on Multimedia, pages 15–24. ACM, 2009. 3
  • Zheng-Jun Zha, Linjun Yang, Tao Mei, Meng Wang, Zengfu Wang, Tat-Seng Chua, and Xian-Sheng Hua. Visual query suggestion: Towards capturing user intent in internet image search. ACM Trans. Multimedia Comput. Commun. Appl., 6(3), August 2010. 3
  • Zhen Zhu, Mengde Xu, Song Bai, Tengteng Huang, and Xiang Bai. Asymmetric non-local neural networks for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 593–602, 2019. 2