Iterative Context-Aware Graph Inference for Visual Dialog

CVPR, pp. 10052-10061, 2020.

DOI: https://doi.org/10.1109/CVPR42600.2020.01007

Abstract:

Visual dialog is a challenging task that requires the comprehension of the semantic dependencies among implicit visual and textual contexts. This task can refer to the relation inference in a graphical model with sparse contexts and unknown graph structure (relation descriptor), and how to model the underlying context-aware relation inference …

Introduction
  • Cross-modal semantic understanding between vision and language has attracted increasing interest in tasks such as image captioning [35, 4, 36, 40, 20], referring expression [11, 41, 21], and visual question answering (VQA) [3, 18, 38, 39].
Highlights
  • We propose a Context-Aware Graph (CAG) neural network for visual dialog, which aims to discover the partially relevant contexts and build a dynamic graph structure
  • In our experiments, the compared methods are grouped into three types: (1) Fusion-based Models (LF [5] and the hierarchical recurrent encoder (HRE) [5]); (2) Attention-based Models (HREA [5], MN [5], history-conditioned image attention [23], attention memory [29], CoAtt [34], CorefNMN [15], dual visual attention [9], RVA [24], Synergistic [10], DAN [13], and HACAN [37]); and (3) Graph-based Methods (GNN [42] and FGA [28])
  • CAG consistently outperforms most of the compared methods
  • We propose a fine-grained Context-Aware Graph (CAG) neural network for visual dialog, whose graph nodes contain both visual-object and textual-history context semantics (a minimal sketch of such joint nodes follows this list)
  • Experimental results on the VisDial v0.9 and v1.0 datasets validate the effectiveness of the proposed approach and provide explainable visualization results
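The joint visual-textual graph nodes mentioned above can be illustrated with a short sketch. The module below is not the authors' implementation; the dimensions, the question-guided attention over history rounds, and the concatenation-based fusion are assumptions made only to show how a detected-object feature and a history context vector can be combined into a single node.

```python
# A minimal sketch of building context-aware graph nodes: each node fuses one
# detected-object feature with a question-attended history context vector.
# Tensor shapes and the fusion operator are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAwareNodes(nn.Module):
    def __init__(self, obj_dim=2048, ctx_dim=512, node_dim=512):
        super().__init__()
        self.obj_proj = nn.Linear(obj_dim, node_dim)   # project object features
        self.his_proj = nn.Linear(ctx_dim, node_dim)   # project history sentences
        self.q_proj = nn.Linear(ctx_dim, node_dim)     # project question feature
        self.fuse = nn.Linear(2 * node_dim, node_dim)  # joint visual-textual node

    def forward(self, obj_feats, his_feats, q_feat):
        # obj_feats: (B, N, obj_dim)  N detected objects per image
        # his_feats: (B, R, ctx_dim)  R history rounds (caption + previous QA pairs)
        # q_feat:    (B, ctx_dim)     current question embedding
        v = self.obj_proj(obj_feats)                          # (B, N, d)
        h = self.his_proj(his_feats)                          # (B, R, d)
        q = self.q_proj(q_feat).unsqueeze(1)                  # (B, 1, d)
        # question-guided attention over history rounds -> one textual context vector
        att = F.softmax((h * q).sum(-1), dim=-1)              # (B, R)
        his_ctx = torch.bmm(att.unsqueeze(1), h)              # (B, 1, d)
        his_ctx = his_ctx.expand(-1, v.size(1), -1)           # broadcast to every object
        # each graph node carries both visual and history context semantics
        nodes = torch.tanh(self.fuse(torch.cat([v, his_ctx], dim=-1)))  # (B, N, d)
        return nodes
```

The point of the sketch is only that every node carries both modalities before any graph inference starts; the exact fusion used by CAG follows the paper.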
Methods
  • VisDial v0.9 contains 83k and 40k dialogs on COCO-train and COCO-val images [19], respectively, totaling 1.2M QA pairs.
  • The new VisDial v1.0 train, validation, and test splits contain 123k, 2k, and 8k dialogs, respectively.
  • Each dialog in VisDial v0.9 consists of 10 rounds of QA pairs per image.
  • In the test split of VisDial v1.0, each dialog has a flexible number of QA rounds m, where m ranges from 1 to 10.
  • The model is trained with a multi-class N-pair loss [10, 23] (a minimal sketch follows this list).
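As a reference for this objective, here is a minimal, framework-level sketch of a multi-class N-pair loss over the scores a discriminative decoder assigns to the 100 candidate answers of a round. The function name and tensor shapes are assumptions, and the formulation shown, log(1 + Σ_{i≠gt} exp(s_i − s_gt)), i.e. softmax cross-entropy on the ground-truth candidate, follows the common usage in [10, 23] rather than the authors' exact code.

```python
# A minimal sketch (author-agnostic) of a multi-class N-pair loss over candidate
# answer scores. For scores s and ground-truth index g it computes
#   L = log(1 + sum_{i != g} exp(s_i - s_g)),
# which equals softmax cross-entropy on the ground-truth candidate.
import torch

def n_pair_loss(scores: torch.Tensor, gt_index: torch.Tensor) -> torch.Tensor:
    # scores:   (B, 100) similarity of the fused dialog embedding to each candidate
    # gt_index: (B,)     index of the ground-truth answer among the 100 candidates
    gt_score = scores.gather(1, gt_index.unsqueeze(1))               # (B, 1)
    # log(1 + sum exp(s_i - s_g)) == logsumexp(s) - s_g, computed stably here
    loss = torch.logsumexp(scores, dim=1, keepdim=True) - gt_score   # (B, 1)
    return loss.mean()

# usage: loss = n_pair_loss(candidate_scores, gt_idx); loss.backward()
```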
Results
  • Among attention-based models, CAG outperforms DAN [13] on all evaluation metrics.
  • In iterative step t = 2, the question command shifts attention to both the words “snowboarder” and “wearing” (a minimal sketch of such a step-wise question command follows this list).
  • These two object nodes dynamically update their neighbor nodes under the guidance of the current question command qw(t=2).
  • The graph attention maps overlaid on the image demonstrate the effectiveness of the graph inference
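The focus shift described above (attention moving to “snowboarder” and “wearing” at t = 2) comes from a question command that is re-estimated at every inference step. The sketch below shows one plausible way to do this with soft attention over question-word features conditioned on the previous command; the module name and conditioning scheme are assumptions, not the paper's exact design.

```python
# A minimal sketch of a step-wise question command: at each graph-inference step t,
# soft attention over the question words is recomputed, conditioned on the previous
# command, so the focus can move across words between steps.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionCommand(nn.Module):
    def __init__(self, word_dim=512):
        super().__init__()
        self.cond = nn.Linear(2 * word_dim, word_dim)  # condition words on previous command
        self.score = nn.Linear(word_dim, 1)            # per-word attention logit

    def forward(self, word_feats, prev_cmd):
        # word_feats: (B, L, d) contextualized question-word features
        # prev_cmd:   (B, d)    command from the previous step (zeros at t = 1)
        prev = prev_cmd.unsqueeze(1).expand_as(word_feats)         # (B, L, d)
        joint = torch.tanh(self.cond(torch.cat([word_feats, prev], dim=-1)))
        att = F.softmax(self.score(joint).squeeze(-1), dim=-1)     # (B, L) word attention
        cmd = torch.bmm(att.unsqueeze(1), word_feats).squeeze(1)   # (B, d) new command q_w^(t)
        return cmd, att
```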
Conclusion
  • The authors propose a fine-grained Context-Aware Graph (CAG) neural network for visual dialog, which contains both visual-objects and textual-history context semantics.
  • An adaptive top-K message passing mechanism is proposed to iteratively explore the context-aware representations of nodes and update the edge relationships for better answer inference (see the sketch after this list).
  • The authors' solution is a dynamic directed-graph inference process.
  • Experimental results on the VisDial v0.9 and v1.0 datasets validate the effectiveness of the proposed approach and display explainable visualization results.
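To make the adaptive top-K message passing and the final graph-attention readout concrete, here is a minimal sketch under assumed shapes: each node scores all other nodes under the current question command, keeps only its K highest-scoring neighbors, aggregates their messages, and updates its state; after T steps a graph attention pools the nodes into one embedding for answer inference. The value of K, the scoring function, and the GRU-based update are illustrative choices, not the authors' exact formulation.

```python
# A minimal sketch of adaptive top-K message passing over context-aware graph nodes,
# followed by a graph-attention readout. Shapes, K, and the update rule are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMessagePassing(nn.Module):
    def __init__(self, dim=512, k=8):
        super().__init__()
        self.k = k
        self.msg = nn.Linear(dim, dim)      # message transform
        self.upd = nn.GRUCell(dim, dim)     # node-state update
        self.readout = nn.Linear(dim, 1)    # graph attention logit per node

    def step(self, nodes, cmd):
        # nodes: (B, N, d) current node states; cmd: (B, d) question command for this step
        B, N, d = nodes.shape
        guided = nodes * cmd.unsqueeze(1)                        # question-guided node features
        scores = torch.bmm(guided, nodes.transpose(1, 2))        # (B, N, N) directed edge scores
        topv, topi = scores.topk(min(self.k, N), dim=-1)         # keep K best neighbours per node
        mask = torch.full_like(scores, float('-inf')).scatter(-1, topi, topv)
        adj = F.softmax(mask, dim=-1)                            # sparse, row-normalized adjacency
        messages = torch.bmm(adj, self.msg(nodes))               # (B, N, d) aggregated messages
        new_nodes = self.upd(messages.reshape(B * N, d), nodes.reshape(B * N, d))
        return new_nodes.view(B, N, d)

    def forward(self, nodes, cmds):
        # cmds: list of per-step question commands q_w^(1..T)
        for cmd in cmds:
            nodes = self.step(nodes, cmd)
        att = F.softmax(self.readout(nodes).squeeze(-1), dim=-1)   # attention over nodes
        graph_emb = torch.bmm(att.unsqueeze(1), nodes).squeeze(1)  # (B, d) final graph embedding
        return graph_emb
```

Because each node picks its own K neighbors and the edge scores are not symmetrized, the resulting adjacency is a dynamic directed graph, which matches the inference process described above.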
Tables
  • Table 1: Ablation studies of different iterative steps T and the main components on VisDial val v0.9
  • Table 2: Performance comparison on VisDial val v0.9 with VGG
  • Table 3: Main comparisons on both VisDial v0.9 and v1.0 datasets using the discriminative decoder [23]
  • Table 4: Ablation studies of different adjacent correlation matrix learning strategies on VisDial val v0.9
Funding
  • This work is supported by the National Natural Science Foundation of China (NSFC) under Grants 61725203, 61876058, 61732008, 61622211, and U19B2038
Reference
  • Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, pages 6077–6086, 2018. 3
  • Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In CVPR, pages 39–48, 2016. 2
  • Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In ICCV, pages 2425–2433, 2015. 1
  • Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, and Tat-Seng Chua. SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning. In CVPR, pages 6298–6306, 2017. 1
  • Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jose M. F. Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In CVPR, pages 326–335, 2017. 1, 2, 5, 6, 7
  • Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron Courville. Guesswhat?! visual object discovery through multi-modal dialogue. In CVPR, pages 5503–5512, 2017. 1, 2
  • Zhe Gan, Yu Cheng, Ahmed El Kholy, Linjie Li, Jingjing Liu, and Jianfeng Gao. Multi-step reasoning via recurrent dual attention for visual dialog. In ACL, pages 6463–6474, 2019. 2
  • Jiuxiang Gu, Handong Zhao, Zhe Lin, Sheng Li, Jianfei Cai, and Mingyang Ling. Scene graph generation with external knowledge and image reconstruction. In CVPR, pages 1969– 1978, 2019. 2
  • Dan Guo, Hui Wang, and Meng Wang. Dual visual attention network for visual dialog. In IJCAI, pages 4989–4995, 2019. 2, 6, 7
  • Dalu Guo, Chang Xu, and Dacheng Tao. Image-questionanswer synergistic network for visual dialog. In CVPR, pages 10434–10443, 2019. 5, 6, 7
  • Ronghang Hu, Huazhe Xu, Marcus Rohrbach, Jiashi Feng, Kate Saenko, and Trevor Darrell. Natural language object retrieval. In CVPR, pages 4555–4564, 2016. 1
  • Drew A. Hudson and Christopher D. Manning. Compositional attention networks for machine reasoning. In ICLR, pages 1–1, 2018. 2
  • Gi-Cheon Kang, Jaeseo Lim, and Byoung-Tak Zhang. Dual attention networks for visual reference resolution in visual dialog. In EMNLP, pages 2024–2033, 2019. 6, 7
  • Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. 5
  • Satwik Kottur, Jose M. F. Moura, Devi Parikh, Dhruv Batra, and Marcus Rohrbach. Visual coreference resolution in visual dialog using neural module networks. In ECCV, pages 153–169, 2018. 2, 6, 7
  • Satwik Kottur, Jose M. F. Moura, Devi Parikh, Dhruv Batra, and Marcus Rohrbach. Clevr-dialog: A diagnostic dataset for multi-round reasoning in visual dialog. In NAACL, pages 582–595, 2019. 1, 2
  • Maosen Li, Siheng Chen, Xu Chen, Ya Zhang, Yanfeng Wang, and Qi Tian. Actional-structural graph convolutional networks for skeleton-based action recognition. In CVPR, pages 3595–3603, 2019. 2
  • Junwei Liang, Lu Jiang, Liangliang Cao, Yannis Kalantidis, Li-Jia Li, and Alexander G. Hauptmann. Focal visual-text attention for memex question answering. IEEE transactions on pattern analysis and machine intelligence, 41(8):1893– 1908, 2019. 1
  • Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755, 2014. 5
  • Daqing Liu, Zheng-Jun Zha, Hanwang Zhang, Yongdong Zhang, and Feng Wu. Context-aware visual policy network for sequence-level image captioning. In ACM MM, pages 1416–1424, 2018. 1
  • Xuejing Liu, Liang Li, Shuhui Wang, Zheng-Jun Zha, Dechao Meng, and Qingming Huang. Adaptive reconstruction network for weakly supervised referring expression grounding. In ICCV, pages 2611–2620, 2019. 1
  • Yong Liu, Ruiping Wang, Shiguang Shan, and Xilin Chen. Structure inference net: Object detection using scene-level context and instance-level relationships. In CVPR, pages 6985–6994, 2018. 2
  • Jiasen Lu, Anitha Kannan, Jianwei Yang, Devi Parikh, and Dhruv Batra. Best of both worlds: Transferring knowledge from discriminative learning to a generative visual dialog model. In NeurIPS, pages 314–324. 2017. 2, 5, 6, 7
  • Yulei Niu, Hanwang Zhang, Manli Zhang, Jianhong Zhang, Zhiwu Lu, and Ji-Rong Wen. Recursive visual attention in visual dialog. In CVPR, pages 6679–6688, 2019. 2, 6, 7
  • Will Norcliffe-Brown, Stathis Vafeias, and Sarah Parisot. Learning conditioned graph structures for interpretable visual question answering. In NeurIPS, pages 8344–8353, 2018. 2
  • Y. Peng and J. Chi. Unsupervised cross-media retrieval using domain adaptation with scene graph. IEEE Transactions on Circuits and Systems for Video Technology, pages 1–1, 2019. 2
  • Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In EMNLP, pages 1532–1543, 2014. 5
  • Idan Schwartz, Seunghak Yu, Tamir Hazan, and Alexander G Schwing. Factor graph attention. In CVPR, pages 2039– 2048, 2019. 2, 6, 7
  • Paul Hongsuck Seo, Andreas Lehrmann, Bohyung Han, and Leonid Sigal. Visual reference resolution using attention memory for visual dialog. In NeurIPS, pages 3719–3729, 2017. 2, 6, 7
  • Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research (JMLR), 15(1):1929–1958, 2014. 5
  • Damien Teney, Lingqiao Liu, and Anton van den Hengel. Graph-structured representations for visual question answering. In CVPR, pages 3233–3241, 2017. 2
  • Peng Wang, Qi Wu, Jiewei Cao, Chunhua Shen, Lianli Gao, and Anton van den Hengel. Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks. In CVPR, pages 1960–1968, 2019. 2
  • Yining Wang, Liwei Wang, Yuanzhi Li, Di He, and Tie-Yan Liu. A theoretical analysis of NDCG type ranking measures. In COLT, pages 25–54, 2013. 7
  • Qi Wu, Peng Wang, Chunhua Shen, Ian Reid, and Anton van den Hengel. Are you talking to me? reasoned visual dialog generation through adversarial learning. In CVPR, pages 6106–6115, 2018. 2, 6, 7
  • Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, pages 2048–2057, 2015. 1
  • Min Yang, Wei Zhao, Wei Xu, Yabing Feng, Zhou Zhao, Xiaojun Chen, and Kai Lei. Multitask learning for crossdomain image captioning. IEEE Transactions on Multimedia, 21(4):1047–1061, 2019. 1
  • Tianhao Yang, Zheng-Jun Zha, and Hanwang Zhang. Making history matter: History-advantage sequence training for visual dialog. In ICCV, pages 2561–2569, 2019. 6, 7
  • Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian. Deep modular co-attention networks for visual question answering. In CVPR, pages 6281–6290, 2019. 1
  • Zheng-Jun Zha, Jiawei Liu, Tianhao Yang, and Yongdong Zhang. Spatiotemporal-textual co-attention network for video question answering. The ACM Transactions on Multimedia Computing, Communications, and Applications, 15(2s):53:1–53:18, 2019. 1
  • Zheng-Jun Zha, Daqing Liu, Hanwang Zhang, Yongdong Zhang, and Feng Wu. Context-aware visual policy network for fine-grained image captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2019. 1
  • Fang Zhao, Jianshu Li, Jian Zhao, and Jiashi Feng. Weakly supervised phrase localization with multi-scale anchored transformer network. In CVPR, pages 5696–5705, 2018. 1
  • Zilong Zheng, Wenguan Wang, Siyuan Qi, and Song-Chun Zhu. Reasoning visual dialogs with structural and partial observations. In CVPR, pages 6669–6678, 2019. 2, 6, 7