VD-BERT: A Unified Vision and Dialog Transformer with BERT

EMNLP 2020, pp. 3325–3338.


Abstract:

Visual dialog is a challenging vision-language task, where a dialog agent needs to answer a series of questions through reasoning on the image content and dialog history. Prior work has mostly focused on various attention mechanisms to model such intricate interactions. By contrast, in this work, we propose VD-BERT, a simple yet effective…

Introduction
  • Visual Dialog aims to build an AI agent that can answer a human’s questions about visual content in a natural conversational setting (Das et al, 2017).
  • Compared to VQA that predicts an answer based only on the question about the image (Figure 1a), VisDial needs to consider the dialog history.
  • Most previous work (Niu et al, 2019; Gan et al, 2019; Kang et al, 2019) uses the question as a query to attend to relevant image regions and the dialog history, and these interactions are usually further exploited to obtain better visual-historical cues for predicting the answer.
  • The attention flow in these methods is unidirectional – from question to the other components (Figure 1b)
Highlights
  • Visual Dialog aims to build an AI agent that can answer a human’s questions about visual content in a natural conversational setting (Das et al, 2017)
  • Inspired by its recent success in vision-language pretraining, we further extend BERT to achieve simple yet effective fusion of vision and dialog contents in VisDial tasks
  • We present VD-BERT, a novel unified vision-dialog Transformer framework for VisDial tasks
  • Following Das et al (2017), we evaluate our model using ranking metrics such as Recall@K (K ∈ {1, 5, 10}), Mean Reciprocal Rank (MRR), and Mean Rank, where only one answer candidate is considered correct (a sketch of these metrics follows this list)
  • We find that the visually grounded Masked Language Modeling (MLM) is crucial for transferring BERT into the multimodal setting, as indicated by a large performance drop when using only Next Sentence Prediction (NSP)
  • We have presented VD-BERT, a unified vision-dialog Transformer model that exploits pretrained BERT language models for visual dialog
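
As a rough illustration of the ranking metrics named above, the sketch below computes Recall@K, MRR, and Mean Rank for one dialog round from a model's scores over the 100 answer candidates. The function name and data layout are illustrative assumptions, not taken from the authors' code; per-round values are averaged over the evaluation split to obtain the reported numbers.

    import numpy as np

    def ranking_metrics(scores, gt_index, ks=(1, 5, 10)):
        # scores:   model scores for the 100 answer candidates of one round, shape (100,)
        # gt_index: index of the single ground-truth answer among the candidates
        order = np.argsort(-scores)                        # candidates sorted best-first
        rank = int(np.where(order == gt_index)[0][0]) + 1  # 1-based rank of the ground truth
        metrics = {"R@%d" % k: float(rank <= k) for k in ks}
        metrics["MRR"] = 1.0 / rank
        metrics["Mean Rank"] = float(rank)
        return metrics

    print(ranking_metrics(np.random.rand(100), gt_index=42))
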
Methods
  • The v0.9 dataset contains a training set of 82,783 images and a validation set of 40,504 images.
  • The v1.0 dataset combines the training and validation sets of v0.9 into one training set and adds another 2,064 images for validation and 8,000 images for testing.
  • Each image is associated with one caption and 10 question-answer pairs
  • Each question is paired with a list of 100 answer candidates, one of which is regarded as the correct answer (see the data-structure sketch after this list)
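
The sketch below shows one way the statistics above could be reflected in a data structure (one caption per image, 10 question-answer rounds, 100 candidates per question); the class and field names are hypothetical and not taken from the official VisDial loader.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class DialogRound:
        question: str
        answer: str                # ground-truth answer for this round
        candidates: List[str]      # 100 answer candidates, including the ground truth
        gt_index: int              # position of the ground truth within `candidates`

    @dataclass
    class VisDialExample:
        image_id: int
        caption: str               # one caption per image
        rounds: List[DialogRound]  # 10 question-answer rounds per image
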
Results
Conclusion
  • The authors have presented VD-BERT, a unified vision-dialog Transformer model that exploits pretrained BERT language models for visual dialog.
  • VD-BERT is capable of modeling all the interactions between an image and a multi-turn dialog within a single-stream Transformer encoder and enables the effective fusion of features from both modalities via simple visually grounded training (see the input-construction sketch after this list).
  • It can either rank or generate answers seamlessly.
  • The authors further conduct thorough experiments to analyze and interpret the model, providing insights for future transfer learning research on visual dialog tasks.
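
Building on the single-stream description above, the following is a minimal sketch of how such an input sequence might be assembled from image-region features and dialog text; the special tokens, ordering, and function name are assumptions for illustration rather than the paper's released implementation.

    def build_single_stream_input(image_regions, caption, history, question, answer):
        # image_regions: list of detected-region feature vectors (e.g. from an object detector)
        # caption / question / answer: lists of subword tokens
        # history: list of (question_tokens, answer_tokens) pairs from earlier rounds
        text = ["[CLS]"] + caption + ["[SEP]"]
        for q, a in history:
            text += q + a                              # earlier rounds, flattened in dialog order
        text += question + ["[SEP]"] + answer + ["[SEP]"]
        # In a single-stream encoder, region features and text tokens are embedded into the
        # same hidden space and concatenated, so self-attention can model every inter- and
        # intra-modality interaction; a classification head over the sequence can then score
        # the appended answer candidate.
        return image_regions + text
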
Summary
  • Introduction:

    Visual Dialog aims to build an AI agent that can answer a human’s questions about visual content in a natural conversational setting (Das et al, 2017).
  • Compared to VQA that predicts an answer based only on the question about the image (Figure 1a), VisDial needs to consider the dialog history.
  • Most previous work (Niu et al, 2019; Gan et al, 2019; Kang et al, 2019) uses the question as a query to attend to relevant image regions and the dialog history, and these interactions are usually further exploited to obtain better visual-historical cues for predicting the answer.
  • The attention flow in these methods is unidirectional – from question to the other components (Figure 1b)
  • Objectives:

    The authors aim to capture dense interactions both within and across modalities.
  • Methods:

    The v0.9 dataset contains a training set of 82,783 images and a validation set of 40,504 images.
  • The v1.0 dataset combines the training and validation sets of v0.9 into one training set and adds another 2,064 images for validation and 8,000 images for testing.
  • Each image is associated with one caption and 10 question-answer pairs
  • Each question is paired with a list of 100 answer candidates, one of which is regarded as the correct answer
  • Results:

    The authors compare VD-BERT with state-of-the-art published models, including NMN (Hu et al, 2017), CorefNMN (Kottur et al, 2018), GNN (Zheng et al, 2019), FGA (Schwartz et al, 2019), DVAN (Guo et al, 2019b), RvA (Niu et al, 2019), and DualVD (Jiang et al, 2019), under the discriminative setting.
  • Conclusion:

    The authors have presented VD-BERT, a unified vision-dialog Transformer model that exploits pretrained BERT language models for visual dialog.
  • VD-BERT is capable of modeling all the interactions between an image and a multi-turn dialog within a single-stream Transformer encoder and enables the effective fusion of features from both modalities via simple visually grounded training.
  • It can either rank or generate answers seamlessly (a sketch of the two self-attention masks such a model could use follows this list).
  • The authors further conduct thorough experiments to analyze and interpret the model, providing insights for future transfer learning research on visual dialog tasks.
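
Since the model is described as both ranking and generating answers within one encoder, the sketch below shows one plausible way to switch between a fully bidirectional self-attention mask (for ranking) and a sequence-to-sequence mask that is causal over the answer span (for generation). The exact masking scheme is not spelled out in this summary, so treat this as an assumption.

    import numpy as np

    def self_attention_mask(n_context, n_answer, generative=False):
        # n_context: number of image + caption + history + question tokens
        # n_answer:  number of answer tokens appended at the end of the sequence
        n = n_context + n_answer
        mask = np.ones((n, n), dtype=bool)           # ranking: full bidirectional attention
        if generative:
            mask[:n_context, n_context:] = False     # context cannot attend to the future answer
            mask[n_context:, n_context:] = np.tril(  # causal attention within the answer span
                np.ones((n_answer, n_answer), dtype=bool))
        return mask
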
Tables
  • Table1: Summary of results on the test-std split of the VisDial v1.0 dataset, as reported by the test server. "†" denotes an ensemble model and "∗" indicates fine-tuning on dense annotations. "↑" means higher values are better and "↓" means lower values are better. The best and second-best results in each column are in bold and underlined, respectively
  • Table2: Discriminative and generative results of various models on the val split of VisDial v0.9 dataset
  • Table3: Extensive ablation studies: (a) various training settings and (b) training contexts on v1.0 val; (c) Dense annotation fine-tuning with varying ranking methods and (d) various ensemble strategies on v1.0 test-std
  • Table4: NDCG scores on the VisDial v1.0 val split, broken down into 4 groups based on either the relevance score or the question type. The % value in parentheses denotes the corresponding data proportion (see the NDCG sketch after this list)
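
Table 4 reports NDCG against the dense relevance annotations; as a reference point, the sketch below computes NDCG with graded relevance over the 100 candidates. Truncating at K, the number of candidates with non-zero relevance, follows the common VisDial formulation and is an assumption here rather than the official evaluation script.

    import numpy as np

    def ndcg(scores, relevance):
        # scores:    model scores for the 100 candidates, shape (100,)
        # relevance: dense human relevance annotations in [0, 1], shape (100,)
        k = int((relevance > 0).sum())        # evaluate at K = number of relevant candidates
        order = np.argsort(-scores)[:k]       # top-K candidates as ranked by the model
        discounts = 1.0 / np.log2(np.arange(2, k + 2))
        dcg = float((relevance[order] * discounts).sum())
        idcg = float((np.sort(relevance)[::-1][:k] * discounts).sum())
        return dcg / idcg if idcg > 0 else 0.0
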
Related work
  • Visual Dialog. The Visual Dialog task was recently proposed by Das et al (2017), where a dialog agent needs to answer a series of questions grounded in an image. It is one of the most challenging vision-language tasks, requiring the agent not only to understand the image content in relation to the text, but also to reason through the dialog history. Previous work (Lu et al, 2017; Seo et al, 2017; Wu et al, 2018; Kottur et al, 2018; Jiang et al, 2019; Yang et al, 2019; Guo et al, 2019a; Niu et al, 2019) focuses on developing a variety of attention mechanisms to model the interactions among entities such as the image, the question, and the dialog history. For example, Kang et al (2019) proposed DAN, a dual attention module that first refers to relevant contexts in the dialog history and then finds indicative image regions. ReDAN, proposed by Gan et al (2019), further explores the interactions between the image and the dialog history via multi-step reasoning.
Study subjects and analysis
Question-answer pairs per image: 10
The v1.0 dataset combines the training and validation sets of v0.9 into one training set and adds another 2,064 images for validation and 8,000 images for testing (hosted blindly on the task organizers' server). Each image is associated with one caption and 10 question-answer pairs. Each question is paired with a list of 100 answer candidates, one of which is regarded as the correct answer.

Reference
  • Chris Alberti, Jeffrey Ling, Michael Collins, and David Reitter. 2019. Fusion of detected objects in text for visual question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 2131–2140. Association for Computational Linguistics.
  • Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: visual question answering. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 2425–2433.
  • Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. CoRR, abs/1607.06450.
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
  • Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to rank: from pairwise approach to listwise approach. In Machine Learning, Proceedings of the Twenty-Fourth International Conference (ICML 2007), Corvallis, Oregon, USA, June 20-24, 2007, volume 227 of ACM International Conference Proceeding Series, pages 129–136. ACM.
  • Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2019. UNITER: learning universal image-text representations. CoRR, abs/1909.11740.
  • Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jose M. F. Moura, Devi Parikh, and Dhruv Batra. 2017. Visual dialog. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 1080–1089.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186.
  • Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 13042–13054.
  • Zhe Gan, Yu Cheng, Ahmed El Kholy, Linjie Li, Jingjing Liu, and Jianfeng Gao. 2019. Multi-step reasoning via recurrent dual attention for visual dialog. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 6463–6474.
  • Dalu Guo, Chang Xu, and Dacheng Tao. 2019a. Imagequestion-answer synergistic network for visual dialog. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 10434–10443.
  • Dan Guo, Hui Wang, and Meng Wang. 2019b. Dual visual attention network for visual dialog. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, pages 4989–4995. ijcai.org.
  • Dan Guo, Hui Wang, Hanwang Zhang, Zheng-Jun Zha, and Meng Wang. 2020. Iterative contextaware graph inference for visual dialog. CoRR, abs/2004.02194.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778. IEEE Computer Society.
  • Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. 2017. Learning to reason: End-to-end module networks for visual question answering. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 804–813. IEEE Computer Society.
  • Xiaoze Jiang, Jing Yu, Zengchang Qin, Yingying Zhuang, Xingxing Zhang, Yue Hu, and Qi Wu. 2019. Dualvd: An adaptive dual encoding model for deep visual understanding in visual dialogue. CoRR, abs/1911.07251.
  • Gi-Cheon Kang, Jaeseo Lim, and Byoung-Tak Zhang. 2019. Dual attention networks for visual reference resolution in visual dialog. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 2024–2033. Association for Computational Linguistics.
  • Andrej Karpathy and Fei-Fei Li. 2015. Deep visualsemantic alignments for generating image descriptions. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 3128–3137. IEEE Computer Society.
  • Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara L. Berg. 2014. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 787–798. ACL.
  • Hyounghun Kim, Hao Tan, and Mohit Bansal. 2020. Modality-balanced models for visual dialogue. CoRR, abs/2001.06354.
  • Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
  • Satwik Kottur, Jose M. F. Moura, Devi Parikh, Dhruv Batra, and Marcus Rohrbach. 2018. Visual coreference resolution in visual dialog using neural module networks. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XV, pages 160– 178.
  • Vishvak Murahari, Dhruv Batra, Devi Parikh, and Abhishek Das. 2019. Large-scale pretraining for visual dialog: A simple state-of-the-art baseline. CoRR, abs/1912.02379.
  • Van-Quang Nguyen, Masanori Suganuma, and Takayuki Okatani. 2019. Efficient attention mechanism for handling all the interactions between many inputs with application to visual dialog. CoRR, abs/1911.11390.
  • Yulei Niu, Hanwang Zhang, Manli Zhang, Jianhong Zhang, Zhiwu Lu, and Ji-Rong Wen. 2019. Recursive visual attention in visual dialog. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 6679–6688.
  • Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 2227–2237. Association for Computational Linguistics.
  • Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73.
  • Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Ming Zhou. 2019a. Unicoder-vl: A universal encoder for vision and language by cross-modal pretraining. CoRR, abs/1908.06066.
  • Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019b. Visualbert: A simple and performant baseline for vision and language. CoRR, abs/1908.03557.
  • Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 13–23.
  • Jiasen Lu, Anitha Kannan, Jianwei Yang, Devi Parikh, and Dhruv Batra. 2017. Best of both worlds: Transferring knowledge from discriminative learning to a generative visual dialog model. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 314–324.
  • Jiaxin Qi, Yulei Niu, Jianqiang Huang, and Hanwang Zhang. 2019. Two causal principles for improving visual dialog. CoRR, abs/1911.10496.
  • Tao Qin, Tie-Yan Liu, and Hang Li. 2010. A general approximation framework for direct optimization of information retrieval measures. Inf. Retr., 13(4):375–397.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
  • Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 91–99.
  • Idan Schwartz, Seunghak Yu, Tamir Hazan, and Alexander G. Schwing. 2019. Factor graph attention. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 2039–2048.
  • Paul Hongsuck Seo, Andreas M. Lehrmann, Bohyung Han, and Leonid Sigal. 2017. Visual reference resolution using attention memory for visual dialog. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 3719–3729.
  • Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 2556–2565.
  • Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2019. VL-BERT: pretraining of generic visual-linguistic representations. CoRR, abs/1908.08530.
  • Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. 2019. A corpus for reasoning about natural language grounded in photographs. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 6418–6428.
  • Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019. Videobert: A joint model for video and language representation learning. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pages 7463–7472. IEEE.
  • Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3104–3112.
  • Hao Tan and Mohit Bansal. 2019. LXMERT: learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 5099–5110. Association for Computational Linguistics.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5998–6008.
  • Taesun Whang, Dongyub Lee, Chanhee Lee, Kisu Yang, Dongsuk Oh, and Heuiseok Lim. 2019. Domain adaptive training BERT for response selection. CoRR, abs/1908.04812.
  • Qi Wu, Peng Wang, Chunhua Shen, Ian D. Reid, and Anton van den Hengel. 2018. Are you talking to me? reasoned visual dialog generation through adversarial learning. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 6106–6115.
  • Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.
  • Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. 2008. Listwise approach to learning to rank: theory and algorithm. In Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), Helsinki, Finland, June 5-9, 2008, volume 307 of ACM International Conference Proceeding Series, pages 1192–1199. ACM.
  • Ning Xie, Farley Lai, Derek Doran, and Asim Kadav. 2019. Visual entailment: A novel task for fine-grained image understanding. CoRR, abs/1901.06706.
  • Tianhao Yang, Zheng-Jun Zha, and Hanwang Zhang. 2019. Making history matter: History-advantage sequence training for visual dialog. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pages 2561–2569. IEEE.
  • Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2:67–78.
  • Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. From recognition to cognition: Visual commonsense reasoning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 6720–6731. Computer Vision Foundation / IEEE.
  • Zilong Zheng, Wenguan Wang, Siyuan Qi, and SongChun Zhu. 2019. Reasoning visual dialogs with structural and partial observations. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 6669–6678.
  • Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, and Jianfeng Gao. 2019. Unified vision-language pre-training for image captioning and VQA. CoRR, abs/1909.11059.