Visual Commonsense R-CNN

CVPR, pp. 10757-10767, 2020.

DOI: https://doi.org/10.1109/CVPR42600.2020.01077
Other Links: arxiv.org|dblp.uni-trier.de

Abstract:

We present a novel unsupervised feature representation learning method, Visual Commonsense Region-based Convolutional Neural Network (VC R-CNN), to serve as an improved visual region encoder for high-level tasks such as captioning and VQA. Given a set of detected object regions in an image (e.g., using Faster R-CNN), like any other unsupervised…

Introduction
  • It is not hard to spot the “cognitive errors” committed by machines due to the lack of common sense.
  • As shown in Figure 1, by using only the visual features, e.g., the prevailing Faster R-CNN [54] based Up-Down [2], machines usually fail to describe the exact visual relationships, or, even when the prediction is correct, the underlying visual attention is not reasonable.
  • Previous works attribute this to dataset bias without further justification [24, 44, 53, 7], e.g., the large concept co-occurrence gap in Figure 1; here, the authors take a closer look by appreciating the difference between the “visual” and “commonsense” features.
  • The authors are certainly not the first to believe that visual features should include more commonsense knowledge; rather, …
Highlights
  • It is not hard to spot the “cognitive errors” committed by machines due to the lack of common sense.
  • The potential reason lies in the limited ability of current question understanding, which cannot be resolved by “visual” common sense.
  • We presented a novel unsupervised feature representation learning method called VC R-CNN (Visual Commonsense Region-based Convolutional Neural Network) that can be built on any R-CNN framework, supporting a variety of high-level tasks by using only feature concatenation.
  • The key novelty of VC R-CNN is that the learning objective is based on causal intervention, which is fundamentally different from the conventional likelihood (see the sketch after this list).
  • We intend to study the potential of our VC R-CNN applied to other modalities such as video and 3D point clouds.
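The contrast between the two objectives can be written down as a minimal sketch using the standard backdoor adjustment, assuming a confounder variable Z that satisfies the backdoor criterion (Z and the summation form are illustrative assumptions; the exact parameterization used in the paper is not shown here):

    \[
      % Conventional likelihood: conditioning on X also shifts the confounder Z.
      P(Y \mid X) \;=\; \sum_{z} P(Y \mid X, z)\, P(z \mid X)
    \]
    \[
      % Backdoor adjustment (causal intervention): Z keeps its prior P(z),
      % so spurious correlations induced by Z are removed.
      P(Y \mid do(X)) \;=\; \sum_{z} P(Y \mid X, z)\, P(z)
    \]

Here X can be read as the feature of a region and Y as a contextual object to be predicted; the intervention objective replaces the conventional likelihood above.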
Methods
  • The authors used the following two datasets for the unsupervised training of VC R-CNN.
  • MS-COCO Detection [36].
  • It is a popular benchmark dataset for classification, detection and segmentation in the community.
  • It contains 82,783, 40,504 and 40,775 images for training, validation and testing respectively with 80 annotated classes.
  • Recall that VC R-CNN relies on the context … (a minimal sketch of the context-prediction proxy task follows this list).
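To make the context dependence concrete, below is a minimal PyTorch-style sketch of a context-prediction proxy task: given the RoI feature of one region, predict the class of a co-occurring region in the same image. All names (ContextPredictor, feat_dim, etc.) are hypothetical, the 80 classes simply mirror MS-COCO, and the causal-intervention machinery that distinguishes VC R-CNN from a plain likelihood model is deliberately omitted.

    import torch
    import torch.nn as nn

    class ContextPredictor(nn.Module):
        """Toy proxy task: from the RoI feature of region x, predict the
        object class of a co-occurring context region y (hypothetical sketch;
        VC R-CNN additionally applies the causal intervention at this step)."""

        def __init__(self, feat_dim: int = 2048, num_classes: int = 80):
            super().__init__()
            self.classifier = nn.Sequential(
                nn.Linear(feat_dim, 1024),
                nn.ReLU(inplace=True),
                nn.Linear(1024, num_classes),
            )

        def forward(self, roi_feat_x: torch.Tensor) -> torch.Tensor:
            # roi_feat_x: (batch, feat_dim) RoI-pooled features of region x
            return self.classifier(roi_feat_x)  # logits over context-object classes

    # Usage sketch: supervise with the class label of the context region.
    model = ContextPredictor()
    x = torch.randn(4, 2048)                # fake RoI features
    y_context = torch.randint(0, 80, (4,))  # fake context-region labels
    loss = nn.CrossEntropyLoss()(model(x), y_context)
    loss.backward()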
Results
  • The authors compared the VC representation with ablative features on two representative approaches: Up-Down [2] and AoANet [25].
  • The authors' proposed +VC outperforms all the other ablative representations on the three answer types, achieving state-of-the-art performance.
  • Although the VC feature is limited by question understanding, the authors still obtain absolute gains from simple feature concatenation (sketched below), whereas previous methods with complicated module stacks achieve only slight improvements.
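The “+VC” variants refer to nothing more than channel-wise concatenation of the new feature with the existing region feature. A minimal sketch follows, assuming pre-extracted, region-aligned Faster R-CNN (2048-d) and VC (1024-d is assumed here) features; the array names and dimensions are illustrative:

    import numpy as np

    # Hypothetical pre-extracted features for one image, aligned by region index.
    num_regions = 36
    frcnn_feats = np.random.randn(num_regions, 2048).astype(np.float32)  # Up-Down / Faster R-CNN
    vc_feats = np.random.randn(num_regions, 1024).astype(np.float32)     # VC R-CNN (dim assumed)

    # "+VC" in the ablations is simply concatenation along the channel axis;
    # the downstream captioning/VQA model then consumes the wider region features.
    region_feats = np.concatenate([frcnn_feats, vc_feats], axis=1)
    assert region_feats.shape == (num_regions, 2048 + 1024)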
Conclusion
  • The authors presented a novel unsupervised feature representation learning method called VC R-CNN that can be built on any R-CNN framework, supporting a variety of high-level tasks by using only feature concatenation.
  • The key novelty of VC R-CNN is that the learning objective is based on causal intervention, which is fundamentally different from the conventional likelihood.
  • The authors intend to study the potential of VC R-CNN applied to other modalities such as video and 3D point clouds.
Tables
  • Table 1: The image captioning performance of two representative models with ablative features on the Karpathy split; metrics include B4 (BLEU-4), M (METEOR), R (ROUGE-L), C (CIDEr), and S (SPICE).
  • Table 2: The performance of various single models on the online MS-COCO test server, where Up-Down+VC and AoANet†+VC are short for the VC feature concatenated to [2] in Up-Down and AoANet†. Also: the image captioning performance of two models with ablative features (based on the vanilla Faster R-CNN feature) on the Karpathy split.
  • Table 3: Hallucination analysis [55] of various models on MS-COCO.
  • Table 4: Accuracy (%) of various ablative features on the VQA2.0 validation set. Since Obj achieves almost the same results as in the original paper, the two rows are merged.
  • Table 5: Single-model accuracies (%) on the VQA2.0 test-dev and test sets, where Up-Down+VC and MCAN+VC are short for the VC feature concatenated in Up-Down and MCAN.
  • Table 6: Experimental results on VCR with various visual features; ViLBERT† [41] denotes ViLBERT without the pretraining process.
  • Table 7: Ablation studies of our proposed intervention, trained on …
Related work
  • Multimodal Feature Learning. With the recent success of pre-trained language models (LMs) [12, 10, 51] in NLP, several approaches [41, 60, 61, 9] seek weakly-supervised learning from large, unlabelled multi-modal data to encode visual-semantic knowledge. However, all these methods suffer from the reporting bias [65, 37] of language and the great memory cost of downstream fine-tuning. In contrast, our VC R-CNN is trained unsupervised from images alone, and the learned feature can simply be concatenated to the original representations.
  • Un-/Self-supervised Visual Feature Learning [14, 63, 43, 29, 76]. These methods aim to learn visual features through an elaborate proxy task such as denoising autoencoders [6, 66], context and rotation prediction [13, 18], and data augmentation [33]. Context prediction is learned from correlation, while image rotation and augmentation can be regarded as applying a randomized controlled trial [50], which is active and non-observational (physical); by contrast, our VC R-CNN learns from observational causal inference, which is passive and observational (imaginative).
  • Visual Common Sense. Previous methods mainly fall into two groups: 1) learning from images with commonsense knowledge bases [65, 73, 57, 59, 68, 77] and 2) learning actions from videos [19]. However, the first limits common sense to human-annotated knowledge, while the latter is essentially, again, learning from correlation.
  • Causality in Vision. There has been a growing effort to marry the complementary strengths of deep learning and causal reasoning [49, 48], explored in several contexts including image classification [8, 40], reinforcement learning [46, 11, 5], and adversarial learning [28, 26]. Lately, we are aware of some contemporary works on visual causality such as visual dialog [52], image captioning [72], and scene graph generation [62]. Different from their task-specific causal inference, VC R-CNN offers a generic feature extractor.
Funding
  • This work was partially supported by the NTU-Alibaba JRI and the Singapore Ministry of Education (MOE) Academic Research Fund (AcRF) Tier 1 grant
References
  • Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. SPICE: Semantic propositional image caption evaluation. In ECCV, 2016.
  • Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018.
  • Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In ICCV, 2015.
  • Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACLW, 2005.
  • Yoshua Bengio, Tristan Deleu, Nasim Rahaman, Rosemary Ke, Sebastien Lachapelle, Olexa Bilaniuk, Anirudh Goyal, and Christopher Pal. A meta-transfer objective for learning to disentangle causal mechanisms. arXiv preprint arXiv:1901.10912, 2019.
  • Yoshua Bengio, Eric Laufer, Guillaume Alain, and Jason Yosinski. Deep generative stochastic networks trainable by backprop. In ICML, 2014.
  • Remi Cadene, Corentin Dancette, Matthieu Cord, Devi Parikh, et al. RUBi: Reducing unimodal biases for visual question answering. In NIPS, 2019.
  • Krzysztof Chalupka, Pietro Perona, and Frederick Eberhardt. Visual causal feature learning. arXiv preprint arXiv:1412.2309, 2014.
  • Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: Learning universal image-text representations. arXiv preprint arXiv:1909.11740, 2019.
  • Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In ACL, 2019.
  • Ishita Dasgupta, Jane Wang, Silvia Chiappa, Jovana Mitrovic, Pedro Ortega, David Raposo, Edward Hughes, Peter Battaglia, Matthew Botvinick, and Zeb Kurth-Nelson. Causal reasoning from meta-reinforcement learning. arXiv preprint arXiv:1901.08162, 2019.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
  • Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
  • Justin Domke, Alap Karapurkar, and Yiannis Aloimonos. Who killed the directed model? In CVPR, 2008.
  • Alexander D'Amour. On multi-cause approaches to causal inference with unobserved confounding: Two cautionary failure cases and a promising alternative. In AISTATS, 2019.
  • Peng Gao, Zhengkai Jiang, Haoxuan You, Pan Lu, Steven C. H. Hoi, Xiaogang Wang, and Hongsheng Li. Dynamic fusion with intra- and inter-modality attention flow for visual question answering. In CVPR, 2019.
  • James J. Gibson. The theory of affordances. Hilldale, USA, 1(2), 1977.
  • Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
  • Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "Something Something" video database for learning and evaluating visual common sense. In ICCV, 2017.
  • Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017.
  • Ibrahim Abou Halloun and David Hestenes. Common sense concepts about motion. American Journal of Physics, 53(11), 1985.
  • Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • Lisa Anne Hendricks, Kaylee Burns, Kate Saenko, Trevor Darrell, and Anna Rohrbach. Women also snowboard: Overcoming bias in captioning models. In ECCV, 2018.
  • Lun Huang, Wenmin Wang, Jie Chen, and Xiao-Yong Wei. Attention on attention for image captioning. In ICCV, 2019.
  • Diviyan Kalainathan, Olivier Goudet, Isabelle Guyon, David Lopez-Paz, and Michele Sebag. SAM: Structural agnostic model, causal discovery and penalized adversarial learning. arXiv preprint arXiv:1803.04929, 2018.
  • Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. Bilinear attention networks. In NIPS, 2018.
  • Murat Kocaoglu, Christopher Snyder, Alexandros G. Dimakis, and Sriram Vishwanath. CausalGAN: Learning causal implicit generative models with adversarial training. In ICLR, 2018.
  • Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised visual representation learning. In CVPR, 2019.
  • Matej Kristan, Jiri Matas, Ales Leonardis, Michael Felsberg, Luka Cehovin, Gustavo Fernandez, Tomas Vojir, Gustav Hager, Georg Nebehay, and Roman Pflugfelder. The Visual Object Tracking VOT2015 challenge results. In ICCVW, 2015.
  • Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The Open Images Dataset V4. IJCV, 2020.
  • Hankook Lee, Sung Ju Hwang, and Jinwoo Shin. Rethinking data augmentation: Self-supervision and self-distillation. arXiv preprint arXiv:1910.05872, 2019.
  • Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, and Junjie Yan. SiamRPN++: Evolution of Siamese visual tracking with very deep networks. In CVPR, 2019.
  • Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, 2004.
  • Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • Xiao Lin and Devi Parikh. Don't just listen, use your imagination: Leveraging visual common sense for non-visual tasks. In CVPR, 2015.
  • Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
  • Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  • David Lopez-Paz, Robert Nishihara, Soumith Chintala, Bernhard Scholkopf, and Leon Bottou. Discovering causal signals in images. In CVPR, 2017.
  • Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NIPS, 2019.
  • Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. JMLR, 9(Nov), 2008.
  • Tomasz Malisiewicz and Alyosha Efros. Beyond categories: The visual memex model for reasoning about object relationships. In NIPS, 2009.
  • Varun Manjunatha, Nirat Saini, and Larry S. Davis. Explicit bias discovery in visual question answering models. In CVPR, 2019.
  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
  • Suraj Nair, Yuke Zhu, Silvio Savarese, and Li Fei-Fei. Causal induction from visual observations for goal directed tasks. arXiv preprint arXiv:1910.01751, 2019.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL, 2002.
  • Judea Pearl. Interpretation and identification of causal mediation. Psychological Methods, 19(4), 2014.
  • Judea Pearl, Madelyn Glymour, and Nicholas P. Jewell. Causal inference in statistics: A primer. John Wiley & Sons, 2016.
  • Judea Pearl and Dana Mackenzie. The book of why: The new science of cause and effect. Basic Books, 2018.
  • Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In NAACL, 2018.
  • Jiaxin Qi, Yulei Niu, Jianqiang Huang, and Hanwang Zhang. Two causal principles for improving visual dialog. In CVPR, 2020.
  • Sainandan Ramakrishnan, Aishwarya Agrawal, and Stefan Lee. Overcoming language priors in visual question answering with adversarial regularization. In NIPS, 2018.
  • Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. In EMNLP, 2018.
  • Sophia A. Rosenfeld. Common sense. Harvard University Press, 2011.
  • Fereshteh Sadeghi, Santosh K. Divvala, and Ali Farhadi. VisKE: Visual knowledge extraction and question answering by visual verification of relation phrases. In CVPR, 2015.
  • Barry Smith. The structures of the common-sense world. 1995.
  • Zhou Su, Chen Zhu, Yinpeng Dong, Dongqi Cai, Yurong Chen, and Jianguo Li. Learning visual knowledge memory networks for visual question answering. In CVPR, 2018.
  • Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. VideoBERT: A joint model for video and language representation learning. In ICCV, 2019.
  • Hao Tan and Mohit Bansal. LXMERT: Learning cross-modality encoder representations from transformers. In EMNLP-IJCNLP, 2019.
  • Kaihua Tang, Yulei Niu, Jianqiang Huang, Jiaxin Shi, and Hanwang Zhang. Unbiased scene graph generation from biased training. In CVPR, 2020.
  • Lucas Theis and Matthias Bethge. Generative image modeling using spatial LSTMs. In NIPS, 2015.
  • Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. In CVPR, 2015.
  • Ramakrishna Vedantam, Xiao Lin, Tanmay Batra, C. Lawrence Zitnick, and Devi Parikh. Learning common sense through visual abstraction. In ICCV, 2015.
  • Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
  • Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
  • Qi Wu, Peng Wang, Chunhua Shen, Anthony Dick, and Anton van den Hengel. Ask me anything: Free-form visual question answering based on knowledge from external sources. In CVPR, 2016.
  • Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
  • Xu Yang, Kaihua Tang, Hanwang Zhang, and Jianfei Cai. Auto-encoding scene graphs for image captioning. In CVPR, 2019.
  • Xu Yang, Hanwang Zhang, and Jianfei Cai. Learning to collocate neural modules for image captioning. In ICCV, 2019.
  • Xu Yang, Hanwang Zhang, and Jianfei Cai. Deconfounded image captioning: A causal retrospect. arXiv preprint arXiv:2003.03923, 2020.
  • Mark Yatskar, Vicente Ordonez, and Ali Farhadi. Stating the obvious: Extracting visual common sense knowledge. In NAACL, 2016.
  • Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian. Deep modular co-attention networks for visual question answering. In CVPR, 2019.
  • Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning. In CVPR, 2019.
  • Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, and Lucas Beyer. S4L: Self-supervised semi-supervised learning. In ICCV, 2019.
  • Yuke Zhu, Alireza Fathi, and Li Fei-Fei. Reasoning about object affordances in a knowledge base representation. In ECCV, 2014.