Unbiased Scene Graph Generation from Biased Training
CVPR, pp. 3713-3722, 2020.
Abstract:
Today's scene graph generation (SGG) task is still far from practical, mainly due to the severe training bias, e.g., collapsing diverse human walk on / sit on / lay on beach into human on beach. Given such SGG, downstream tasks such as VQA can hardly infer better scene structures than merely a bag of objects. However, debiasing in SGG...
Introduction
- Scene graph generation (SGG) [64] — a visual detection task of objects and their relationships in an image — seems to have never fulfilled its promise: a comprehensive visual scene representation that supports graph reasoning for high-level tasks such as visual captioning [69, 67] and VQA [56, 14].
- Yet most methods, on which the core efforts are made [71, 55, 6], pretend that there is a graph, which is in fact nothing but a sparse object layout with binary links, and shroud it in graph neural networks [65] for merely more contextual object representations [67, 16, 56]
- This is only partly due to the research gap in graph reasoning [2, 51, 15]; the crux lies in the biased relationship prediction
- To perform sensible graph reasoning, the authors need to distinguish more fine-grained relationships from the ostensibly probable but trivial ones, such as replacing near with behind/in front of, and on with parking on/driving on in Figure 1(d)
Highlights
- Scene graph generation (SGG) [64] — a visual detection task of objects and their relationships in an image — seems to have never fulfilled its promise: a comprehensive visual scene representation that supports graph reasoning for high-level tasks such as visual captioning [69, 67] and VQA [56, 14]
- A promising but embarrassing finding [71] is that, by only using the statistical prior of the detected object classes in the Visual Genome benchmark [22], we can already achieve 30.1% Recall@100 for Scene Graph Detection, only 1.1-1.5% lower than the state of the art [5, 55, 74], rendering the much more complex SGG models almost useless
- We propose a novel unbiased SGG method based on the Total Direct Effect (TDE) analysis framework from causal inference [59, 39, 60]; a minimal sketch of the TDE computation follows this list
- 4) X2Y: since the unbiased effect is argued to come from the effect of the object features X, this baseline directly generates scene graphs from the outputs of the X → Y branch after biased training
- We presented a general framework for unbiased SGG from biased training; this is the first work addressing the serious bias issue in SGG
- Under the proposed Scene Graph Diagnosis toolkit, our unbiased SGG results are considerably better than their biased counterparts
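The TDE idea above can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: `rel_head`, `obj_ctx`, and `mean_feat` are hypothetical names, and tensor shapes are assumed. The paper's core move is to run the trained relationship classifier twice, once with the factual pairwise visual feature X and once with a counterfactual stand-in (e.g., the training-set mean feature), holding the object context fixed, and to rank predicates by the difference of the two outputs.

```python
import torch

def tde_scores(rel_head, pair_feats, obj_ctx, mean_feat):
    """Total Direct Effect for predicate ranking (illustrative sketch).

    rel_head:   trained relationship classifier f(X, Z) -> predicate logits
    pair_feats: factual pairwise visual features X, shape [num_pairs, d]
    obj_ctx:    object context / label information Z, held fixed in both passes
    mean_feat:  counterfactual stand-in for X (e.g., training-set mean), shape [d]
    """
    y_factual = rel_head(pair_feats, obj_ctx)             # prediction with the real X
    x_bar = mean_feat.unsqueeze(0).expand_as(pair_feats)  # "wipe out" the visual content
    y_counterfactual = rel_head(x_bar, obj_ctx)           # prediction with X intervened
    return y_factual - y_counterfactual                   # TDE scores used for ranking
```

The subtraction removes the context-only (biased) part of the prediction, which is why the same trained model yields a less biased ranking at inference time.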
Methods
- The authors used a Faster R-CNN detector with a ResNeXt-101-FPN backbone and scaled the longer side of input images to 1k pixels.
- On top of the frozen detector, the authors trained SGG models using SGD as the optimizer.
- Batch size and initial learning rate were set to 12 and 12 × 10⁻² for PredCls and SGCls, and to 8 and 8 × 10⁻² for SGDet. The learning rate was decayed by a factor of 10 twice after the validation performance plateaued (a sketch of this schedule follows the list).
- Different from previous works [71, 55, 5], the authors did not assume that non-overlapping subject-object pairs are invalid in SGDet, making SGG more general
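A minimal sketch of the reported optimization schedule, assuming a PyTorch-style setup; the stand-in model, the momentum value, the plateau patience, and the `train_one_epoch`/`validate` helpers are all hypothetical, not taken from the authors' code.

```python
import torch

def train_one_epoch(model, optimizer):
    ...  # hypothetical training step over the SGG training set

def validate(model):
    return 0.0  # hypothetical validation metric, e.g. mean Recall@100

# Hypothetical stand-in for a relation head; 51 = 50 VG predicates + background.
model = torch.nn.Linear(4096, 51)

# PredCls/SGCls setting from the text: batch size 12, initial lr 12e-2;
# momentum and patience are assumptions, not reported values.
optimizer = torch.optim.SGD(model.parameters(), lr=12e-2, momentum=0.9)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=2)

for epoch in range(20):
    train_one_epoch(model, optimizer)
    scheduler.step(validate(model))  # decays lr by 10x when the metric plateaus
```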
Results
- A promising but embarrassing finding [71] is that, by only using the statistical prior of the detected object classes in the Visual Genome benchmark [22], the authors can already achieve 30.1% Recall@100 for Scene Graph Detection, only 1.1-1.5% lower than the state of the art [5, 55, 74], rendering the much more complex SGG models almost useless.
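The class-statistics baseline quoted above can be reproduced in spirit with a few lines. This is a hedged sketch of the idea only, with `training_triplets` an assumed iterable of (subject class, predicate class, object class) ids, not the benchmark's actual code: predicates are ranked purely by how often they link each class pair in the training set, with no visual input at all.

```python
from collections import Counter, defaultdict

def build_frequency_prior(training_triplets):
    """Tally predicates for every (subject class, object class) pair."""
    pair_counts = defaultdict(Counter)
    for subj_cls, pred_cls, obj_cls in training_triplets:
        pair_counts[(subj_cls, obj_cls)][pred_cls] += 1
    return pair_counts

def rank_predicates(pair_counts, subj_cls, obj_cls, k=100):
    """Rank candidate predicates for a detected pair by prior frequency alone."""
    return [pred for pred, _ in pair_counts[(subj_cls, obj_cls)].most_common(k)]
```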
Conclusion
- The authors presented a general framework for unbiased SGG from biased training; this is the first work addressing the serious bias issue in SGG.
- The authors achieved unbiasedness by calculating the Total Direct Effect (TDE) with the help of a causal graph, which serves as a roadmap for training any SGG model.
- Under the proposed Scene Graph Diagnosis toolkit, the unbiased SGG results are considerably better than their biased counterparts.
Tables
- Table 1: The SGG performances of Relationship Retrieval on mean Recall@K [55, 6]. The SGG models re-implemented under our codebase are denoted by the superscript †
- Table 2: The results of Zero-Shot Relationship Retrieval
- Table 3: The results of Sentence-to-Graph Retrieval
- Table 4: The details of the Visual Context Module
- Table 5: The details of Bilinear Attention Scene Graph Encoding
- Table 6: The SGG performances of Relationship Retrieval on both conventional Recall@K and mean Recall@K [55, 6]. The SGG models re-implemented under our codebase are denoted by the superscript †
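For reference, the mean Recall@K (mR@K) used in Tables 1 and 6 averages the per-predicate Recall@K over all predicate classes, so head predicates such as on no longer dominate the score. A minimal sketch, with `hits_at_k` and `gt_count` as assumed per-predicate tallies rather than the benchmark's actual data structures:

```python
def mean_recall_at_k(hits_at_k, gt_count):
    """mR@K: average of per-predicate Recall@K, so rare predicates count equally.

    hits_at_k: {predicate id: #ground-truth triplets recovered in the top-K}
    gt_count:  {predicate id: total #ground-truth triplets of that predicate}
    """
    recalls = [hits_at_k.get(p, 0) / n for p, n in gt_count.items() if n > 0]
    return sum(recalls) / len(recalls)
```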
Related work
- Scene Graph Generation. SGG [64, 71] has received increasing attention in the computer vision community, due to the potential revolution it could bring to downstream visual reasoning tasks [51, 67, 21, 16]. Most existing methods [64, 62, 7, 25, 70, 55, 66, 10, 43, 61] strive for better feature extraction networks. Zellers et al. [71] first brought the bias problem of SGG to attention, and follow-up works [55, 6] proposed the unbiased metric (mean Recall); yet their approaches are still restricted to the feature extraction networks, leaving the biased SGG problem unsolved. The most related work [27] simply prunes the dominant and easy-to-predict relationships from the training set.
- Unbiased Training. The bias problem has long been investigated in machine learning [57]. Existing debiasing methods can be roughly categorized into three types: 1) data...
Funding
- This work was partially supported by the NTU-Alibaba JRI
References
- A. Agrawal, D. Batra, D. Parikh, and A. Kembhavi. Don’t just assume; look and answer: Overcoming priors for visual question answering. In CVPR, 2018.
- P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
- E. Burnaev, P. Erofeev, and A. Papanov. Influence of resampling on accuracy of imbalanced classification. In ICMV, 2015.
- R. Cadene, C. Dancette, H. Ben-younes, M. Cord, and D. Parikh. Rubi: Reducing unimodal biases in visual question answering. arXiv preprint arXiv:1906.10169, 2019.
- L. Chen, H. Zhang, J. Xiao, X. He, S. Pu, and S.-F. Chang. Counterfactual critic multi-agent training for scene graph generation. In ICCV, 2019.
- T. Chen, W. Yu, R. Chen, and L. Lin. Knowledge-embedded routing network for scene graph generation. In CVPR, 2019.
- B. Dai, Y. Zhang, and D. Lin. Detecting visual relationships with deep relational networks. In CVPR, 2017.
- G. Dunn, R. Emsley, H. Liu, S. Landau, J. Green, I. White, and A. Pickles. Evaluation and validation of social and psychological markers in randomised trials of complex interventions in mental health: a methodological research programme. 2015.
- R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. Imagenet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In ICLR, 2019.
- J. Gu, H. Zhao, Z. Lin, S. Li, J. Cai, and M. Ling. Scene graph generation with external knowledge and image reconstruction. In CVPR, 2019.
- H. He and E. A. Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 2009.
- K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask r-cnn. In ICCV, 2017.
- L. A. Hendricks, K. Burns, K. Saenko, T. Darrell, and A. Rohrbach. Women also snowboard: Overcoming bias in captioning models. In ECCV. Springer, 2018.
- D. A. Hudson and C. D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019.
- D. A. Hudson and C. D. Manning. Learning by abstraction: The neural state machine. NeurIPS, 2019.
- J. Johnson, A. Gupta, and L. Fei-Fei. Image generation from scene graphs. In CVPR, 2018.
- J. Johnson, R. Krishna, M. Stark, L.-J. Li, D. Shamma, M. Bernstein, and L. Fei-Fei. Image retrieval using scene graphs. In CVPR, 2015.
- L. Keele. The statistics of causal inference: A view from political methodology. Political Analysis, 2015.
- J.-H. Kim, J. Jun, and B.-T. Zhang. Bilinear attention networks. In NeurIPS, 2018.
- B. G. King. A political mediation model of corporate response to social movement activism. Administrative Science Quarterly, 2008.
- R. Krishna, I. Chami, M. Bernstein, and L. Fei-Fei. Referring relationships. In CVPR, 2018.
- R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 2017.
- M. J. Kusner, J. Loftus, C. Russell, and R. Silva. Counterfactual fairness. In NeurIPS, 2017.
- Y. Li, Y. Li, and N. Vasconcelos. Resound: Towards action recognition without representation bias. In ECCV, 2018.
- Y. Li, W. Ouyang, B. Zhou, K. Wang, and X. Wang. Scene graph generation from objects, phrases and caption regions. In ICCV, 2017.
- Y. Li and N. Vasconcelos. Repair: Removing representation bias by dataset resampling. In CVPR, 2019.
- Y. Liang, Y. Bai, W. Zhang, X. Qian, L. Zhu, and T. Mei. Vrr-vg: Refocusing visually-relevant relationships. In ICCV, 2019.
- T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
- T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar. Focal loss for dense object detection. In ICCV, 2017.
- T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV. Springer, 2014.
- C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei. Visual relationship detection with language priors. In ECCV, 2016.
- D. P. MacKinnon, A. J. Fairchild, and M. S. Fritz. Mediation analysis. Annual Review of Psychology, 2007.
- V. Manjunatha, N. Saini, and L. S. Davis. Explicit bias discovery in visual question answering models. In CVPR, 2019.
- F. Massa and R. Girshick. maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch, 2018.
- I. Misra, C. Lawrence Zitnick, M. Mitchell, and R. Girshick. Seeing through the human reporting bias: Visual classifiers from noisy human-centric labels. In CVPR, 2016.
- S. Nair, Y. Zhu, S. Savarese, and L. Fei-Fei. Causal induction from visual observations for goal directed tasks. arXiv preprint arXiv:1910.01751, 2019.
- Y. Niu, K. Tang, H. Zhang, Z. Lu, X. Hua, and J.-R. Wen. Counterfactual vqa: A cause-effect look at language bias. arXiv, 2020.
- J. Pearl. Causality: models, reasoning and inference. Springer, 2000.
- J. Pearl. Direct and indirect effects. In UAI, 2001.
- J. Pearl, M. Glymour, and N. P. Jewell. Causal inference in statistics: A primer. John Wiley & Sons, 2016.
- J. Pearl and D. Mackenzie. The Book of Why: The New Science of Cause and Effect. Basic Books, 2018.
- J. Qi, Y. Niu, J. Huang, and H. Zhang. Two causal principles for improving visual dialog. arXiv preprint arXiv:1911.10496, 2019.
- M. Qi, W. Li, Z. Yang, Y. Wang, and J. Luo. Attentive relational networks for mapping images to scene graphs. In CVPR, 2019.
- S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
- L. Richiardi, R. Bellocco, and D. Zugna. Mediation analysis in epidemiology: methods, interpretation and bias. International Journal of Epidemiology, 2013.
- J. M. Robins and S. Greenland. Identifiability and exchangeability for direct and indirect effects. Epidemiology, 1992.
- N. J. Roese. Counterfactual thinking. Psychological Bulletin, 1997.
- A. Rosenfeld and M. Thurston. Edge and curve detection for visual scene analysis. IEEE Transactions on Computers, 1971.
- F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, 2015.
- S. Schuster, R. Krishna, A. Chang, L. Fei-Fei, and C. D. Manning. Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In Proceedings of the fourth workshop on vision and language, 2015.
- J. Shi, H. Zhang, and J. Li. Explainable and explicit visual reasoning over scene graphs. In CVPR, 2019.
- H. A. Simon. Bounded rationality. In Utility and probability. Springer, 1990.
- K. S. Tai, R. Socher, and C. D. Manning. Improved semantic representations from tree-structured long short-term memory networks. In ACL, 2015.
- T. Wang, J. Huang, H. Zhang, and Q. Sun. Visual commonsense r-cnn. In CVPR, 2020.
- K. Tang, H. Zhang, B. Wu, W. Luo, and W. Liu. Learning to compose dynamic tree structures for visual contexts. In CVPR, 2019.
- D. Teney, L. Liu, and A. van den Hengel. Graph-structured representations for visual question answering. In CVPR, 2017.
- A. Torralba, A. A. Efros, et al. Unbiased look at dataset bias. In CVPR, 2011.
- N. Van Hoeck, P. D. Watson, and A. K. Barbey. Cognitive neuroscience of human counterfactual reasoning. Frontiers in Human Neuroscience, 2015.
- T. VanderWeele. Explanation in causal inference: methods for mediation and interaction. Oxford University Press, 2015.
- T. J. VanderWeele. A three-way decomposition of a total effect into direct, indirect, and interactive effects. Epidemiology (Cambridge, Mass.), 2013.
- W. Wang, R. Wang, S. Shan, and X. Chen. Exploring context and visual pattern of relationship for scene graph generation. In CVPR, 2019.
- S. Woo, D. Kim, D. Cho, and I. S. Kweon. Linknet: Relational embedding for scene graph. In NeurIPS, 2018.
- S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
- D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei. Scene graph generation by iterative message passing. In CVPR, 2017.
- S. Yan, Y. Xiong, and D. Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI, 2018.
- J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh. Graph r-cnn for scene graph generation. In ECCV, 2018.
- X. Yang, K. Tang, H. Zhang, and J. Cai. Auto-encoding scene graphs for image captioning. In CVPR, 2019.
- X. Yang, H. Zhang, and J. Cai. Deconfounded image captioning: A causal retrospect, 2020.
- T. Yao, Y. Pan, Y. Li, and T. Mei. Exploring visual relationship for image captioning. In ECCV, 2018.
- G. Yin, L. Sheng, B. Liu, N. Yu, X. Wang, J. Shao, and C. Change Loy. Zoom-net: Mining deep feature interactions for visual relationship recognition. In ECCV, 2018.
- R. Zellers, M. Yatskar, S. Thomson, and Y. Choi. Neural motifs: Scene graph parsing with global context. In CVPR, 2018.
- R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork. Learning fair representations. In ICML, 2013.
- H. Zhang, Z. Kyaw, S.-F. Chang, and T.-S. Chua. Visual translation embedding network for visual relation detection. In CVPR, 2017.
- J. Zhang, K. J. Shih, A. Elgammal, A. Tao, and B. Catanzaro. Graphical contrastive losses for scene graph parsing. In CVPR, 2019.