Unbiased Scene Graph Generation from Biased Training

CVPR, pp. 3713-3722, 2020.

DOI: https://doi.org/10.1109/CVPR42600.2020.00377
Links: arxiv.org | dblp.uni-trier.de | academic.microsoft.com

Abstract:

Today's scene graph generation (SGG) task is still far from practical, mainly due to the severe training bias, e.g., collapsing diverse human walk on / sit on / lay on beach into human on beach. Given such SGG, the down-stream tasks such as VQA can hardly infer better scene structures than merely a bag of objects. However, debiasing in SGG...

Introduction
  • Scene graph generation (SGG) [64] — a visual detection task of objects and their relationships in an image — seems to have never fulfilled its promise: a comprehensive visual scene representation that supports graph reasoning for high-level tasks such as visual captioning [69, 67] and VQA [56, 14].
  • Yet the methods on which the core efforts are made [71, 55, 6] merely pretend that there is a graph: what they produce is nothing but a sparse object layout with binary links, shrouded in graph neural networks [65] for little more than extra contextual object representations [67, 16, 56]
  • This is partly due to the research gap in graph reasoning [2, 51, 15], but the crux lies in the biased relationship prediction.
  • To perform sensible graph reasoning, the authors need to distinguish more fine-grained relationships from the ostensibly probable but trivial ones, such as replacing near with behind/in front of, and on with parking on/driving on in Figure 1(d)
Highlights
  • Scene graph generation (SGG) [64] — a visual detection task of objects and their relationships in an image — seems to have never fulfilled its promise: a comprehensive visual scene representation that supports graph reasoning for high-level tasks such as visual captioning [69, 67] and VQA [56, 14]
  • A promising but embarrassing finding [71] is that, by only using the statistical prior of the detected object classes in the Visual Genome benchmark [22], we can already achieve 30.1% Recall@100 for Scene Graph Detection, only 1.1-1.5% lower than the state-of-the-art [5, 55, 74], which renders all the much more complex Scene graph generation models almost useless (a minimal sketch of such a frequency-prior baseline follows this list)
  • We propose a novel unbiased Scene graph generation method based on the Total Direct Effect (TDE) analysis framework in causal inference [59, 39, 60]
  • 4) X2Y: since we argued that the unbiased prediction lies in the effect of the object features X, this ablation directly generates the scene graph from the outputs of the X → Y branch after biased training
  • We presented a general framework for unbiased Scene graph generation from biased training, and this is the first work addressing the serious bias issue in Scene graph generation
  • By using the proposed Scene Graph Diagnosis toolkit, our unbiased Scene graph generation results are considerably better than their biased counterparts
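To make the statistical-prior bullet concrete: a frequency baseline of this kind only counts, over the training annotations, how often each predicate occurs for every (subject class, object class) pair, and then returns the most frequent predicates for any detected pair at test time, ignoring the image entirely. The sketch below is a minimal, hypothetical illustration of that idea, not the authors' implementation; all names are made up.

```python
from collections import Counter, defaultdict

class FrequencyPriorBaseline:
    """Predicate prediction from object-class co-occurrence statistics only."""

    def __init__(self):
        # (subject class, object class) -> Counter over predicate labels
        self.pair_counts = defaultdict(Counter)

    def fit(self, triplets):
        """triplets: iterable of (subject_class, predicate, object_class) strings."""
        for subj, pred, obj in triplets:
            self.pair_counts[(subj, obj)][pred] += 1

    def predict(self, subject_class, object_class, top_k=5):
        """Most frequent predicates for this class pair, ignoring the image."""
        counts = self.pair_counts.get((subject_class, object_class))
        return [p for p, _ in counts.most_common(top_k)] if counts else []

# Toy usage: head predicates such as "on" dominate the ranking by construction.
prior = FrequencyPriorBaseline()
prior.fit([("man", "on", "beach"), ("man", "on", "beach"), ("man", "walking on", "beach")])
print(prior.predict("man", "beach"))  # ['on', 'walking on']
```

Because such a prior already recovers most head predicates, conventional Recall@K, which is dominated by those head classes, rewards it almost as much as full SGG models; this is exactly why the paper turns to mean Recall@K and to unbiased inference.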
Methods
  • The authors used a ResNeXt-101-FPN backbone and scaled the longer side of input images to be 1k pixels.
  • On top of the frozen detector, the authors trained SGG models with SGD as the optimizer.
  • Batch size and initial learning rate were set to 12 and 12 × 10⁻² for PredCls and SGCls, and to 8 and 8 × 10⁻² for SGDet; the learning rate was decayed by a factor of 10, twice in total, after the validation performance plateaued (see the optimizer and schedule sketch after this list).
  • Different from previous works [71, 55, 5], the authors didn’t assume that non-overlapping subject-object pairs are invalid in SGDet, making SGG more general
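For readers who want the schedule above in code form, the following PyTorch-style sketch mirrors the quoted numbers (batch size 12 and learning rate 12 × 10⁻² for PredCls/SGCls, 8 and 8 × 10⁻² for SGDet, decay by a factor of 10 on validation plateau, applied twice). The model, the momentum value, and the plateau metric are placeholder assumptions, not the authors' released configuration.

```python
import torch

task = "PredCls"  # or "SGCls", "SGDet"
batch_size, base_lr = (12, 12e-2) if task in ("PredCls", "SGCls") else (8, 8e-2)

# Placeholder relationship head; in the paper the detector backbone stays frozen
# and only the SGG model on top of it is trained.
model = torch.nn.Linear(4096, 51)

optimizer = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)  # momentum assumed
# Decay the learning rate by 10x when the validation metric stops improving;
# per the text this decay is applied two times in total.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max",
                                                       factor=0.1, patience=2)

for epoch in range(50):
    # ... one training epoch with batches of size `batch_size` ...
    val_metric = 0.0  # placeholder: e.g., mean Recall on the validation split
    scheduler.step(val_metric)
```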
Results
  • A promising but embarrassing finding [71] is that, by only using the statistical prior of the detected object classes in the Visual Genome benchmark [22], the authors can already achieve 30.1% Recall@100 for Scene Graph Detection, only 1.1-1.5% lower than the state-of-the-art [5, 55, 74], which renders all the much more complex SGG models almost useless.
Conclusion
  • The authors presented a general framework for unbiased SGG from biased training, and this is the first work addressing the serious bias issue in SGG.
  • The authors achieved this unbiasedness by computing the Total Direct Effect (TDE) with the help of a causal graph, which serves as a roadmap for training any SGG model (a counterfactual-subtraction sketch of TDE inference follows this list).
  • By using the proposed Scene Graph Diagnosis toolkit, the unbiased SGG results are considerably better than their biased counterparts.
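Mechanically, the TDE-based inference referred to above is a counterfactual subtraction performed with the already-trained (biased) model: score the relationship once with the real pair of object features and once with those features wiped out (for instance replaced by a mean or dummy embedding) while everything else is held fixed, then rank predicates by the difference of the two logit vectors. The sketch below only illustrates that subtraction; the classifier, the wiping strategy, and the feature shapes are assumptions, not the authors' exact implementation.

```python
import torch

def tde_logits(relation_head, pair_features, context_features, wiped_features):
    """Counterfactual subtraction in the spirit of Total Direct Effect (TDE).

    relation_head:    a trained, biased predicate classifier (callable)
    pair_features:    visual features of the subject-object pair (the cause X)
    context_features: whatever the causal graph keeps fixed across both passes
    wiped_features:   a counterfactual stand-in for pair_features (e.g., a mean feature)
    """
    factual = relation_head(pair_features, context_features)          # prediction with X observed
    counterfactual = relation_head(wiped_features, context_features)  # prediction with X wiped out
    return factual - counterfactual                                   # TDE = factual - counterfactual

# Toy usage with a random linear head standing in for a trained SGG relationship classifier.
weight = torch.randn(51, 8)  # 51 predicate classes, 8-dim concatenated input (toy sizes)
head = lambda x, c: torch.nn.functional.linear(torch.cat([x, c], dim=-1), weight)
x, c = torch.randn(4), torch.randn(4)
unbiased_scores = tde_logits(head, x, c, torch.zeros_like(x))
predicted_predicate = unbiased_scores.argmax()  # rank predicates by TDE, not by raw logits
```

Intuitively, the counterfactual pass captures what the model would predict from context and label statistics alone, so subtracting it keeps the biased training signal for learning while discarding it at inference.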
Tables
  • Table 1: The SGG performances of Relationship Retrieval on mean Recall@K [55, 6]. The SGG models re-implemented under our codebase are denoted by the superscript † (see the mean Recall sketch after this list)
  • Table 2: The results of Zero-Shot Relationship Retrieval
  • Table 3: The results of Sentence-to-Graph Retrieval
  • Table 4: The details of Visual Context Module
  • Table 5: The details of Bilinear Attention Scene Graph Encoding
  • Table 6: The SGG performances of Relationship Retrieval on both conventional Recall@K and mean Recall@K [55, 6]. The SGG models re-implemented under our codebase are denoted by the superscript †
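Since Tables 1 and 6 report mean Recall@K, here is a minimal sketch of how that metric differs from conventional Recall@K: recall is computed separately for each predicate class and then averaged, so rare predicates such as "parking on" weigh as much as head predicates such as "on". The per-predicate counts are assumed to be available already; the function name is illustrative.

```python
def mean_recall_at_k(hits_per_predicate, gt_per_predicate):
    """Mean Recall@K: average of per-predicate recalls.

    hits_per_predicate: {predicate: ground-truth triplets recovered in the top-K predictions}
    gt_per_predicate:   {predicate: total ground-truth triplets of that predicate}
    """
    recalls = [hits_per_predicate.get(p, 0) / n
               for p, n in gt_per_predicate.items() if n > 0]
    return sum(recalls) / len(recalls) if recalls else 0.0

# Head classes no longer dominate: each predicate contributes exactly one recall value.
print(mean_recall_at_k({"on": 90, "parking on": 1}, {"on": 100, "parking on": 10}))  # 0.5
```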
Related work
  • Scene Graph Generation. SGG [64, 71] has received increasing attention in the computer vision community, due to the potential revolution it could bring to down-stream visual reasoning tasks [51, 67, 21, 16]. Most of the existing methods [64, 62, 7, 25, 70, 55, 66, 10, 43, 61] struggle for better feature extraction networks. Zellers et al. [71] first brought the bias problem of SGG to attention, and follow-up works [55, 6] proposed the unbiased metric (mean Recall); yet their approaches are still restricted to the feature extraction networks, leaving the biased SGG problem unsolved. The most related work [27] simply prunes the dominant and easy-to-predict relationships from the training set.
  • Unbiased Training. The bias problem has long been investigated in machine learning [57]. Existing debiasing methods can be roughly categorized into three types: 1) data …
Funding
  • This work was partially supported by the NTU-Alibaba JRI
References
  • A. Agrawal, D. Batra, D. Parikh, and A. Kembhavi. Don’t just assume; look and answer: Overcoming priors for visual question answering. In CVPR, 2018.
  • P. W. Battaglia, J. B. Hamrick, V. Bapst, A. SanchezGonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
  • E. Burnaev, P. Erofeev, and A. Papanov. Influence of resampling on accuracy of imbalanced classification. In ICMV, 2015.
  • R. Cadene, C. Dancette, H. Ben-younes, M. Cord, and D. Parikh. Rubi: Reducing unimodal biases in visual question answering. arXiv preprint arXiv:1906.10169, 2019.
  • L. Chen, H. Zhang, J. Xiao, X. He, S. Pu, and S.-F. Chang. Counterfactual critic multi-agent training for scene graph generation. In ICCV, 2019.
  • T. Chen, W. Yu, R. Chen, and L. Lin. Knowledge-embedded routing network for scene graph generation. In CVPR, 2019.
  • B. Dai, Y. Zhang, and D. Lin. Detecting visual relationships with deep relational networks. In CVPR, 2017.
  • G. Dunn, R. Emsley, H. Liu, S. Landau, J. Green, I. White, and A. Pickles. Evaluation and validation of social and psychological markers in randomised trials of complex interventions in mental health: a methodological research programme. 2015.
  • R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. Imagenet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019.
  • J. Gu, H. Zhao, Z. Lin, S. Li, J. Cai, and M. Ling. Scene graph generation with external knowledge and image reconstruction. In CVPR, 2019.
  • H. He and E. A. Garcia. Learning from imbalanced data. IEEE Transactions on knowledge and data engineering, 2009.
  • K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask r-cnn. In ICCV, 2017.
  • L. A. Hendricks, K. Burns, K. Saenko, T. Darrell, and A. Rohrbach. Women also snowboard: Overcoming bias in captioning models. In ECCV. Springer, 2018.
  • D. A. Hudson and C. D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019.
  • D. A. Hudson and C. D. Manning. Learning by abstraction: The neural state machine. NeurIPS, 2019.
  • J. Johnson, A. Gupta, and L. Fei-Fei. Image generation from scene graphs. In CVPR, 2018.
  • J. Johnson, R. Krishna, M. Stark, L.-J. Li, D. Shamma, M. Bernstein, and L. Fei-Fei. Image retrieval using scene graphs. In CVPR, 2015.
  • L. Keele. The statistics of causal inference: A view from political methodology. Political Analysis, 2015.
  • J.-H. Kim, J. Jun, and B.-T. Zhang. Bilinear attention networks. In Advances in Neural Information Processing Systems, 2018.
  • B. G. King. A political mediation model of corporate response to social movement activism. Administrative Science Quarterly, 2008.
  • R. Krishna, I. Chami, M. Bernstein, and L. Fei-Fei. Referring relationships. In CVPR, 2018.
  • R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 2017.
  • M. J. Kusner, J. Loftus, C. Russell, and R. Silva. Counterfactual fairness. In Advances in Neural Information Processing Systems, 2017.
  • Y. Li, Y. Li, and N. Vasconcelos. Resound: Towards action recognition without representation bias. In ECCV, 2018.
  • Y. Li, W. Ouyang, B. Zhou, K. Wang, and X. Wang. Scene graph generation from objects, phrases and caption regions. In ICCV, 2017.
  • Y. Li and N. Vasconcelos. Repair: Removing representation bias by dataset resampling. In CVPR, 2019.
  • Y. Liang, Y. Bai, W. Zhang, X. Qian, L. Zhu, and T. Mei. Vrr-vg: Refocusing visually-relevant relationships. In ICCV, pages 10403–10412, 2019.
  • T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
  • T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar. Focal loss for dense object detection. In ICCV, 2017.
  • T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV. Springer, 2014.
  • C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei. Visual relationship detection with language priors. In ECCV, 2016.
  • D. P. MacKinnon, A. J. Fairchild, and M. S. Fritz. Mediation analysis. Annu. Rev. Psychol., 2007.
  • V. Manjunatha, N. Saini, and L. S. Davis. Explicit bias discovery in visual question answering models. In CVPR, 2019.
  • F. Massa and R. Girshick. maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch, 2018.
  • I. Misra, C. Lawrence Zitnick, M. Mitchell, and R. Girshick. Seeing through the human reporting bias: Visual classifiers from noisy human-centric labels. In CVPR, 2016.
  • S. Nair, Y. Zhu, S. Savarese, and L. Fei-Fei. Causal induction from visual observations for goal directed tasks. arXiv preprint arXiv:1910.01751, 2019.
  • Y. Niu, K. Tang, H. Zhang, Z. Lu, X. Hua, and J.-R. Wen. Counterfactual vqa: A cause-effect look at language bias. arXiv, 2020.
  • J. Pearl. Causality: models, reasoning and inference. Springer, 2000.
  • J. Pearl. Direct and indirect effects. In Proceedings of the 17th conference on uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., 2001.
  • J. Pearl, M. Glymour, and N. P. Jewell. Causal inference in statistics: A primer. John Wiley & Sons, 2016.
  • J. Pearl and D. Mackenzie. The Book of Why: The New Science of Cause and Effect. Basic Books, 2018.
  • J. Qi, Y. Niu, J. Huang, and H. Zhang. Two causal principles for improving visual dialog. arXiv preprint arXiv:1911.10496, 2019.
  • M. Qi, W. Li, Z. Yang, Y. Wang, and J. Luo. Attentive relational networks for mapping images to scene graphs. In CVPR, 2019.
  • S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, 2015.
  • L. Richiardi, R. Bellocco, and D. Zugna. Mediation analysis in epidemiology: methods, interpretation and bias. International journal of epidemiology, 2013.
  • J. M. Robins and S. Greenland. Identifiability and exchangeability for direct and indirect effects. Epidemiology, 1992.
  • N. J. Roese. Counterfactual thinking. Psychological bulletin, 1997.
  • A. Rosenfeld and M. Thurston. Edge and curve detection for visual scene analysis. IEEE Transactions on computers, 1971.
  • F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, 2015.
  • S. Schuster, R. Krishna, A. Chang, L. Fei-Fei, and C. D. Manning. Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In Proceedings of the fourth workshop on vision and language, 2015.
  • J. Shi, H. Zhang, and J. Li. Explainable and explicit visual reasoning over scene graphs. In CVPR, 2019.
  • H. A. Simon. Bounded rationality. In Utility and probability. Springer, 1990.
  • K. S. Tai, R. Socher, and C. D. Manning. Improved semantic representations from tree-structured long short-term memory networks. In ACL, 2015.
  • T. Wang, J. Huang, H. Zhang, and Q. Sun. Visual commonsense r-cnn. In CVPR, 2020.
  • K. Tang, H. Zhang, B. Wu, W. Luo, and W. Liu. Learning to compose dynamic tree structures for visual contexts. In CVPR, 2019.
  • D. Teney, L. Liu, and A. van den Hengel. Graph-structured representations for visual question answering. In CVPR, 2017.
  • A. Torralba, A. A. Efros, et al. Unbiased look at dataset bias. In CVPR, 2011.
  • N. Van Hoeck, P. D. Watson, and A. K. Barbey. Cognitive neuroscience of human counterfactual reasoning. Frontiers in human neuroscience, 2015.
  • T. VanderWeele. Explanation in causal inference: methods for mediation and interaction. Oxford University Press, 2015.
  • T. J. VanderWeele. A three-way decomposition of a total effect into direct, indirect, and interactive effects. Epidemiology (Cambridge, Mass.), 2013.
  • W. Wang, R. Wang, S. Shan, and X. Chen. Exploring context and visual pattern of relationship for scene graph generation. In CVPR, 2019.
  • S. Woo, D. Kim, D. Cho, and I. S. Kweon. Linknet: Relational embedding for scene graph. In Advances in Neural Information Processing Systems, 2018.
  • S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
  • D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei. Scene graph generation by iterative message passing. In CVPR, 2017.
  • S. Yan, Y. Xiong, and D. Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI, 2018.
  • J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh. Graph r-cnn for scene graph generation. In ECCV, 2018.
  • X. Yang, K. Tang, H. Zhang, and J. Cai. Auto-encoding scene graphs for image captioning. In CVPR, 2019.
  • X. Yang, H. Zhang, and J. Cai. Deconfounded image captioning: A causal retrospect, 2020.
  • T. Yao, Y. Pan, Y. Li, and T. Mei. Exploring visual relationship for image captioning. In ECCV, 2018.
  • G. Yin, L. Sheng, B. Liu, N. Yu, X. Wang, J. Shao, and C. Change Loy. Zoom-net: Mining deep feature interactions for visual relationship recognition. In ECCV, 2018.
  • R. Zellers, M. Yatskar, S. Thomson, and Y. Choi. Neural motifs: Scene graph parsing with global context. In CVPR, 2018.
  • R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork. Learning fair representations. In ICML, 2013.
  • H. Zhang, Z. Kyaw, S.-F. Chang, and T.-S. Chua. Visual translation embedding network for visual relation detection. In CVPR, 2017.
  • J. Zhang, K. J. Shih, A. Elgammal, A. Tao, and B. Catanzaro. Graphical contrastive losses for scene graph parsing. In CVPR, 2019.