Iterative Learning with Extra and Inner Knowledge for Long-tail Dynamic Scene Graph Generation

MM '23: Proceedings of the 31st ACM International Conference on Multimedia (2023)

Dynamic scene graphs have become a powerful tool for higher-level visual understanding tasks, and interest in dynamic scene graph generation (dynamic SGG) has grown over time. Recently, a number of methods have achieved significant progress in dynamic SGG by capturing temporal information with transformer or recurrent network structures. However, most existing methods focus only on predicting the head predicates and ignore the long-tail phenomenon, so the tail predicates are hard to recognize. In this paper, we propose a novel method named Iterative Learning with Extra and Inner Knowledge (I2LEK) to address the long-tail problem in dynamic SGG. The extra knowledge is obtained from commonsense, while the inner knowledge is defined as the temporal evolution patterns of visual relationships. Specifically, we introduce extra knowledge to enrich the representations of predicates in the spatial dimension and adopt inner knowledge to implement knowledge sharing in the temporal dimension. With enriched representations and shared knowledge, I2LEK can accurately predict both the tail and head predicates. Moreover, we propose an iterative learning strategy that fuses the extra knowledge, the inner knowledge, and the spatial-temporal context contained in videos, which further enhances the model's understanding of visual relationships. Experimental results on the public Action Genome dataset demonstrate that our model achieves state-of-the-art performance.