Multi-Modal Dynamic Graph Transformer for Visual Grounding
IEEE Conference on Computer Vision and Pattern Recognition (2022)
Abstract
Visual grounding (VG) aims to align the correct regions of an image with a natural language query about that image. We find that existing VG methods are constrained by a single-stage grounding process that performs a one-shot evaluate-and-rank over meticulously prepared candidate regions. Their performance depends on the density and quality of those candidates and is capped by the inability to continuously optimize the located regions. To address these issues, we propose to remodel VG as a progressively optimized visual-semantic alignment process. Our multi-modal dynamic graph transformer (M-DGT) achieves this by building on a dynamic graph structure with regions as nodes and their semantic relations as edges. Starting from a few randomly initialized regions, M-DGT makes successive adjustments (i.e., 2D spatial transformations and deletions) to the nodes and edges of the graph based on multi-modal information and the graph features, thereby efficiently shrinking the graph to approach the ground-truth regions. Experiments show that, with an average of 48 boxes as initialization, M-DGT outperforms existing state-of-the-art methods on the Flickr30k Entities and RefCOCO datasets by a substantial margin, in terms of both accuracy and Intersection over Union (IoU) scores. Furthermore, using M-DGT to optimize the regions predicted by existing methods further improves their performance significantly. The source code is available at https://github.com/iQua/M-DGT.
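To make the progressive grounding idea concrete, below is a minimal, self-contained Python sketch of the shrink-and-prune loop the abstract describes. It is an illustration only, not the authors' implementation: the names (init_boxes, refine_step, score_fn, transform_fn, keep_thresh) are hypothetical, and the IoU-against-a-known-target score and drift-toward-target transform stand in for the learned multi-modal matching and predicted 2D transformations; the graph edges and the transformer itself are omitted.

    import numpy as np

    def init_boxes(num_boxes, img_w, img_h, rng):
        # Randomly initialize candidate regions as (x, y, w, h) boxes.
        xy = rng.uniform([0.0, 0.0], [img_w * 0.8, img_h * 0.8], size=(num_boxes, 2))
        wh = rng.uniform([img_w * 0.1, img_h * 0.1], [img_w * 0.5, img_h * 0.5], size=(num_boxes, 2))
        return np.concatenate([xy, wh], axis=1)

    def iou(box, target):
        # Intersection over Union of two (x, y, w, h) boxes.
        x1, y1 = max(box[0], target[0]), max(box[1], target[1])
        x2 = min(box[0] + box[2], target[0] + target[2])
        y2 = min(box[1] + box[3], target[1] + target[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        union = box[2] * box[3] + target[2] * target[3] - inter
        return inter / union if union > 0 else 0.0

    def refine_step(boxes, score_fn, transform_fn, keep_thresh):
        # One iteration: apply a 2D spatial transformation to every node,
        # then delete nodes whose (stand-in) matching score is too low.
        boxes = boxes + transform_fn(boxes)
        scores = score_fn(boxes)
        return boxes[scores > keep_thresh]

    rng = np.random.default_rng(0)
    target = np.array([100.0, 80.0, 120.0, 90.0])    # hypothetical ground-truth box
    boxes = init_boxes(48, 640, 480, rng)            # 48 boxes, as in the paper's setup

    score_fn = lambda bs: np.array([iou(b, target) for b in bs])
    transform_fn = lambda bs: 0.3 * (target - bs)    # toy transform: drift toward target

    for _ in range(10):                              # progressively shrink the graph
        boxes = refine_step(boxes, score_fn, transform_fn, keep_thresh=0.05)
    print(len(boxes), "boxes remain; best IoU:", max(score_fn(boxes), default=0.0))

In the actual M-DGT, the per-node transformation and keep/delete decisions are predicted by the graph transformer from multi-modal information and the graph features, rather than computed from a known target as in this toy.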
Keywords
Vision + language, Recognition: detection, categorization, retrieval, Scene analysis and understanding