SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention
CoRR (2024)
Abstract
3D visual grounding aims to automatically locate the 3D region of a
specified object given the corresponding textual description. Existing works
fail to distinguish similar objects, especially when multiple referred objects
are involved in the description. Experiments show that direct matching of the
language and visual modalities has limited capacity to comprehend complex
referential relationships in utterances, mainly due to interference from
redundant visual information during cross-modal alignment. To strengthen
relation-oriented mapping between the modalities, we propose SeCG, a
semantic-enhanced relational learning model based on a graph network with our
designed memory graph attention layer. Our method replaces the original
language-independent encoding with cross-modal encoding in visual analysis.
More text-related feature expressions are obtained through the guidance of
global semantics and implicit relationships. Experimental results on the
ReferIt3D and ScanRefer benchmarks show that the proposed method outperforms
existing state-of-the-art methods, particularly improving localization
performance on multi-relation challenges.
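The abstract describes a graph attention layer whose scores are biased by a global semantic (text) vector, so that object-to-object attention becomes language-guided rather than language-independent. The abstract gives no implementation details, so the following is only a minimal single-head sketch of that general idea; the function name, projection matrices, and shapes are all hypothetical, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def memory_graph_attention(nodes, memory, W, Wm, a):
    """Hypothetical 'memory graph attention' sketch.

    Each object node attends over all nodes on a fully connected
    graph, with a projected global memory/text vector added to the
    key side so the attention scores are language-guided.

    nodes:  (N, d)  object (node) features
    memory: (d,)    global semantic vector from the utterance
    W, Wm:  (d, d') node and memory projections
    a:      (2*d',) scoring vector, as in additive graph attention
    """
    h = nodes @ W            # (N, d') projected node features
    m = memory @ Wm          # (d',)   projected memory vector
    n = len(h)
    # pairwise scores; the memory vector biases every key
    scores = np.array([
        [a @ np.concatenate([h[i], h[j] + m]) for j in range(n)]
        for i in range(n)
    ])
    alpha = softmax(scores, axis=-1)  # (N, N) attention weights
    return alpha @ h                  # (N, d') relation-aware features
```

A caller would stack such layers over per-object visual features, with `memory` taken from a sentence encoder; the actual SeCG layer is more elaborate than this additive-attention sketch.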