Graph Pooling Inference Network for Text-based VQA

ACM Transactions on Multimedia Computing, Communications, and Applications (2024)

Abstract
Effectively leveraging objects and optical character recognition (OCR) tokens to reason about pivotal scene text is critical for the challenging Text-based Visual Question Answering (TextVQA) task. Graph-based models can effectively capture the semantic relationships among visual entities (objects and tokens) and achieve remarkable performance in TextVQA. However, previous efforts usually leverage all visual entities and ignore the negative effect of superfluous ones. This article presents a Graph Pooling Inference Network (GPIN), an evolutionary graph learning method that purifies the visual entities and captures the core semantics. It is observed that densely distributed duplicate objects and clusters of semantically dependent OCR tokens usually co-exist in an image. Motivated by this, GPIN adopts an adaptive node-dropping strategy that dynamically prunes semantically close nodes as the graph evolves and updates. To deepen the comprehension of scene text, GPIN uses a dual-path hierarchical graph architecture that progressively aggregates the semantics of the evolved object graph and the evolved token graph into a graph vector, which serves as a visual cue to facilitate answer reasoning. This design effectively eliminates object redundancy and strengthens the association of semantically continuous tokens. Experiments on the TextVQA and ST-VQA datasets show that GPIN achieves promising performance compared with state-of-the-art methods.
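The node-dropping idea sketched in the abstract resembles score-based top-k graph pooling: each node receives a learned relevance score, low-scoring (redundant) nodes are dropped, and the adjacency is restricted to the survivors. The sketch below is an illustration of that generic mechanism only, not the paper's actual GPIN implementation; the scoring vector `w` would be learned end-to-end in practice and is randomly initialised here for demonstration.

```python
import numpy as np

def topk_node_drop(X, A, ratio=0.5, w=None):
    """Score-based node dropping (top-k graph pooling sketch).

    X: (n, d) node feature matrix; A: (n, n) adjacency matrix.
    ratio: fraction of nodes retained after pooling.
    w: (d,) scoring vector; learned in a real model, random here.
    Returns pooled features, pooled adjacency, and kept node indices.
    """
    n, d = X.shape
    if w is None:
        rng = np.random.default_rng(0)
        w = rng.standard_normal(d)
    # Project each node's features onto w to get a scalar relevance score.
    scores = X @ w / np.linalg.norm(w)
    k = max(1, int(np.ceil(ratio * n)))
    # Keep the k highest-scoring nodes (sorted to preserve original order).
    keep = np.sort(np.argsort(scores)[-k:])
    # Gate retained features by their scores so the scorer receives gradient.
    gate = np.tanh(scores[keep])[:, None]
    return X[keep] * gate, A[np.ix_(keep, keep)], keep
```

Applying this pooling step repeatedly, then summing or averaging the surviving node features into a single graph vector, mirrors the kind of hierarchical aggregation the abstract describes for the object and token paths.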
Keywords
Text-based visual question answering, graph inference, graph pooling