What Are You Talking About? Text-to-Image Coreference

Chen Kong, Dahua Lin, Mohit Bansal, Raquel Urtasun, Sanja Fidler

CVPR (2014)

Abstract

In this paper we exploit natural sentential descriptions of RGB-D scenes in order to improve 3D semantic parsing. Importantly, in doing so, we reason about which particular object each noun/pronoun is referring to in the image. This allows us to utilize visual information in order to disambiguate the so-called coreference resolution problem that arises in text. Towards this goal, we propose a structure prediction model that exploits potentials computed from text and RGB-D imagery to reason about the class of the 3D objects, the scene type, as well as to align the nouns/pronouns with the referred visual objects. We demonstrate the effectiveness of our approach on the challenging NYU-RGBD v2 dataset, which we enrich with natural lingual descriptions. We show that our approach significantly improves 3D detection and scene classification accuracy, and is able to reliably estimate the text-to-image alignment. Furthermore, by using textual and visual information, we are also able to successfully deal with coreference in text, improving upon the state-of-the-art Stanford coreference system [15].

Keywords

text-to-image coreference, visual information, 3D object detection, RGB-D scenes, 3D semantic parsing, natural lingual descriptions, Stanford coreference system, image resolution, text and images, scene understanding, natural sentential descriptions, NYU-RGBD v2 dataset, noun-pronoun, natural language processing, structure prediction model, text-to-image alignment, text analysis, visualization, image segmentation, accuracy, solid modeling, semantics
