Watch and Read! A Visual Relation-Aware and Textual Evidence Enhanced Model for Multimodal Relation Extraction

2024 27th International Conference on Computer Supported Cooperative Work in Design (CSCWD), 2024

Abstract
Multimodal relation extraction (MRE) aims to predict the semantic relation between two entities given a hybrid context of a text and its related image. Though existing MRE methods have explored different strategies to fuse multimodal information, they suffer from two limitations. First, they ignore fine-grained visual relations between objects, which can provide important hints for inferring the correct relation. Second, they neglect informative textual evidence from the image, leading to a performance decline when processing text-intensive images. To address the above issues, we propose a novel MRE model, named VRTE, which takes full advantage of both Visual Relations and Textual Evidence to determine the final relation label. Specifically, the input image-text pair is transformed into two scene graphs, which are further bridged into a unified multimodal graph. Next, a relation-aware Transformer is used to propagate information in the multimodal graph while explicitly encoding diverse relations among visual objects and textual tokens via learnable relation embeddings. In addition, a cross-attention mechanism is used to capture valuable textual information in the OCR results and image captions, which is combined with the representations of entity nodes and the original text to predict the final relation label. Experimental results on the MNRE dataset demonstrate the effectiveness of the proposed model. Extensive ablation studies are also conducted to analyse the contributions of different modules.
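
The abstract does not give the exact form of the relation-aware attention, so the following is only an illustrative sketch of one common way to inject learnable relation embeddings into attention over a graph: the embedding of the relation between nodes i and j is added to node j's key and value. The function names, the relation-type ids, and the omission of learned query/key/value projections are all assumptions made for brevity; this is not the VRTE implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relation_aware_attention(nodes, rel_ids, rel_emb):
    """One attention pass over a graph with typed edges (hypothetical helper).

    nodes   : list of n node vectors, each of dimension d
    rel_ids : n x n matrix; rel_ids[i][j] is the relation-type id between i and j
    rel_emb : one d-dimensional embedding per relation type (learned in practice)
    """
    d = len(nodes[0])
    outputs = []
    for i, query in enumerate(nodes):
        # Score each neighbour j using its key shifted by the relation embedding.
        scores = []
        for j, key in enumerate(nodes):
            rel = rel_emb[rel_ids[i][j]]
            shifted_key = [k + r for k, r in zip(key, rel)]
            dot = sum(q * k for q, k in zip(query, shifted_key))
            scores.append(dot / math.sqrt(d))
        weights = softmax(scores)

        # Aggregate values, again shifted by the relation embedding.
        context = [0.0] * d
        for j, w in enumerate(weights):
            rel = rel_emb[rel_ids[i][j]]
            for t in range(d):
                context[t] += w * (nodes[j][t] + rel[t])
        outputs.append(context)
    return outputs


# Toy multimodal graph: two visual objects and one text token.
# Relation ids are illustrative: 0 = self/none, 1 = visual relation,
# 2 = cross-modal alignment between an object and a token.
nodes = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
rel_ids = [
    [0, 1, 2],
    [1, 0, 0],
    [2, 0, 0],
]
rel_emb = [[0.0, 0.0], [0.3, -0.1], [-0.2, 0.4]]
updated = relation_aware_attention(nodes, rel_ids, rel_emb)
```

In a trained model the relation embeddings would be parameters updated by backpropagation, and the node vectors would pass through learned projections before scoring; both are left out here to keep the role of the relation term visible.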
Key words
multimodal relation extraction, scene graph generation, relation-aware Transformer