Unified Referring Expression Generation for Bounding Boxes and Segmentations

IEEE Signal Processing Letters (2024)

Abstract
Referring expression generation (REG) is a challenging task at the intersection of computer vision and natural language processing, which aims to generate a natural language description that uniquely refers to a specific object within an image. Existing REG approaches use bounding boxes alone, in a rather primitive manner, to specify target objects, and employ classical Convolutional Neural Networks (CNNs) for image encoding followed by recurrent layers for text generation. In this letter, we propose a novel end-to-end REG model. Our model highlights the target using bounding boxes and segmentations in a unified fashion. Specifically, we propose two settings for utilizing these signals: employing them as inputs to the model, and as supervision signals for pre-training tasks. Additionally, we harness the now-prevalent self-attention architecture to bridge target-focused visual cues and their textual correspondence. During inference, our method achieves state-of-the-art performance in a one-stage manner, reflecting the potential of both bounding box and segmentation references in constructing REG solutions.
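The "inputs to the model" setting described above can be illustrated with a minimal sketch: one common way to feed both signals to an encoder is to rasterize the bounding box into a binary mask and stack it, together with the segmentation mask, as extra channels of the image tensor. The function name and channel layout below are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def build_target_channels(image, bbox, seg_mask):
    """Stack an (H, W, 3) image with a binary box mask and a segmentation
    mask, producing an (H, W, 5) input that highlights the referred object.

    bbox is (x0, y0, x1, y1) in pixel coordinates; seg_mask is (H, W) binary.
    This is an illustrative encoding, not the paper's exact scheme.
    """
    h, w, _ = image.shape
    x0, y0, x1, y1 = bbox
    # Rasterize the bounding box into a binary highlight channel.
    box_mask = np.zeros((h, w), dtype=image.dtype)
    box_mask[y0:y1, x0:x1] = 1.0
    # Concatenate image, box channel, and segmentation channel.
    return np.concatenate(
        [image, box_mask[..., None], seg_mask[..., None].astype(image.dtype)],
        axis=-1,
    )
```

The resulting 5-channel tensor can be patch-embedded and fed to a self-attention encoder, so that attention layers can condition text generation on the highlighted region.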
Keywords
Referring expression generation, transformer, visual-linguistic tasks, segmentation