Unified Referring Expression Generation for Bounding Boxes and Segmentations

IEEE Signal Processing Letters (2024)

Abstract
Referring expression generation (REG) is a challenging task at the intersection of computer vision and natural language processing, which aims to generate a natural language description that uniquely refers to a specific object within an image. Existing REG approaches use bounding boxes alone, in a rather primitive manner, to specify target objects, and employ classical Convolutional Neural Networks (CNNs) for image encoding followed by recurrent layers for text generation. In this letter, we propose a novel end-to-end REG model. Our model highlights the target using bounding boxes and segmentations in a unified fashion. Specifically, we propose two settings for utilizing these signals: employing them as inputs to the model, and as supervision signals for pre-training tasks. Additionally, we harness the now-prevalent self-attention architecture to bridge target-focused visual cues and their textual correspondence. During inference, our method achieves state-of-the-art performance in a one-stage manner, reflecting the potential of both bounding box and segmentation references in constructing REG solutions.
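The "inputs to the model" setting described above can be illustrated with a minimal sketch: one common way to feed both signals to an encoder is to rasterize the bounding box into a binary mask and stack it, together with the segmentation mask, as extra channels of the image tensor. The function name and channel layout below are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def build_target_channels(image, bbox, seg_mask):
    """Stack an (H, W, 3) image with a binary box mask and a segmentation
    mask, producing an (H, W, 5) input that highlights the referred object.

    bbox is (x0, y0, x1, y1) in pixel coordinates; seg_mask is (H, W) binary.
    This is an illustrative encoding, not the paper's exact scheme.
    """
    h, w, _ = image.shape
    x0, y0, x1, y1 = bbox
    # Rasterize the bounding box into a binary highlight channel.
    box_mask = np.zeros((h, w), dtype=image.dtype)
    box_mask[y0:y1, x0:x1] = 1.0
    # Concatenate image, box channel, and segmentation channel.
    return np.concatenate(
        [image, box_mask[..., None], seg_mask[..., None].astype(image.dtype)],
        axis=-1,
    )
```

The resulting 5-channel tensor can be patch-embedded and fed to a self-attention encoder, so that attention layers can condition text generation on the highlighted region.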
Keywords
Referring expression generation, transformer, visual-linguistic tasks, segmentation