Cascade Grouped Attention Network for Referring Expression Segmentation
MM '20: The 28th ACM International Conference on Multimedia, Seattle, WA, USA, October 2020
Abstract
Referring expression segmentation (RES) aims to segment the target instance in a given image according to a natural language expression. Its main challenge lies in quickly and accurately aligning the text expression with the referred visual instance. In this paper, we address this issue by proposing a Cascade Grouped Attention Network (CGAN) with two innovative designs: Cascade Grouped Attention (CGA) and an Instance-Level Attention (ILA) loss. Specifically, CGA performs step-wise reasoning over the entire image to perceive the differences between instances accurately yet efficiently, so as to identify the referent. The ILA loss is further embedded into each step of CGA to directly supervise the attention modeling, which improves the alignment between the text expression and the visual instances. Through these two novel designs, CGAN achieves the high efficiency of one-stage RES while possessing reasoning ability comparable to two-stage methods. To validate our model, we conduct extensive experiments on three RES benchmark datasets and achieve significant performance gains over existing one-stage and multi-stage models.
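The abstract describes CGA as a cascade of attention steps in which grouped attention aligns the text expression with visual features. The following is a minimal NumPy sketch of that idea under stated assumptions: the function name, the per-group dot-product attention, and the additive refinement are all hypothetical illustrations, not the paper's actual formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cascade_grouped_attention(visual, text, num_steps=3, num_groups=4):
    """Hypothetical sketch of cascaded grouped attention.

    visual: (n, c) array of n spatial positions with c channels.
    text:   (c,) pooled expression embedding.
    At each cascade step the channels are split into groups; within each
    group, the text query attends over spatial positions, and the visual
    features are refined with the attended response before the next step.
    Returns the refined features and one attention map per step
    (shape (num_groups, n)), which an ILA-style loss could supervise.
    """
    n, c = visual.shape
    assert c % num_groups == 0
    g = c // num_groups
    attn_maps = []
    for _ in range(num_steps):
        refined = np.zeros_like(visual)
        step_attn = []
        for k in range(num_groups):
            v = visual[:, k * g:(k + 1) * g]           # (n, g) group features
            q = text[k * g:(k + 1) * g]                # (g,)  group query
            scores = softmax(v @ q / np.sqrt(g))       # (n,)  attention over positions
            step_attn.append(scores)
            refined[:, k * g:(k + 1) * g] = v + scores[:, None] * q[None, :]
        visual = refined                               # cascade: feed into next step
        attn_maps.append(np.stack(step_attn))
    return visual, attn_maps
```

Because each step exposes its attention maps, a per-step supervision signal (as the ILA loss does for instance-level alignment) can be applied directly to `attn_maps` rather than only to the final mask prediction.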
Keywords
Referring Expression Segmentation, Attention Network