Video Object Segmentation With Referring Expressions

Computer Vision - ECCV 2018 Workshops, Pt. IV (2019)

Abstract
Most semi-supervised video object segmentation methods rely on a pixel-accurate mask of the target object provided for the first video frame. However, obtaining a detailed mask is expensive and time-consuming. In this work we explore a more practical and natural way of identifying a target object: language referring expressions. Leveraging recent advances in language grounding models designed for images, we propose an approach that extends them to video data while ensuring temporally coherent predictions. To evaluate our approach, we augment the popular video object segmentation benchmarks DAVIS16 and DAVIS17 with language descriptions of target objects. We show that our approach performs on par with methods that have access to the object mask on DAVIS16 and is competitive with methods using scribbles on the challenging DAVIS17.