FEELVOS: Fast End-to-End Embedding Learning for Video Object Segmentation.

CVPR, 2019

Cited by: 154 | Views: 298

Abstract

Many of the recent successful methods for video object segmentation (VOS) are overly complicated, heavily rely on fine-tuning on the first frame, and/or are slow, and are hence of limited practical use. In this work, we propose FEELVOS as a simple and fast method which does not rely on fine-tuning. In order to segment a video, for each frame FEELVOS uses a semantic pixel-wise embedding together with a global and a local matching mechanism to transfer information from the first frame and from the previous frame of the video to the current frame. In contrast to previous work, our embedding is only used as an internal guidance of a convolutional network. Our novel dynamic segmentation head allows us to train the network, including the embedding, end-to-end for the multiple object segmentation task with a cross entropy loss. We achieve a new state of the art in video object segmentation without fine-tuning with a J&F measure of 71.5% on the DAVIS 2017 validation set. We make our code and models available at https://github.com/tensorflow/models/tree/master/research/feelvos.

Introduction
  • Video object segmentation (VOS) is a fundamental task in computer vision, with important applications including video editing, robotics, and self-driving cars.
  • The authors focus on the semi-supervised VOS setup in which the ground truth segmentation masks of one or multiple objects are given for the first frame in a video.
  • The task is to automatically estimate the segmentation masks of the given objects for the rest of the video (a minimal sketch of this protocol follows the list).
  • With the recent advances in deep learning and the introduction of the DAVIS datasets [29, 31], there has been tremendous progress in tackling the semi-supervised VOS task.
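
As a concrete illustration of this setup, below is a minimal sketch of the semi-supervised VOS protocol; the driver and the `segment_frame` callable are hypothetical stand-ins for any VOS model, not an API from the paper.

```python
def run_semi_supervised_vos(frames, first_frame_mask, segment_frame):
    """Semi-supervised VOS: ground truth is given only for frame 0;
    every later frame must be segmented automatically.

    frames:           list of H x W x 3 images
    first_frame_mask: H x W int array; 0 = background, 1..N = object ids
    segment_frame:    hypothetical callable
                      (frame, first_frame, first_mask, prev_mask) -> mask
    """
    masks = [first_frame_mask]
    for t in range(1, len(frames)):
        # A method like FEELVOS transfers information from the first
        # frame (appearance reference) and from the previous frame
        # (temporal continuity) to the current frame.
        masks.append(segment_frame(frames[t], frames[0],
                                   first_frame_mask, masks[-1]))
    return masks
```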
Highlights
  • Video object segmentation (VOS) is a fundamental task in computer vision, with important applications including video editing, robotics, and self-driving cars
  • We focus on the semi-supervised VOS setup in which the ground truth segmentation masks of one or multiple objects are given for the first frame in a video
  • We propose Fast End-to-End Embedding Learning for Video Object Segmentation (FEELVOS) to meet all of our design goals
  • We achieve a new state of the art for multi-object segmentation without fine-tuning on the DAVIS 2017 validation dataset, with a J&F mean score of 71.5%
  • We propose FEELVOS, which learns a semantic embedding for segmenting multiple objects in an end-to-end way
  • We showed experimentally that each component of FEELVOS is highly effective, and we achieve new state-of-the-art results on DAVIS 2017 for VOS without fine-tuning
Methods
  • The idea of the embedding space is that pixels belonging to the same object instance lie close together in the embedding space, while pixels belonging to distinct objects lie far apart
  • Note that this is not explicitly enforced: instead of using distances in the embedding space directly to produce a segmentation, as in PML [6] or VideoMatch [17], the authors use them as a soft cue which can be refined by the dynamic segmentation head (see the sketch after this list)
  • In practice, the embedding nevertheless behaves this way, since proximity in embedding space delivers a strong cue to the dynamic segmentation head for the final segmentation
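
To make this concrete, here is a minimal NumPy sketch of the embedding distance d(p, q) = 1 - 2 / (1 + exp(||e_p - e_q||^2)) and of global matching to the first frame, following the paper's description; the array shapes and the brute-force formulation are illustrative, and a real implementation would tile this computation on GPU.

```python
import numpy as np

def embedding_distance(e_p, e_q):
    """d(p, q) = 1 - 2 / (1 + exp(||e_p - e_q||^2)), in [0, 1).
    Identical embeddings give 0; far-apart embeddings saturate at 1."""
    sq = np.sum((e_p - e_q) ** 2, axis=-1)
    return 1.0 - 2.0 / (1.0 + np.exp(sq))

def global_matching(cur_emb, ref_emb, ref_mask, obj_id):
    """For every pixel of the current frame, the minimum embedding
    distance to any first-frame pixel belonging to object `obj_id`.

    cur_emb:  H x W x C embedding of the current frame
    ref_emb:  H x W x C embedding of the first frame
    ref_mask: H x W int mask with first-frame ground-truth object ids
    """
    ref_vecs = ref_emb[ref_mask == obj_id]            # P x C reference pixels
    if ref_vecs.size == 0:                            # object absent
        return np.ones(cur_emb.shape[:2])
    flat = cur_emb.reshape(-1, 1, cur_emb.shape[-1])  # (H*W) x 1 x C
    d = embedding_distance(flat, ref_vecs[None])      # broadcast: (H*W) x P
    return d.min(axis=1).reshape(cur_emb.shape[:2])   # H x W distance map
```

These distance maps are not turned into a segmentation directly: FEELVOS stacks them with backbone features and the previous-frame predictions and feeds them to the dynamic segmentation head, which is trained end-to-end with a cross-entropy loss.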
Results
  • Compared methods: OSMN [40], FAVOS [7], PML [6], VideoMatch [17], RGMP [37] (with and without simulated data), and FEELVOS; in the result tables, FT denotes fine-tuning and t/s denotes time per frame in seconds
  • FEELVOS fails to segment some parts of the back of the cat.
  • This is most likely because the back texture was not seen in the first frame.
  • Afterwards, FEELVOS is able to recover from that error
Conclusion
  • The authors started with the observation that there are many strong methods for VOS, but many of them lack practical usability
  • Based on this insight, the authors defined several design goals which a practically useful method for VOS should fulfill.
  • Most importantly, the authors aim for a fast and simple method which nevertheless achieves strong results
  • To this end, the authors propose FEELVOS which learns a semantic embedding for segmenting multiple objects in an end-to-end way.
Tables
  • Table1: Design goals overview. The table shows which of our design goals (described in more detail in the text) are achieved by recent methods. Our method is the only one which fulfills all design goals
  • Table2: Quantitative results on the DAVIS 2017 validation set. FT denotes fine-tuning, and t/s denotes time per frame in seconds
  • Table3: Quantitative results on the DAVIS 2017 test-dev set. FT denotes fine-tuning, and t/s denotes time per frame in seconds
  • Table4: Quantitative results on the DAVIS 2016 validation set. FT denotes fine-tuning, and t/s denotes time per frame in seconds
  • Table5: Quantitative results on the YouTube-Objects dataset. FT denotes fine-tuning, and t/s denotes time per frame in seconds
  • Table6: Ablation study on DAVIS 2017. FF and PF denote first frame and previous frame, respectively, and GM and LM denote global matching and local matching. PFP denotes using the previous frame predictions as input to the dynamic segmentation head (a reference sketch of local matching follows the list)
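
Since Table 6 contrasts global matching (GM) with local previous-frame matching (LM), here is a deliberately naive reference sketch of LM, reusing the distance function from the Methods sketch above. The paper restricts matching to a spatial window around each pixel; the `radius` default below is an illustrative assumption, not the paper's tuned value, and a practical implementation would vectorize the loops.

```python
import numpy as np

def local_matching(cur_emb, prev_emb, prev_mask, obj_id, radius=12):
    """For every pixel p of the current frame, the minimum embedding
    distance to previous-frame pixels predicted as `obj_id` inside a
    (2*radius + 1)^2 window around p. Pixels with no such neighbour
    keep the maximum distance of 1."""
    H, W, _ = cur_emb.shape
    out = np.ones((H, W))
    ys, xs = np.nonzero(prev_mask == obj_id)  # previous-frame object pixels
    for y in range(H):
        for x in range(W):
            near = (np.abs(ys - y) <= radius) & (np.abs(xs - x) <= radius)
            if not near.any():
                continue  # object not present in the local window
            sq = np.sum((prev_emb[ys[near], xs[near]]
                         - cur_emb[y, x]) ** 2, axis=-1)
            out[y, x] = (1.0 - 2.0 / (1.0 + np.exp(sq))).min()
    return out
```

The ablation's point is visible in the code: LM only searches where the object can plausibly have moved, which makes the previous-frame cue both cheaper and far less ambiguous than matching globally.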
Related work
  • Video Object Segmentation with First-Frame Finetuning. Many approaches for semi-supervised video object segmentation rely on fine-tuning using the first-frame ground truth. OSVOS [1] uses a convolutional network, pre-trained for foreground-background segmentation, and fine-tunes it on the first-frame ground truth of the target video at test time. OnAVOS [35, 34] and OSVOS-S [27] extend OSVOS by an online adaptation mechanism, and by semantic information from an instance segmentation network, respectively. Another approach is to learn to propagate the segmentation mask from one frame to the next using optical flow as done by MaskTrack [28]. This approach is extended by LucidTracker [20] which introduces an elaborate data augmentation mechanism. Hu et al. [15] propose a motion-guided cascaded refinement network which works on a coarse segmentation from an active contour model. MaskRNN [16] uses a recurrent neural network to fuse the output of two deep networks. Location-sensitive embeddings used to refine an initial foreground prediction are explored in LSE [9]. MoNet [38] exploits optical flow motion cues by feature alignment and a distance transform layer. Using reinforcement learning to estimate a region of interest to be segmented is explored by Han et al. [13]. DyeNet [22] uses a deep recurrent network which combines a temporal propagation and a re-identification module. PReMVOS [26, 24, 25] combines four different neural networks together with extensive fine-tuning and a merging algorithm and won the 2018 DAVIS Challenge [2] and also the 2018 YouTube-VOS challenge [39].
Funding
  • We achieve a new state of the art in video object segmentation without fine-tuning, with a J&F measure of 71.5% on the DAVIS 2017 validation set
  • Strong: The system should deliver strong results, with more than 65% J&F score on the DAVIS 2017 validation set
  • We achieve a new state of the art for multi-object segmentation without fine-tuning on the DAVIS 2017 validation dataset, with a J&F mean score of 71.5%
  • OSMN [40], FAVOS [7], PML [6], and VideoMatch [17] all achieve very high speed and effectively bypass fine-tuning, but we show that the proposed FEELVOS produces significantly better results
  • For non-fine-tuning methods, FEELVOS achieves a new state of the art with a J&F score of 71.5%, which is 4.8% higher than RGMP and 2.4% higher when not using YouTube-VOS data for training (denoted by -YTB-VOS)
  • On the DAVIS 2017 test-dev set, FEELVOS achieves a J&F score of 57.8%, which is 4.9% higher than the result of RGMP [37]
  • The results drop even more, to 54.9%, which shows that matching to the previous frame using the learned embedding is extremely important for achieving good results
  • We showed that each component of FEELVOS is useful, that matching in embedding space to the previous frame is extremely effective, and that the proposed local previous-frame matching performs significantly better than globally matching to the previous frame
  • We showed experimentally that each component of FEELVOS is highly effective, and we achieve new state-of-the-art results on DAVIS 2017 for VOS without fine-tuning (a sketch of the J&F metric follows this list)
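
Since the J&F score is the headline metric throughout, here is a minimal sketch of how it is commonly computed on DAVIS: J is the region IoU of the predicted and ground-truth masks, F is a boundary F-measure, and J&F is their mean. The boundary extraction below is a simplified stand-in (the official DAVIS toolkit matches boundaries with a spatial tolerance), so treat it as an approximation, not the official evaluation code.

```python
import numpy as np

def jaccard(pred, gt):
    """J: region similarity, the IoU of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0

def boundary(mask):
    """Simplified boundary map: pixels whose right/down neighbour differs."""
    m = mask.astype(bool)
    b = np.zeros_like(m)
    b[:-1] |= m[:-1] != m[1:]
    b[:, :-1] |= m[:, :-1] != m[:, 1:]
    return b

def f_measure(pred, gt):
    """F: boundary precision/recall F-score (no tolerance band here,
    unlike the official DAVIS evaluation)."""
    bp, bg = boundary(pred), boundary(gt)
    if bp.sum() == 0 or bg.sum() == 0:
        return float(bp.sum() == bg.sum())
    tp = np.logical_and(bp, bg).sum()
    precision, recall = tp / bp.sum(), tp / bg.sum()
    return 0.0 if precision + recall == 0 else \
        2 * precision * recall / (precision + recall)

def j_and_f(pred, gt):
    """J&F mean, as reported on the DAVIS benchmarks."""
    return 0.5 * (jaccard(pred, gt) + f_measure(pred, gt))
```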
References
  • S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixe, D. Cremers, and L. Van Gool. One-shot video object segmentation. In CVPR, 2017. 2, 7
  • S. Caelles, A. Montes, K.-K. Maninis, Y. Chen, L. Van Gool, F. Perazzi, and J. Pont-Tuset. The 2018 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1803.00557, 2018. 3
  • L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. PAMI, 2017. 6
  • L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017. 6
  • L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018. 3, 6
  • Y. Chen, J. Pont-Tuset, A. Montes, and L. Van Gool. Blazingly fast video object segmentation with pixel-wise metric learning. In CVPR, 2018. 1, 2, 3, 4, 7
  • J. Cheng, Y.-H. Tsai, W.-C. Hung, S. Wang, and M.-H. Yang. Fast and accurate online video object segmentation via tracking parts. In CVPR, 2018. 1, 3, 6, 7
  • F. Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017. 6
  • H. Ci, C. Wang, and Y. Wang. Video object segmentation by learning location-sensitive embeddings. In ECCV, 2018. 2
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009. 6
  • A. Dosovitskiy, P. Fischery, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. v. d. Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. In ICCV, 2015. 4
  • A. Fathi, Z. Wojna, V. Rathod, P. Wang, H. O. Song, S. Guadarrama, and K. P. Murphy. Semantic instance segmentation via deep metric learning. arXiv preprint arXiv:1703.10277, 2017. 3, 4
  • J. Han, L. Yang, D. Zhang, X. Chang, and X. Liang. Reinforcement cutting-agent learning for video object segmentation. In CVPR, 2018. 2
  • A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017. 6
  • P. Hu, G. Wang, X. Kong, J. Kuen, and Y.-P. Tan. Motionguided cascaded refinement network for video object segmentation. In CVPR, 2018. 2
  • Y.-T. Hu, J.-B. Huang, and A. G. Schwing. MaskRNN: Instance level video object segmentation. In NeurIPS, 2017. 2
  • Y.-T. Hu, J.-B. Huang, and A. G. Schwing. Videomatch: Matching based video object segmentation. In ECCV, 2018. 1, 3, 4, 6, 7
  • S. Ioffe and C. Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, 2015. 6
  • S. D. Jain and K. Grauman. Supervoxel-consistent foreground propagation in video. In ECCV, 2014. 6, 7
  • A. Khoreva, R. Benenson, E. Ilg, T. Brox, and B. Schiele. Lucid data dreaming for object tracking. arXiv preprint arXiv:1703.09554, 2017. 2
  • S. Li, B. Seybold, A. Vorobyov, A. Fathi, Q. Huang, and C.-C. J. Kuo. Instance embedding transfer to unsupervised video object segmentation. In CVPR, 2018. 3
  • X. Li and C. Change Loy. Video object segmentation with joint re-identification and attention-aware mask propagation. In ECCV, 2018. 2, 6
  • T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014. 6
  • J. Luiten, P. Voigtlaender, and B. Leibe. PReMVOS: Proposal-generation, refinement and merging for the davis challenge on video object segmentation 2018. The 2018 DAVIS Challenge on Video Object Segmentation - CVPR Workshops, 2018. 1, 2, 6
  • J. Luiten, P. Voigtlaender, and B. Leibe. PReMVOS: Proposal-generation, refinement and merging for the YouTube-VOS challenge on video object segmentation 2018. The 1st Large-scale Video Object Segmentation Challenge ECCV Workshops, 2018. 1, 2, 6
  • J. Luiten, P. Voigtlaender, and B. Leibe. PReMVOS: Proposal-generation, refinement and merging for video object segmentation. In ACCV, 2018. 1, 2, 6, 7
  • K.-K. Maninis, S. Caelles, Y. Chen, J. Pont-Tuset, L. Leal-Taixe, and L. Van Gool. Video object segmentation without temporal information. PAMI, 2018. 2, 6
  • F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and A. Sorkine-Hornung. Learning video object segmentation from static images. In CVPR, 2017. 2, 3, 7
  • F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In CVPR, 2016. 1, 6
  • T. Pohlen, A. Hermans, M. Mathias, and B. Leibe. Full-resolution residual networks for semantic segmentation in street scenes. In CVPR, 2017. 6
  • J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbelaez, A. Sorkine-Hornung, and L. Van Gool. The 2017 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017. 1, 6
  • A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari. Learning object class detectors from weakly annotated video. In CVPR, 2012. 6, 7
  • H. Qi, Z. Zhang, B. Xiao, H. Hu, B. Cheng, Y. Wei, and J. Dai. Deformable convolutional networks – coco detection and segmentation challenge 2017 entry. ICCV COCO Challenge Workshop, 2017. 6
  • P. Voigtlaender and B. Leibe. Online adaptation of convolutional neural networks for the 2017 DAVIS challenge on video object segmentation. The 2017 DAVIS Challenge on Video Object Segmentation - CVPR Workshops, 2017. 2, 6
  • P. Voigtlaender and B. Leibe. Online adaptation of convolutional neural networks for video object segmentation. In BMVC, 2017. 1, 2, 6, 7
  • Z. Wu, C. Shen, and A. v. d. Hengel. Bridging category-level and instance-level semantic image segmentation. arXiv preprint arXiv:1605.06885, 2016. 6
  • S. Wug Oh, J.-Y. Lee, K. Sunkavalli, and S. Joo Kim. Fast video object segmentation by reference-guided mask propagation. In CVPR, 2018. 1, 3, 5, 6, 7
  • H. Xiao, J. Feng, G. Lin, Y. Liu, and M. Zhang. MoNet: Deep motion exploitation for video object segmentation. In CVPR, 2018. 2
  • N. Xu, L. Yang, Y. Fan, J. Yang, D. Yue, Y. Liang, B. Price, S. Cohen, and T. Huang. YouTube-VOS: Sequence-to-sequence video object segmentation. In ECCV, 2018. 3, 6
  • L. Yang, Y. Wang, X. Xiong, J. Yang, and A. K. Katsaggelos. Efficient video object segmentation via network modulation. In CVPR, 2018. 1, 3, 6, 7