Video Object Segmentation with Adaptive Feature Bank and Uncertain-Region Refinement

NeurIPS 2020 (2020)


Abstract

We propose a new matching-based framework for semi-supervised video object segmentation (VOS). Recently, state-of-the-art VOS performance has been achieved by matching-based algorithms, in which feature banks are created to store features for region matching and classification. However, how to effectively organize information in the con…
Introduction
  • Video object segmentation (VOS) is a fundamental step in many video processing tasks, like video editing and video inpainting.
  • Explicit approaches learn object appearance explicitly.
  • They often formulate segmentation as pixel-wise classification in a learned embedding space [31, 4, 24, 11, 33, 18, 12, 25, 17].
  • These approaches first construct an embedding space to memorize the object appearance, then segment subsequent frames by computing feature similarity.
  • They are therefore called matching-based methods.
  • Matching-based methods achieve state-of-the-art results on VOS benchmarks.
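The matching pipeline the bullets above describe — store features for the target object, then classify each pixel of a new frame by its similarity to the stored features — can be sketched as follows. The function and array names here are illustrative assumptions, not the paper's actual code:

```python
import numpy as np

def match_segment(query_feats, bank_feats, bank_labels):
    """Classify each query pixel by its most similar feature in the bank.

    query_feats: (N, C) per-pixel embeddings of the current frame.
    bank_feats:  (M, C) features stored from previous frames.
    bank_labels: (M,)   object label of each stored feature.
    """
    # Cosine similarity between every query pixel and every bank entry.
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    b = bank_feats / np.linalg.norm(bank_feats, axis=1, keepdims=True)
    sim = q @ b.T                  # (N, M) similarity matrix
    nearest = sim.argmax(axis=1)   # index of best-matching bank feature
    return bank_labels[nearest]    # predicted label per pixel
```

Real matching-based methods (e.g. STM [25]) use learned key/value attention rather than a hard nearest-neighbor assignment, but the underlying idea — segmentation as similarity lookup against a memory of features — is the same.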
Highlights
  • Video object segmentation (VOS) is a fundamental step in many video processing tasks, like video editing and video inpainting
  • We propose an adaptive feature bank (AFB) to organize the target object features
  • The main contributions of this work are threefold: (1) We propose an adaptive and efficient feature bank to maintain the most useful information for video object segmentation
  • We evaluated our model (AFB-URR) on DAVIS17 [28] and YouTube-VOS18 [35], two large-scale VOS benchmarks with multiple objects
  • We present a novel framework for semi-supervised video object segmentation
  • Our method significantly outperforms the existing methods
  • The uncertain-region refinement is designed for refining ambiguous regions
Methods
  • In Table 3, RVOS [30] and A-GAME [12] achieved lower scores because they failed to recognize the object of interest after 1K frames.
  • The authors found that STM [25] was able to store at most 50 frames per video.
  • The authors tuned this hyper-parameter and left the others unchanged.
Results
  • The validation set contains 474 videos with first-frame annotations.
  • It includes objects from 65 categories seen in training and 26 categories unseen in training.
  • The authors' framework achieves the best overall score of 79.6 because the adaptive feature bank improves robustness and reliability across different scenarios.
  • For videos whose objects have already been seen in the training videos, STM's results are somewhat better than the authors'.
  • The authors' proposed model generalizes well and achieves state-of-the-art performance.
Conclusion
  • The authors present a novel framework for semi-supervised video object segmentation.
  • The authors' framework includes an adaptive feature bank (AFB) module and an uncertain-region refinement (URR) module.
  • The adaptive feature bank effectively organizes key features for segmentation.
  • The uncertain-region refinement is designed for refining ambiguous regions.
  • The authors train the framework by minimizing the typical segmentation cross-entropy loss plus an innovative confidence loss.
  • The authors' approach outperforms the state-of-the-art methods on two large-scale benchmark datasets.
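The training objective described above — segmentation cross-entropy plus a confidence loss — can be sketched minimally as below. The paper's exact confidence loss differs; the ambiguity measure used here (ratio of the second-largest to the largest class probability per pixel) is an assumption for illustration only:

```python
import numpy as np

def total_loss(probs, target, lam=0.5):
    """Cross-entropy plus a penalty on ambiguous pixel predictions.

    probs:  (N, K) per-pixel softmax probabilities.
    target: (N,)   ground-truth class labels.
    lam:    weight of the confidence term (illustrative value).
    """
    n = probs.shape[0]
    # Standard pixel-wise cross-entropy.
    ce = -np.log(probs[np.arange(n), target] + 1e-8).mean()
    # Ambiguity: second-largest / largest probability; near 1 means
    # the network is uncertain between two classes at that pixel.
    sorted_p = np.sort(probs, axis=1)
    ambiguity = sorted_p[:, -2] / (sorted_p[:, -1] + 1e-8)
    return ce + lam * ambiguity.mean()
```

Minimizing the second term pushes the network toward decisive per-pixel predictions, which is the stated purpose of penalizing uncertain regions.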
Tables
  • Table 1: The quantitative evaluation on the validation set of the DAVIS17 benchmark [28] in percentages. +YV indicates the use of YouTube-VOS for training. OL means it needs online learning
  • Table 2: The quantitative evaluation on the validation set of the YouTube-VOS18 benchmark [35] in percentages. OL means it needs online learning
  • Table 3: The quantitative evaluation on the Long-time Video dataset in percentages
  • Table 4: Ablation study using the validation set of the DAVIS17 benchmark [28]
Related work
  • Recent video object segmentation works can be divided into two categories: implicit learning and explicit learning. Implicit learning approaches include detection-based methods [3, 23], which segment the object mask without using temporal information, and propagation-based methods [27, 10, 32, 2, 13, 9], which use masks computed in previous frames to infer the mask in the current frame. These methods often adopt a fully convolutional network (FCN) structure to learn object appearance implicitly in the network weights, so they often require online learning to adapt to new objects in the test video.

    The explicit learning methods first construct an embedding space to memorize the object appearance, then classify each pixel's label by feature similarity. Thus, explicit learning methods are also called matching-based methods. A key issue in matching-based VOS is how to build the embedding space. DMM [36] only uses the first frame's information. RGMP [24], FEELVOS [31], RANet [33] and AGSS [18] store information from the first and the latest frames. VideoMatch [11] and WaterNet [17] store information from several recent frames using a sliding window. STM [25] stores features every T frames (T = 5 in their experiments). However, when the video to segment is long, these static strategies can encounter out-of-memory crashes or miss key frames. The proposed adaptive feature bank (AFB) is the first non-uniform frame-sampling strategy in VOS that flexibly and dynamically manages objects' key features. AFB performs dynamic feature merging and removal, and can handle videos of any length effectively.
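The merge-and-remove behavior attributed to the AFB can be sketched as a small class. The capacity, similarity threshold, merge rule, and least-used eviction policy below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

class AdaptiveFeatureBank:
    """Sketch of a dynamically managed feature bank.

    New features sufficiently similar to a stored entry are absorbed
    into it (merging); when the bank exceeds its budget, the least
    frequently matched entry is discarded (removal), so memory stays
    bounded regardless of video length.
    """

    def __init__(self, capacity=64, merge_thresh=0.95):
        self.capacity = capacity
        self.merge_thresh = merge_thresh
        self.feats = []   # stored unit-norm feature vectors
        self.counts = []  # how often each entry was matched/merged

    def add(self, feat):
        feat = feat / np.linalg.norm(feat)
        if self.feats:
            sims = [float(f @ feat) for f in self.feats]
            i = int(np.argmax(sims))
            if sims[i] > self.merge_thresh:
                # Merge into the closest entry: running mean, renormalized.
                merged = (self.feats[i] * self.counts[i] + feat) / (self.counts[i] + 1)
                self.feats[i] = merged / np.linalg.norm(merged)
                self.counts[i] += 1
                return
        self.feats.append(feat)
        self.counts.append(1)
        if len(self.feats) > self.capacity:
            # Evict the least-used entry to respect the memory budget.
            j = int(np.argmin(self.counts))
            self.feats.pop(j)
            self.counts.pop(j)
```

The key design point is that insertion cost and memory stay constant per frame, which is what lets a bank like this run on arbitrarily long videos where fixed every-T-frames storage (as in STM [25]) eventually exhausts memory.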
Funding
  • This work is partly supported by Louisiana Board of Regents ITRS LEQSF(2018-21)-RD-B-03, and National Science Foundation of USA OIA-1946231
Reference
  • Jeroen CJH Aerts, WJ Wouter Botzen, Kerry Emanuel, Ning Lin, Hans De Moel, and Erwann O Michel-Kerjan. Evaluating flood resilience strategies for coastal megacities. Science, 344(6183):473–475, 2014.
  • Linchao Bao, Baoyuan Wu, and Wei Liu. CNN in MRF: Video Object Segmentation via Inference in a CNN-Based Higher-Order Spatio-Temporal MRF. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5977–5986, Salt Lake City, UT, June 2018. IEEE.
  • Sergi Caelles, Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Laura Leal-Taixé, Daniel Cremers, and Luc Van Gool. One-shot video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 221–230, 2017.
  • Yuhua Chen, Jordi Pont-Tuset, Alberto Montes, and Luc Van Gool. Blazingly fast video object segmentation with pixel-wise metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1189–1198, 2018.
  • Ming-Ming Cheng, Niloy J. Mitra, Xiaolei Huang, Philip H. S. Torr, and Shi-Min Hu. Global contrast based salient region detection. IEEE TPAMI, 37(3):569–582, 2015.
  • Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
  • Bernd Fritzke. A growing neural gas network learns topologies. In Advances in Neural Information Processing Systems, pages 625–632, 1995.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. pages 770–778, 2016.
  • Ping Hu, Gang Wang, Xiangfei Kong, Jason Kuen, and Yap-Peng Tan. Motion-guided cascaded refinement network for video object segmentation. pages 1400–1409.
  • Yuan-Ting Hu, Jia-Bin Huang, and Alexander Schwing. MaskRNN: Instance level video object segmentation. In Advances in Neural Information Processing Systems 30, pages 325–334. Curran Associates, Inc., 2017.
  • Yuan-Ting Hu, Jia-Bin Huang, and Alexander G Schwing. VideoMatch: Matching based video object segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 54–70, 2018.
  • Joakim Johnander, Martin Danelljan, Emil Brissman, Fahad Shahbaz Khan, and Michael Felsberg. A generative appearance model for end-to-end video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8953–8962, 2019.
  • A. Khoreva, R. Benenson, E. Ilg, T. Brox, and B. Schiele. Lucid data dreaming for object tracking. In The 2017 DAVIS Challenge on Video Object Segmentation - CVPR Workshops, 2017.
  • Alexander Kirillov, Yuxin Wu, Kaiming He, and Ross Girshick. PointRend: Image segmentation as rendering. pages 9799–9808.
  • Weicheng Kuo, Anelia Angelova, Jitendra Malik, and Tsung-Yi Lin. ShapeMask: Learning to segment novel objects by refining shape priors. pages 9207–9216.
  • Yin Li, Xiaodi Hou, Christof Koch, James M Rehg, and Alan L Yuille. The secrets of salient object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 280–287, 2014.
  • Yongqing Liang, Navid Jafari, Xing Luo, Qin Chen, Yanpeng Cao, and Xin Li. WaterNet: An adaptive matching pipeline for segmenting water with volatile appearance. Computational Visual Media, pages 1–14, 2020.
  • Huaijia Lin, Xiaojuan Qi, and Jiaya Jia. AGSS-VOS: Attention guided single-shot video object segmentation. pages 3949–3957, 2019.
  • Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755.
  • Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
  • Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • Jonathon Luiten, Paul Voigtlaender, and Bastian Leibe. PReMVOS: Proposal-generation, refinement and merging for video object segmentation. July 2018.
  • K-K Maninis, S Caelles, Y Chen, J Pont-Tuset, L Leal-Taixe, D Cremers, and L Van Gool. Video object segmentation without temporal information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(6):1515–1530, June 2019.
  • Seoung Wug Oh, Joon-Young Lee, Kalyan Sunkavalli, and Seon Joo Kim. Fast video object segmentation by reference-guided mask propagation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7376–7385, June 2018.
  • Seoung Wug Oh, Joon-Young Lee, Ning Xu, and Seon Joo Kim. Video object segmentation using space-time memory networks. pages 9226–9235, 2019.
  • Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
  • F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and A. Sorkine-Hornung. Learning video object segmentation from static images. In Computer Vision and Pattern Recognition, 2017.
  • Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 DAVIS challenge on video object segmentation. arXiv:1704.00675 [cs], March 2018.
  • Jianping Shi, Qiong Yan, Li Xu, and Jiaya Jia. Hierarchical image saliency detection on extended CSSD. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(4):717–729, 2015.
  • Carles Ventura, Miriam Bellver, Andreu Girbau, Amaia Salvador, Ferran Marques, and Xavier Giro-i Nieto. RVOS: End-to-end recurrent network for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5277–5286, 2019.
  • Paul Voigtlaender, Yuning Chai, Florian Schroff, Hartwig Adam, Bastian Leibe, and Liang-Chieh Chen. FEELVOS: Fast end-to-end embedding learning for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9481–9490, 2019.
  • Paul Voigtlaender and Bastian Leibe. Online adaptation of convolutional neural networks for video object segmentation. In BMVC, 2017.
  • Ziqin Wang, Jun Xu, Li Liu, Fan Zhu, and Ling Shao. RANet: Ranking attention network for fast video object segmentation. pages 3978–3987, 2019.
  • Ning Xu, Linjie Yang, Yuchen Fan, Jianchao Yang, Dingcheng Yue, Yuchen Liang, Brian Price, Scott Cohen, and Thomas Huang. YouTube-VOS: Sequence-to-sequence video object segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 585–601, 2018.
  • Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas Huang. YouTube-VOS: A large-scale video object segmentation benchmark. arXiv:1809.03327 [cs], September 2018.
  • Xiaohui Zeng, Renjie Liao, Li Gu, Yuwen Xiong, Sanja Fidler, and Raquel Urtasun. DMM-Net: Differentiable mask-matching network for video object segmentation. pages 3929–3938, 2019.
Author
Yongqing Liang
Navid Jafari
Jim Chen