Video Instance Segmentation

CoRR, 2019.


Abstract:

In this paper we present a new computer vision task, named video instance segmentation. The goal of this new task is simultaneous detection, segmentation and tracking of instances in videos. In other words, it is the first time that the image instance segmentation problem is extended to the video domain. To facilitate research on this new task…

Introduction
  • Segmentation in images and videos is one of the fundamental problems in computer vision.
  • The task of instance segmentation, i.e. simultaneous detection and segmentation of object instances in images, was first proposed by Hariharan et al. [11] and has since attracted a tremendous amount of attention in computer vision due to its importance.
  • Different from image instance segmentation, the new problem aims at simultaneous detection, segmentation and tracking of object instances in videos.
Highlights
  • Segmentation in images and videos is one of the fundamental problems in computer vision
  • We extend the instance segmentation problem in the image domain to the video domain
  • The main difference between our method and the other track-by-detect baselines is the new tracking branch, which is trained end-to-end with the other branches so that useful information can be shared among the multiple tasks
  • We analyze the performance of the baselines. They suffer from a natural disadvantage: they cannot handle objects that appear in intermediate frames
  • We present a new task named video instance segmentation and an accompanying dataset named YouTube-VIS in this work
  • We propose a new method combining single-frame instance segmentation and object tracking, which aims to provide some early explorations towards this task
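At inference time, the end-to-end tracking branch described above is combined with extra cues; per the Table 3 ablation these are detection confidence ("Det"), bounding-box IoU ("IoU"), and category consistency ("Cat"). A minimal sketch of such a combined matching score follows; the log-linear form, the weights, and the helper names are illustrative assumptions, not the paper's exact Equation 3:

```python
import math

def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def matching_score(sim_prob, det_conf, iou, same_category,
                   alpha=1.0, beta=2.0, gamma=10.0):
    """Assignment score between a new detection and an existing track.

    sim_prob:      appearance-similarity probability from the learned
                   tracking branch (stand-in for the branch's output).
    det_conf:      detection confidence of the candidate box ("Det").
    iou:           IoU with the track's most recent box ("IoU").
    same_category: predicted-category agreement ("Cat").
    alpha/beta/gamma are assumed mixing weights, not from the paper.
    """
    return (math.log(max(sim_prob, 1e-8))
            + alpha * math.log(max(det_conf, 1e-8))
            + beta * iou
            + gamma * (1.0 if same_category else 0.0))
```

A detection would then be assigned to the existing track with the highest score, or start a new instance when every score falls below a threshold.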
Methods
  • The authors compare the MaskTrack R-CNN with several baselines on the new dataset YouTube-VIS.
  • The authors first present the information of the dataset splits and implementation details of the method.
  • The authors randomly split the YouTube-VIS dataset into 2,238 training videos, 302 validation videos and 343 test videos.
  • Each of the validation and test sets is guaranteed to contain more than 4 instances per category.
  • The authors present results on both the validation set and test set in the results section.
  • The backbone of the network is based on the ResNet-50-FPN structure in [12], and the authors use a public implementation [4] which is pretrained on MS COCO
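The random 2,238/302/343 split described above can be reproduced with a simple seeded shuffle; the seed value and the video-id format below are illustrative assumptions:

```python
import random

def split_videos(video_ids, n_train=2238, n_val=302, n_test=343, seed=0):
    """Partition video ids into the train/val/test sizes reported
    for YouTube-VIS (2,238 / 302 / 343 videos)."""
    assert len(video_ids) == n_train + n_val + n_test
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])

train, val, test = split_videos([f"vid_{i:04d}" for i in range(2883)])
```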
Results
  • SeqTracker does not depend on any visual information and achieves better performance than the other baselines.
  • It is an offline method which requires instance segmentation results to be precomputed for all frames.
  • The result is shown in Table 4 as “Identity Oracle”.
  • It shows that Image Oracle achieves much better performance than Identity Oracle, which means image-level prediction is critical for better performance on video instance segmentation.
  • Even with image-level ground truth, it is still challenging to associate objects across frames due to object occlusions and fast motion
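The SeqTracker observation above, strong results without any visual information given precomputed per-frame instance segmentations, can be illustrated with greedy IoU linking across consecutive frames; the threshold and the greedy strategy are illustrative assumptions rather than the baseline's exact procedure:

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    if inter == 0.0:
        return 0.0
    return inter / ((a[2] - a[0]) * (a[3] - a[1])
                    + (b[2] - b[0]) * (b[3] - b[1]) - inter)

def link_tracks(frames, iou_thresh=0.3):
    """Link precomputed per-frame boxes into tracks using only geometry.

    frames: list over time of lists of boxes (x1, y1, x2, y2).
    Returns tracks as lists of (frame_index, box) pairs.
    """
    tracks, active = [], []  # `active` holds track indices alive at t-1
    for t, boxes in enumerate(frames):
        new_active, used = [], set()
        for ti in active:
            last_box = tracks[ti][-1][1]
            best, best_iou = -1, iou_thresh
            for di, box in enumerate(boxes):       # best unused detection
                v = box_iou(last_box, box)
                if di not in used and v > best_iou:
                    best, best_iou = di, v
            if best >= 0:
                tracks[ti].append((t, boxes[best]))
                used.add(best)
                new_active.append(ti)
        for di, box in enumerate(boxes):           # leftovers start new tracks
            if di not in used:
                tracks.append([(t, box)])
                new_active.append(len(tracks) - 1)
        active = new_active
    return tracks
```

Because no appearance features are used, identity switches under occlusion or fast motion are the expected failure mode, matching the analysis above.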
Conclusion
  • The authors present a new task named video instance segmentation and an accompanying dataset named YouTube-VIS in this work.
  • The new task is a combination of object detection, segmentation, and tracking, which poses specific challenges given the rich and complex video scenes.
  • The authors propose a new method combining single-frame instance segmentation and object tracking, which aims to provide some early explorations towards this task.
  • There are a few interesting future directions: object proposal and detection with spatio-temporal features, an end-to-end trainable matching criterion, and incorporating motion information for better recognition and identity association.
  • The authors believe the new task and new algorithm will inspire the research community with new research ideas and directions for video understanding
Tables
  • Table1: High-level statistics of YouTube-VIS and previous video object segmentation datasets. YTO, YTVOS, and YTVIS stand for YouTube-Objects, YouTube-VOS, and YouTube-VIS respectively
  • Table2: Quantitative evaluation of the proposed algorithm and baselines on the YouTube-VIS validation and test set. The best results are highlighted in bold
  • Table3: Ablation study of our method on the YouTube-VIS validation set. “Det”, “IoU”, and “Cat” denote the detection confidence, the bounding box IoU, and the category consistency in Equation 3 respectively. Numbers in brackets show the difference compared to the complete score
  • Table4: Oracle results in two settings on the validation set. Image oracle gives results with predicted object identities based on ground-truth image-level annotations; identity oracle gives results with ground-truth object identities based on predicted image-level instances
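The AP/AR numbers in Tables 2–4 are computed over spatio-temporal masks. A common definition of the required video-level overlap, and our reading of how such a benchmark evaluates it (an assumption, not quoted from the source), accumulates per-frame intersections and unions before dividing, so that a wrong identity on any single frame lowers the score:

```python
def video_mask_iou(pred, gt):
    """Spatio-temporal IoU between two mask sequences.

    pred, gt: per-frame sets of (row, col) foreground pixels; an empty
    set marks a frame where the instance is absent.  Intersections and
    unions are summed over all frames before dividing.
    """
    inter = sum(len(p & g) for p, g in zip(pred, gt))
    union = sum(len(p | g) for p, g in zip(pred, gt))
    return inter / union if union else 0.0
```

Under this definition, a tracker that segments every frame perfectly but swaps two identities partway through a video scores well below 1 against both ground-truth tracks, which is why the identity oracle in Table 4 isolates the association error.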
Related work
  • Although video instance segmentation has been largely neglected in the literature, several related tasks have been well studied such as image instance segmentation, video object tracking, video object detection, video semantic segmentation and video object segmentation.

    Image Instance Segmentation. Instance segmentation not only groups pixels into different semantic classes, but also groups them into different object instances [11]. A two-stage paradigm is usually adopted, which first generates object proposals using a Region Proposal Network (RPN) [24], and then predicts object bounding boxes and masks using aggregated RoI features [8, 15, 12]. The proposed video instance segmentation task not only requires segmenting object instances in each frame, but also determining the correspondence of objects across frames.

    Video Object Tracking. Video object tracking has two different settings. One is detection-based tracking (DBT), which simultaneously detects and tracks video objects; methods [26, 32, 28] under this setting usually take the “tracking-by-detection” strategy. The other is detection-free tracking [1, 19, 9], which aims to track objects given their initial bounding boxes in the first frame. Of the two settings, DBT is more similar to our problem as it also requires a detector. However, DBT only requires producing bounding boxes, which differs from our task.
Funding
  • Removing any one of them causes AP to drop by around 5%
References
  • [1] Luca Bertinetto, Jack Valmadre, Joao F. Henriques, Andrea Vedaldi, and Philip H. S. Torr. Fully-convolutional siamese networks for object tracking. In ECCV, 2016.
  • [2] Erik Bochinski, Volker Eiselein, and Thomas Sikora. High-speed tracking-by-detection without using image information. In AVSS, pages 1–6. IEEE, 2017.
  • [3] S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool. One-shot video object segmentation. In CVPR, 2017.
  • [4] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. mmdetection. https://github.com/open-mmlab/mmdetection, 2018.
  • [5] Yuhua Chen, Jordi Pont-Tuset, Alberto Montes, and Luc Van Gool. Blazingly fast video object segmentation with pixel-wise metric learning. In CVPR, pages 1189–1198, 2018.
  • [6] J. Cheng, Y.-H. Tsai, S. Wang, and M.-H. Yang. SegFlow: Joint learning for video object segmentation and optical flow. In ICCV, 2017.
  • [7] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
  • [8] Jifeng Dai, Kaiming He, and Jian Sun. Instance-aware semantic segmentation via multi-task network cascades. In CVPR, 2016.
  • [9] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Detect to track and track to detect. In ICCV, pages 3038–3046, 2017.
  • [10] Wei Han, Pooya Khorrami, Tom Le Paine, Prajit Ramachandran, Mohammad Babaeizadeh, Honghui Shi, Jianan Li, Shuicheng Yan, and Thomas S. Huang. Seq-NMS for video object detection. CoRR, abs/1602.08465, 2016.
  • [11] Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Simultaneous detection and segmentation. In ECCV, 2014.
  • [12] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
  • [13] Suyog Dutt Jain and Kristen Grauman. Supervoxel-consistent foreground propagation in video. In ECCV, 2014.
  • [14] Suyog Dutt Jain, Bo Xiong, and Kristen Grauman. FusionSeg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. In CVPR, pages 2117–2126, 2017.
  • [15] Yi Li, Haozhi Qi, Jifeng Dai, Xiangyang Ji, and Yichen Wei. Fully convolutional instance-aware semantic segmentation. In CVPR, 2017.
  • [16] Yule Li, Jianping Shi, and Dahua Lin. Low-latency video semantic segmentation. In CVPR, 2018.
  • [17] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • [18] Anton Milan, Laura Leal-Taixé, Ian Reid, Stefan Roth, and Konrad Schindler. MOT16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831, 2016.
  • [19] Hyeonseob Nam and Bohyung Han. Learning multi-domain convolutional neural networks for visual tracking. In CVPR, 2016.
  • [20] Peter Ochs, Jitendra Malik, and Thomas Brox. Segmentation of moving objects by long term video analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(6):1187–1200, 2013.
  • [21] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and A. Sorkine-Hornung. Learning video object segmentation from static images. In CVPR, 2017.
  • [22] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In CVPR, 2016.
  • [23] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alexander Sorkine-Hornung, and Luc Van Gool. The 2017 DAVIS challenge on video object segmentation. arXiv:1704.00675, 2017.
  • [24] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.
  • [25] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  • [26] Amir Sadeghian, Alexandre Alahi, and Silvio Savarese. Tracking the untrackable: Learning to track multiple cues with long-term dependencies. In ICCV, 2017.
  • [27] Evan Shelhamer, Kate Rakelly, Judy Hoffman, and Trevor Darrell. Clockwork convnets for video semantic segmentation. In ECCV Workshops, pages 852–868, 2016.
  • [28] Jeany Son, Mooyeol Baek, Minsu Cho, and Bohyung Han. Multi-object tracking with quadruplet convolutional neural networks. In CVPR, 2017.
  • [29] Pavel Tokmakov, Karteek Alahari, and Cordelia Schmid. Learning motion patterns in videos. CoRR, abs/1612.07217, 2016.
  • [30] Pavel Tokmakov, Karteek Alahari, and Cordelia Schmid. Learning video object segmentation with visual memory. In ICCV, pages 4481–4490, 2017.
  • [31] Paul Voigtlaender, Yuning Chai, Florian Schroff, Hartwig Adam, Bastian Leibe, and Liang-Chieh Chen. FEELVOS: Fast end-to-end embedding learning for video object segmentation. In CVPR, 2019.
  • [32] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. In ICIP. IEEE, 2017.
  • [33] Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, and Yichen Wei. Flow-guided feature aggregation for video object detection. In ICCV, 2017.
  • [34] Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas S. Huang. YouTube-VOS: A large-scale video object segmentation benchmark. CoRR, abs/1809.03327, 2018.
  • [35] Linjie Yang, Yanran Wang, Xuehan Xiong, Jianchao Yang, and Aggelos K. Katsaggelos. Efficient video object segmentation via network modulation. In CVPR, 2018.
  • [36] Xizhou Zhu, Yuwen Xiong, Jifeng Dai, Lu Yuan, and Yichen Wei. Deep feature flow for video recognition. In CVPR, pages 2349–2358, 2017.