CompFeat: Comprehensive Feature Aggregation for Video Instance Segmentation


Abstract:

Video instance segmentation is a complex task in which we need to detect, segment, and track each object for any given video. Previous approaches only utilize single-frame features for the detection, segmentation, and tracking of objects, and they suffer in the video scenario due to several distinct challenges such as motion blur and drastic appearance changes…
Introduction
Highlights
  • Video instance segmentation (VIS) is a joint task of detection, segmentation and tracking of object instances in videos (Yang, Fan, and Xu 2019)
  • In order to utilize the abundant information in videos and to harvest the benefits of modern object tracking approaches, we propose a comprehensive feature aggregation approach for video instance segmentation, termed CompFeat
  • We propose a comprehensive feature aggregation approach for video instance segmentation, including temporal and spatial attention modules on both frame-level and object-level features
  • We extend this work with a more sophisticated comprehensive feature aggregation approach which greatly boosts the performance on video instance segmentation
  • We develop a comprehensive approach for feature aggregation for video instance segmentation, which is an underexplored direction in this area
  • The spatial attention module can slightly improve the baseline performance by 1%
  • Attention mechanisms are carefully crafted for feature aggregation at both the frame level and the object level, in both the temporal and the spatial dimensions (a minimal sketch follows this list)
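Below is a minimal, self-contained PyTorch sketch of what frame-level temporal and spatial attention for feature aggregation could look like, in the spirit of non-local attention (Wang et al 2018b). The tensor names, shapes, scaling factor, and residual fusion here are illustrative assumptions for exposition, not the authors' implementation; the paper applies the same idea to object-level (RoI) features as well.

    import torch
    import torch.nn.functional as F

    def temporal_attention(curr, refs):
        """Aggregate reference-frame features into the current frame.

        curr: (C, H, W) feature map of the current frame.
        refs: (T, C, H, W) feature maps of T neighboring frames.
        Every current-frame location attends to all locations of all
        reference frames (non-local style); the attended context is
        added back to the current features as a residual.
        """
        C, H, W = curr.shape
        q = curr.reshape(C, H * W).t()                  # (HW, C) queries
        k = refs.permute(1, 0, 2, 3).reshape(C, -1)     # (C, T*HW) keys
        v = k.t()                                       # (T*HW, C) values
        attn = F.softmax(q @ k / C ** 0.5, dim=-1)      # (HW, T*HW)
        agg = (attn @ v).t().reshape(C, H, W)
        return curr + agg

    def spatial_attention(feat):
        """Self-attention within a single frame to capture spatial context."""
        C, H, W = feat.shape
        x = feat.reshape(C, H * W)                      # (C, HW)
        attn = F.softmax(x.t() @ x / C ** 0.5, dim=-1)  # (HW, HW)
        agg = (x @ attn.t()).reshape(C, H, W)
        return feat + agg

    # Toy usage: one current frame and two reference frames.
    curr = torch.randn(256, 24, 40)
    refs = torch.randn(2, 256, 24, 40)
    out = spatial_attention(temporal_attention(curr, refs))
    print(out.shape)  # torch.Size([256, 24, 40])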
Methods
  • [Table/figure residue: metrics AP, AP0.5, AP0.75; configurations "Our Implementation", "Our Implementation + MSCOCO", "Baseline + Temporal Attention", "Baseline + Spatial Attention"; panels (a) Attention on Frame Level, (b) Attention on Object Level, (c) Attention on Both Frame and Object Level.]
  • The authors' re-implementation of Mask-Track RCNN is close to the result reported in the original paper.
  • The authors find that the number of object instances in the YouTube-VIS dataset is limited.
  • The authors choose MSCOCO (Lin et al 2014) as external data, which has a large overlap in object categories with YouTube-VIS (a filtering sketch follows this list).
  • The performance after using external data is listed in Table 1 as well.
  • The authors use this model as a baseline model for all the following ablation experiments
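As a purely hypothetical illustration of how external MSCOCO data might be restricted to the categories it shares with YouTube-VIS before being mixed into training, consider the sketch below. The category subset, file names, and the idea of treating the filtered images as single-frame clips are assumptions for exposition; the exact overlap and training recipe are those described in the paper.

    import json

    # Illustrative subset of categories present in both MSCOCO and
    # YouTube-VIS; the actual overlap used by the authors may differ.
    SHARED_CATEGORIES = {"person", "dog", "cat", "horse", "skateboard",
                         "boat", "airplane", "truck", "bear", "elephant"}

    def filter_coco_to_shared(coco_json_path, out_path):
        """Keep only MSCOCO annotations whose category also appears in
        YouTube-VIS, so the remaining images can be mixed into VIS
        training (e.g., as static clips of length one)."""
        with open(coco_json_path) as f:
            coco = json.load(f)

        keep_cat_ids = {c["id"] for c in coco["categories"]
                        if c["name"] in SHARED_CATEGORIES}
        coco["annotations"] = [a for a in coco["annotations"]
                               if a["category_id"] in keep_cat_ids]
        keep_img_ids = {a["image_id"] for a in coco["annotations"]}
        coco["images"] = [im for im in coco["images"]
                          if im["id"] in keep_img_ids]

        with open(out_path, "w") as f:
            json.dump(coco, f)

    # filter_coco_to_shared("instances_train2017.json", "coco_vis_overlap.json")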
Results
  • Fig. 5 shows some qualitative results of the proposed CompFeat on the YouTube-VIS validation set.
  • Each row represents the predicted results on different frames in a video.
  • CompFeat makes accurate predictions on object categories, bounding boxes, masks and identities under challenging conditions, e.g., multiple similar objects, moderate occlusions, and drastic appearance changes.
  • The last row shows a challenging case with six fish, where the algorithm performs much better than MaskTrack-RCNN (Yang, Fan, and Xu 2019), which misses a fish in the third image.
Conclusion
  • The authors develop a comprehensive approach for feature aggregation for video instance segmentation, which is an underexplored direction in this area.
  • Attention mechanisms are carefully crafted for feature aggregation at both the frame level and the object level, in both the temporal and the spatial dimensions.
  • A new tracking module is designed to enhance the local discriminative power of features with local and global correlation maps, in order to improve the robustness of object tracking and re-identification (a minimal sketch of such correlation maps follows this list).
  • The effectiveness of the proposed modules is systematically evaluated with extensive experiments and ablation studies on the YouTube-VIS dataset
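A minimal sketch of how local and global correlation maps could be computed for the tracking module, using Siamese-style cross-correlation (cf. Valmadre et al 2017; Li et al 2019). The function name, feature sizes, search-window margin, and the way the maps would be fused with the frame features are assumptions for exposition, not the authors' exact module.

    import torch
    import torch.nn.functional as F

    def correlation_maps(obj_feat, frame_feat, box, margin=8):
        """Hypothetical local/global correlation maps for tracking.

        obj_feat:   (C, 7, 7) RoI-pooled feature of one object in frame t.
        frame_feat: (C, H, W) backbone feature map of frame t+1.
        box:        (x0, y0, x1, y1) previous object location on the
                    feature grid, used to crop a local search window.
        The object feature acts as a convolution kernel; the resulting
        maps could be concatenated with frame_feat to make it more
        instance-discriminative before re-identification.
        """
        kernel = obj_feat.unsqueeze(0)  # (1, C, 7, 7)

        # Global correlation over the whole frame.
        global_corr = F.conv2d(frame_feat.unsqueeze(0), kernel, padding=3)

        # Local correlation restricted to a window around the previous box.
        x0, y0, x1, y1 = box
        window = frame_feat[:, max(y0 - margin, 0):y1 + margin,
                               max(x0 - margin, 0):x1 + margin]
        local_corr = F.conv2d(window.unsqueeze(0), kernel, padding=3)

        return global_corr.squeeze(0), local_corr.squeeze(0)

    # Toy usage with a 48x80 frame feature map.
    obj = torch.randn(256, 7, 7)
    frame = torch.randn(256, 48, 80)
    g, l = correlation_maps(obj, frame, box=(20, 10, 34, 24))
    print(g.shape, l.shape)  # (1, 48, 80) and (1, 30, 30)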
Summary
  • Introduction:

    Video instance segmentation (VIS) is a joint task of detection, segmentation and tracking of object instances in videos (Yang, Fan, and Xu 2019).
  • Unlike semi-supervised video object segmentation (Voigtlaender et al 2019a; Voigtlaender and Leibe 2017; Xu et al 2018; Wug Oh et al 2018; Oh et al 2019; Xu et al 2019), video instance segmentation does not require a ground-truth mask in the first frame, and all objects appearing in the video should be processed
  • It has essential applications in many video-based tasks, including video editing, autonomous driving and augmented reality.
Tables
  • Table1: Performance of the baseline model on the YouTube-VIS validation set. “Our Implementation” means our reproduced results of Mask-Track RCNN. “Our Implementation + MSCOCO” is used as the baseline model in all ablation studies hereafter
  • Table2: Ablation Study of our proposed attention module on YouTube-VIS validation set. The best results are highlighted in bold
  • Table3: Ablation Study of the proposed track module on YouTube-VIS validation set. CM, FDA, BDA denote the proposed tracking module with correlation map, the frame level dual attention module and object level dual attention module, respectively. The best results are highlighted in bold
  • Table4: Performance of different sampling methods and different frames during training/testing on the validation set of YouTube-VIS. The performance is reported in AP
  • Table5: Comparison of the proposed approach with the state-of-the-arts on YouTube-VIS validation set. Note that all results in this
Related work
  • In this section, we review video instance segmentation and several closely related tasks such as video object detection and video object tracking.

    Video Object Detection. Video object detection aims to detect all objects in videos, as in the ImageNet VID challenge (Russakovsky et al 2015; Han et al 2016). Feature aggregation is widely used in video detection (Zhu et al 2017; Feichtenhofer, Pinz, and Zisserman 2017; Chen et al 2018; Liu et al 2019). For instance, Zhu et al. proposed to aggregate features from nearby frames to enhance the feature quality of an input frame; however, it is quite slow due to dense detection and optical flow estimation. In (Chen et al 2018), Chen et al. proposed a scale-time lattice to generate detections on sparse key frames and designed an effective temporal propagation approach for the remaining frames. Inspired by these works, we propose to improve feature quality for video instance segmentation via feature aggregation with attention mechanisms.
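As a rough illustration of this kind of temporal feature aggregation, the sketch below averages the features of nearby frames with adaptive per-pixel weights derived from cosine similarity to the current frame, in the spirit of flow-guided feature aggregation (Zhu et al 2017). The optical-flow warping step and all architectural details are omitted, so this is a simplified assumption of the general recipe rather than any specific method's implementation.

    import torch
    import torch.nn.functional as F

    def aggregate_nearby_frames(feats, t):
        """Aggregate per-frame features around frame t.

        feats: (T, C, H, W) features of a short clip (alignment to
               frame t, e.g. via optical flow, is omitted for brevity).
        Weights come from per-pixel cosine similarity to frame t,
        normalized across frames with a softmax, so frames that look
        most like the current one contribute more.
        """
        ref = F.normalize(feats[t], dim=0)             # (C, H, W)
        others = F.normalize(feats, dim=1)             # (T, C, H, W)
        sim = (others * ref.unsqueeze(0)).sum(dim=1)   # (T, H, W)
        w = F.softmax(sim, dim=0).unsqueeze(1)         # (T, 1, H, W)
        return (w * feats).sum(dim=0)                  # (C, H, W)

    clip = torch.randn(5, 256, 24, 40)
    enhanced = aggregate_nearby_frames(clip, t=2)
    print(enhanced.shape)  # torch.Size([256, 24, 40])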
Funding
  • With the temporal attention module, we improve the baseline model by 1.1% in AP and 1.3% in AP0.5, respectively
  • The spatial attention module can also slightly improve the baseline performance, by 1%
  • Compared with attention on only the frame level or only the object level, the combination is superior. Specifically, with attention on both levels, we achieve AP/AP0.5 = 27.5%/46.1%, which outperforms attention on the frame level alone by 1.7%/1.6%. This further improvement shows that by applying the attention module on both the frame level and the object level, we can aggregate context information in a global-to-local manner, which greatly improves the baseline model by 3.4%, 3.5% and 4.0% in AP, AP0.5 and AP0.75
  • The tracking module with the correlation map outperforms the baseline model by more than 1% on AP, AP0.5 and AP0.75
  • By using correlation maps and the frame-level dual attention module, we improve AP from 25.8% to 26.3%
  • We achieve 28.4% and 47.4% on AP and AP0.5 with all proposed modules, which surpasses the baseline model by 4.3% and 4.8%
  • SipMask (Cao et al 2020) shares a similar structure but replaces the instance segmentation branch with a one-stage instance segmentation module. Compared with these methods, our proposed CompFeat achieves the best performance under all evaluation metrics
Reference
  • Bochinski, E.; Eiselein, V.; and Sikora, T. 2017. High-speed tracking-by-detection without using image information. In 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 1–6. IEEE.
  • Cao, J.; Anwer, R. M.; Cholakkal, H.; Khan, F. S.; Pang, Y.; and Shao, L. 2020. SipMask: Spatial Information Preservation for Fast Instance Segmentation. In Proc. European Conference on Computer Vision.
  • Cao, Y.; Xu, J.; Lin, S.; Wei, F.; and Hu, H. 2019. GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond. arXiv preprint arXiv:1904.11492.
  • Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. 2019. MMDetection: Open MMLab Detection Toolbox and Benchmark. arXiv preprint arXiv:1906.07155.
  • Chen, K.; Wang, J.; Yang, S.; Zhang, X.; Xiong, Y.; Change Loy, C.; and Lin, D. 2018. Optimizing video object detection via a scale-time lattice. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7814–7823.
  • Dong, M.; Wang, J.; Huang, Y.; Yu, D.; Su, K.; Zhou, K.; Shao, J.; Wen, S.; and Wang, C. 2019. Temporal Feature Augmented Network for Video Instance Segmentation. In Proceedings of the IEEE International Conference on Computer Vision Workshops.
  • Feichtenhofer, C.; Pinz, A.; and Zisserman, A. 2017. Detect to track and track to detect. In IEEE ICCV.
  • Han, W.; Khorrami, P.; Paine, T. L.; Ramachandran, P.; Babaeizadeh, M.; Shi, H.; Li, J.; Yan, S.; and Huang, T. S. 2016. Seq-NMS for video object detection. arXiv preprint arXiv:1602.08465.
  • Hariharan, B.; Arbelaez, P.; Girshick, R.; and Malik, J. 2014. Simultaneous detection and segmentation. In ECCV.
  • He, K.; Gkioxari, G.; Dollar, P.; and Girshick, R. 2017. Mask R-CNN. In IEEE ICCV.
  • He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In IEEE CVPR, 770–778.
  • Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; and Yan, J. 2019. SiamRPN++: Evolution of siamese visual tracking with very deep networks. In IEEE CVPR.
  • Lin, T.-Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; and Belongie, S. 2017. Feature pyramid networks for object detection. In IEEE CVPR.
  • Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollar, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In ECCV.
  • Liu, M.; Zhu, M.; White, M.; Li, Y.; and Kalenichenko, D. 2019. Looking Fast and Slow: Memory-Guided Mobile Video Object Detection. arXiv preprint arXiv:1903.10172.
  • Luiten, J.; Torr, P.; and Leibe, B. 2019. Video Instance Segmentation 2019: A winning approach for combined Detection, Segmentation, Classification and Tracking. In Proceedings of the IEEE International Conference on Computer Vision Workshops.
  • Oh, S. W.; Lee, J.-Y.; Xu, N.; and Kim, S. J. 2019. Video object segmentation using space-time memory networks. arXiv preprint arXiv:1904.00607.
  • Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision.
  • Sadeghian, A.; Alahi, A.; and Savarese, S. 2017. Tracking the untrackable: Learning to track multiple cues with long-term dependencies. In IEEE ICCV.
  • Shi, H. 2018. Geometry-aware traffic flow analysis by detection and tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 116–120.
  • Son, J.; Baek, M.; Cho, M.; and Han, B. 2017. Multi-object tracking with quadruplet convolutional neural networks. In IEEE CVPR.
  • Valmadre, J.; Bertinetto, L.; Henriques, J.; Vedaldi, A.; and Torr, P. H. 2017. End-to-end representation learning for correlation filter based tracking. In IEEE CVPR.
  • Voigtlaender, P.; Chai, Y.; Schroff, F.; Adam, H.; Leibe, B.; and Chen, L.-C. 2019a. FEELVOS: Fast end-to-end embedding learning for video object segmentation. In IEEE CVPR.
  • Voigtlaender, P.; Krause, M.; Osep, A.; Luiten, J.; Sekar, B. B. G.; Geiger, A.; and Leibe, B. 2019b. MOTS: Multi-object tracking and segmentation. In IEEE CVPR.
  • Voigtlaender, P.; and Leibe, B. 2017. Online adaptation of convolutional neural networks for video object segmentation. arXiv preprint arXiv:1706.09364.
  • Wang, Q.; He, Y.; Yang, X.; Yang, Z.; and Torr, P. 2019. An Empirical Study of Detection-Based Video Instance Segmentation. In Proceedings of the IEEE International Conference on Computer Vision Workshops.
  • Wang, Q.; Teng, Z.; Xing, J.; Gao, J.; Hu, W.; and Maybank, S. 2018a. Learning attentions: Residual attentional siamese network for high performance online visual tracking. In IEEE ICCV.
  • Wang, X.; Girshick, R.; Gupta, A.; and He, K. 2018b. Non-local neural networks. In IEEE CVPR.
  • Wojke, N.; Bewley, A.; and Paulus, D. 2017. Simple online and realtime tracking with a deep association metric. In IEEE International Conference on Image Processing.
  • Wu, C.-Y.; Feichtenhofer, C.; Fan, H.; He, K.; Krahenbuhl, P.; and Girshick, R. 2019. Long-term feature banks for detailed video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 284–293.
  • Wug Oh, S.; Lee, J.-Y.; Sunkavalli, K.; and Joo Kim, S. 2018. Fast video object segmentation by reference-guided mask propagation. In IEEE CVPR.
  • Xu, K.; Wen, L.; Li, G.; Bo, L.; and Huang, Q. 2019. Spatiotemporal CNN for Video Object Segmentation. In IEEE CVPR.
  • Xu, N.; Yang, L.; Fan, Y.; Yue, D.; Liang, Y.; Yang, J.; and Huang, T. 2018. YouTube-VOS: A large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327.
  • Yang, L.; Fan, Y.; and Xu, N. 2019. Video Instance Segmentation. In IEEE ICCV.
  • Yang, L.; Wang, Y.; Xiong, X.; Yang, J.; and Katsaggelos, A. K. 2018. Efficient video object segmentation via network modulation. In IEEE CVPR.
  • Zhang, Z.; and Peng, H. 2019. Deeper and wider siamese networks for real-time visual tracking. In IEEE CVPR.
  • Zhu, X.; Wang, Y.; Dai, J.; Yuan, L.; and Wei, Y. 2017. Flow-guided feature aggregation for video object detection. In IEEE ICCV.
  • Zhu, Z.; Wang, Q.; Li, B.; Wu, W.; Yan, J.; and Hu, W. 2018. Distractor-aware Siamese Networks for Visual Object Tracking. In ECCV.