Efficient Video Object Segmentation via Network Modulation

CVPR, pp. 6499-6507, 2018.

Cited by: 175

Abstract:

Video object segmentation targets segmenting a specific object throughout a video sequence when given only an annotated first frame. Recent deep learning based approaches find it effective to fine-tune a general-purpose segmentation model on the annotated frame using hundreds of iterations of gradient descent. Despite the high accuracy th…

Introduction
  • Semantic segmentation plays an important role in understanding the visual content of an image, as it assigns predefined object or scene labels to each pixel and translates the image into a segmentation map.
  • The authors have witnessed rising interest in developing one-shot learning techniques for video segmentation [2, 23, 35, 22, 32, 4].
  • Most of these works share a similar two-stage paradigm: first, train a general-purpose Fully Convolutional Network (FCN) [31] to segment the foreground object; second, fine-tune this network on the first frame of the video for several hundred forward-backward iterations to adapt the model to the specific video sequence (a sketch of this fine-tuning loop follows this list).
  • Some of these approaches [4, 23] utilize optical flow information, which is computationally heavy for state-of-the-art algorithms [29, 15].
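As context for the fine-tuning step mentioned above, here is a minimal, hypothetical sketch (PyTorch-style Python, not the paper's code) of what those several hundred forward-backward iterations on the annotated first frame look like; the function name, loss, optimizer, and hyper-parameters are illustrative assumptions.

```python
import torch

def one_shot_finetune(fcn, first_frame, first_mask, steps=500, lr=1e-5):
    """Adapt a pretrained segmentation FCN to one video by optimizing it on
    the single annotated first frame. Hypothetical sketch; the optimizer,
    loss, and step count are placeholders, not the paper's settings."""
    optimizer = torch.optim.SGD(fcn.parameters(), lr=lr)
    loss_fn = torch.nn.BCEWithLogitsLoss()
    for _ in range(steps):                # hundreds of forward-backward passes
        optimizer.zero_grad()
        loss = loss_fn(fcn(first_frame), first_mask)  # per-pixel foreground loss
        loss.backward()
        optimizer.step()
    return fcn
```

This per-video optimization is the main cost that the modulation approach described in the Highlights below replaces with a single forward pass.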
Highlights
  • Semantic segmentation plays an important role in understanding the visual content of an image, as it assigns predefined object or scene labels to each pixel and translates the image into a segmentation map.
  • We have witnessed rising interest in developing one-shot learning techniques for video segmentation [2, 23, 35, 22, 32, 4]. Most of these works share a similar two-stage paradigm: first, train a general-purpose Fully Convolutional Network (FCN) [31] to segment the foreground object; second, fine-tune this network on the first frame of the video for several hundred forward-backward iterations to adapt the model to the specific video sequence.
  • We propose a novel framework to perform one-shot video segmentation efficiently.
  • To alleviate the slow speed of one-shot fine-tuning used by previous FCN-based methods, we propose a network modulation approach that mimics the fine-tuning process with one forward pass of the modulator network.
  • We show in experiments that by injecting a limited number of parameters computed by the modulators, the segmentation model can be repurposed to segment an arbitrary object (a code sketch of this modulation follows this list).
  • Another piece of future work would be to learn a recurrent representation of the modulation parameters to manipulate the FCN based on temporal information.
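As a rough illustration of the modulation idea referenced above (a sketch under assumptions, not the paper's released implementation), the snippet below shows how channel-wise scale parameters produced by a modulator network in one forward pass can rescale the feature maps of a fixed segmentation network; the layer sizes, module names, and the stand-in encoder are illustrative.

```python
import torch
import torch.nn as nn

class ModulatedConv(nn.Module):
    """A convolutional layer whose output channels are rescaled by externally
    supplied modulation parameters (one scalar per channel)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x, scale):
        # scale: (N, out_ch); broadcast the per-channel scale over H and W.
        return torch.relu(self.conv(x) * scale[:, :, None, None])

class VisualModulator(nn.Module):
    """Maps the annotated first-frame object crop to modulation parameters for
    the segmentation network in a single forward pass. A tiny stand-in encoder
    is used here in place of the VGG backbone."""
    def __init__(self, num_params):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, num_params)

    def forward(self, object_crop):
        return self.head(self.encoder(object_crop))
```

At test time the modulator runs once per video on the first-frame annotation; the resulting parameters repurpose the fixed segmentation network for that object without any gradient steps.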
Methods
  • Similar objects should have similar modulation parameters, while different objects should have dramatically different modulation parameters.
  • To visualize this embedding, the authors extract modulation parameters from 100 object instances in 10 object classes in MS-COCO and visualize the parameters in a two-dimensional embedding space using multi-dimensional scaling in Fig. 5 (a reproduction sketch follows this list).
  • The authors can see that objects in the same category are mostly clustered together, and similar categories are closer to each other than dissimilar categories.
  • Cats and dogs, as well as cars and buses, are partly mixed up due to their similar appearance.
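A visualization of this kind can be produced with off-the-shelf multi-dimensional scaling; the snippet below is a hedged sketch of the Fig. 5 setup that substitutes random placeholder vectors for the actual modulation parameters extracted from the MS-COCO instances (the vector dimensionality here is an assumption).

```python
import numpy as np
from sklearn.manifold import MDS
import matplotlib.pyplot as plt

# Placeholder data: in the paper, `params` would hold the modulation parameters
# of 100 object instances drawn from 10 MS-COCO classes.
rng = np.random.default_rng(0)
params = rng.normal(size=(100, 512))      # assumed parameter dimensionality
labels = rng.integers(0, 10, size=100)    # class index of each instance

# Embed the parameter vectors into two dimensions with multi-dimensional scaling.
embedding = MDS(n_components=2, random_state=0).fit_transform(params)

plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="tab10", s=15)
plt.title("2-D MDS embedding of modulation parameters")
plt.show()
```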
Results
  • The object is augmented with up to 10% random scaling and 10° random rotation. For preprocessing the spatial guide as input to the spatial modulator, the authors first compute the mean and standard deviation of the mask, and then augment the mask with up to 20% random shift and 40% random scaling (see the sketch after this list).
  • Compared to deep learning approaches that do not use model fine-tuning, and therefore run at a similar speed, the method achieves the best accuracy on both DAVIS 2016 and YoutubeObjects.
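The spatial-guide preprocessing described above can be sketched as follows. This is an assumption-laden illustration: only the 20% shift and 40% scaling ranges come from the text, while the Gaussian rendering and the way the jitter is applied to the mask statistics are guesses.

```python
import numpy as np

def spatial_guide(mask, shift_frac=0.2, scale_frac=0.4, rng=None):
    """Summarize a binary object mask by the mean and standard deviation of its
    pixel coordinates, jitter them for augmentation, and render a Gaussian
    prior map for the spatial modulator. Illustrative sketch only."""
    rng = rng or np.random.default_rng()
    ys, xs = np.nonzero(mask)                      # foreground pixel coordinates
    mean = np.array([ys.mean(), xs.mean()])
    std = np.array([ys.std(), xs.std()]) + 1e-6

    # Up to 20% random shift (relative to object size) and 40% random scaling.
    mean = mean + std * rng.uniform(-shift_frac, shift_frac, size=2)
    std = std * (1.0 + rng.uniform(-scale_frac, scale_frac, size=2))

    # Render the jittered statistics as a 2-D Gaussian over the image grid.
    yy, xx = np.mgrid[:mask.shape[0], :mask.shape[1]]
    return np.exp(-0.5 * (((yy - mean[0]) / std[0]) ** 2 +
                          ((xx - mean[1]) / std[1]) ** 2))
```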
Conclusion
  • The authors propose a novel framework to perform one-shot video segmentation efficiently.
  • The authors' approach falls into the general category of meta-learning, and it would be worthwhile to investigate other meta-learning approaches for video segmentation.
  • Another piece of future work would be to learn a recurrent representation of the modulation parameters to manipulate the FCN based on temporal information.
Tables
  • Table 1: Performance comparison of our approach with recent approaches on DAVIS 2016 and YoutubeObjects. Performance is measured in mean IU.
  • Table 2: Comparison of our approach with two state-of-the-art algorithms on the DAVIS 2017 validation set.
  • Table 3: Ablation study of our method on DAVIS 2017.
Related work
  • Semi-supervised video segmentation. Semi-supervised video object segmentation aims at tracking an object mask from the first annotated frame throughout the rest of the video. Many approaches have been proposed in the recent literature, including those propagating superpixels [17, 35], patches [9], or object proposals [25], or operating in bilateral space [22]; graphical-model-based optimization is usually performed to consider multiple frames simultaneously. With the success of FCNs on static image segmentation [12], deep learning based methods [23, 2, 32, 34, 18, 4] have recently been proposed for video segmentation and promising results have been achieved. To model temporal motion information, some works rely heavily on optical flow [34, 4], use CNNs to learn mask refinement of an object from the current frame to the next [23], or combine the training of a CNN with bilateral filtering between adjacent frames [18]. Chen et al. [4] use a CNN to jointly estimate the optical flow and provide the learned motion representation to generate motion-consistent segmentation across time. Different from these approaches, Caelles et al. [2] combine an offline and online training process on static images without using temporal information. While this saves the computation of optical flow and/or conditional random fields (CRF) [19] involved in some previous methods, online fine-tuning still requires many iterations of optimization, which poses a challenge for real-world applications that need rapid inference.
References
  • M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
  • S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool. One-shot video object segmentation. In CVPR, 2017.
  • Y. Chen, M. W. Hoffman, S. G. Colmenarejo, M. Denil, T. P. Lillicrap, M. Botvinick, and N. de Freitas. Learning to learn without gradient descent by gradient descent. In ICML, 2016.
  • J. Cheng, Y.-H. Tsai, S. Wang, and M.-H. Yang. SegFlow: Joint learning for video object segmentation and optical flow. In ICCV, 2017.
  • J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In ICCV, 2017.
  • H. de Vries, F. Strub, J. Mary, H. Larochelle, O. Pietquin, and A. C. Courville. Modulating early visual processing by language. CoRR, abs/1707.00683, 2017.
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  • V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. In ICLR, 2017.
  • Q. Fan, F. Zhong, D. Lischinski, D. Cohen-Or, and B. Chen. JumpCut: Non-successive mask transfer and interpolation for video cutout. ACM Trans. Graph., 34(6), 2015.
  • C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
  • G. Ghiasi, H. Lee, M. Kudlur, V. Dumoulin, and J. Shlens. Exploring the structure of a real-time, arbitrary neural artistic stylization network. In BMVC, 2017.
  • B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
  • B. Hariharan and R. Girshick. Low-shot visual recognition by shrinking and hallucinating features. In ICCV, 2017.
  • X. Huang and S. J. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. CoRR, abs/1703.06868, 2017.
  • E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In CVPR, 2017.
  • M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In NIPS, pages 2017–2025, 2015.
  • S. D. Jain and K. Grauman. Supervoxel-consistent foreground propagation in video. In ECCV, 2014.
  • V. Jampani, R. Gadde, and P. V. Gehler. Video propagation networks. In CVPR, 2017.
  • P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, pages 109–117, 2011.
  • F. Li, T. Kim, A. Humayun, D. Tsai, and J. M. Rehg. Video segmentation by tracking many figure-ground segments. In ICCV, 2013.
  • T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • N. Märki, F. Perazzi, O. Wang, and A. Sorkine-Hornung. Bilateral space video segmentation. In CVPR, 2016.
  • F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and A. Sorkine-Hornung. Learning video object segmentation from static images. In CVPR, 2017.
  • F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In CVPR, 2016.
  • F. Perazzi, O. Wang, M. Gross, and A. Sorkine-Hornung. Fully connected object proposals for video segmentation. In ICCV, 2015.
  • E. Perez, H. de Vries, F. Strub, V. Dumoulin, and A. C. Courville. Learning visual reasoning without strong priors. CoRR, abs/1707.03017, 2017.
  • J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool. The 2017 DAVIS challenge on video object segmentation. arXiv:1704.00675, 2017.
  • S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.
  • J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid. EpicFlow: Edge-preserving interpolation of correspondences for optical flow. In CVPR, 2015.
  • A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. Meta-learning with memory-augmented neural networks. In ICML, 2016.
  • E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):640–651, 2017.
  • J. Shin Yoon, F. Rameau, J. Kim, S. Lee, S. Shin, and I. So Kweon. Pixel-level matching for video object segmentation using convolutional neural networks. In CVPR, 2017.
  • K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • P. Tokmakov, K. Alahari, and C. Schmid. Learning video object segmentation with visual memory. In ICCV, 2017.
  • Y.-H. Tsai, M.-H. Yang, and M. J. Black. Video segmentation via object flow. In CVPR, 2016.
  • O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching networks for one shot learning. In NIPS, 2016.
  • M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.