Clockwork Convnets for Video Semantic Segmentation

ECCV Workshops, pp. 852-868, 2016.

DOI: https://doi.org/10.1007/978-3-319-49409-8_69

Abstract:

Recent years have seen tremendous progress in still-image segmentation; however the naive application of these state-of-the-art algorithms to every video frame requires considerable computation and ignores the temporal continuity inherent in video. We propose a video recognition framework that relies on two key observations: (1) while pix…

Introduction
  • Semantic segmentation is a central visual recognition task. End-to-end convolutional network approaches have made progress on the accuracy and execution time of still-image semantic segmentation, but video semantic segmentation has received less attention.
  • Fully convolutional networks (FCNs) [1,2,3] have been shown to obtain remarkable results, but the execution time of repeated per-frame processing limits application to video.
  • Adapting these networks to make use of the temporal continuity of video reduces inference computation while suffering minimal loss in recognition accuracy.
  • The execution of a stage on a given frame is determined by either a fixed clock rate (“fixed-rate”) or a data-driven, adaptive clock (“adaptive”), as sketched below.
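A minimal Python sketch of that scheduling logic follows; the stage split and the `run_stage`, `fuse_outputs`, and `change_score` callables are hypothetical placeholders standing in for the FCN stages and skip fusion, not the authors' released implementation.

```python
def clockwork_segment(frames, run_stage, fuse_outputs, schedule="fixed",
                      period=2, threshold=0.25, change_score=None):
    """Segment a frame sequence, re-running the deep stages only when the clock fires."""
    cached_deep = None       # persisted deep-stage features/scores
    prev_shallow = None      # previous frame's cheap early-stage output
    outputs = []
    for t, frame in enumerate(frames):
        shallow = run_stage("shallow", frame)          # always executed, cheap
        if schedule == "fixed":
            fire = (t % period == 0)                   # fixed-rate clock
        else:                                          # data-driven (adaptive) clock
            fire = (prev_shallow is None or
                    change_score(shallow, prev_shallow) > threshold)
        if fire or cached_deep is None:
            cached_deep = run_stage("deep", shallow)   # full execution of the later stages
        outputs.append(fuse_outputs(shallow, cached_deep))  # combine new shallow + cached deep
        prev_shallow = shallow
    return outputs
```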
Highlights
  • Semantic segmentation is a central visual recognition task
  • We propose that network execution can be viewed as an aspect of architecture and define the “clockwork” fully convolutional network (FCN) (cf. clockwork recurrent networks [4])
  • In our experiments we report two common metrics for semantic segmentation that measure the region intersection over union (IU)
  • We evaluate our clockwork FCN on four video semantic segmentation datasets
  • We select the 736-image subset of the PASCAL VOC 2011 segmentation validation set used for FCN-8s validation in [1]
Methods
  • The authors' adaptive clock updates the full network on only 26% of the frames, determined by the threshold θ = 0.25 on the proportional output label change across frames, while scheduling updates based on pixel difference alone results in updating 90% of the frames.
  • While the pixel-difference metric is susceptible to minor changes in image statistics from frame to frame, resulting in very frequent updates, the method only updates during periods of semantic change and can cache deep features with minimal loss in segmentation accuracy (compare adaptive clock segmentations to ground truth)
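Read concretely, the θ = 0.25 threshold above is the fraction of pixels whose predicted label flips between consecutive cheap (early-stage) score maps. A small NumPy sketch of that change metric, assuming score arrays of shape (num_classes, H, W):

```python
import numpy as np

def label_change(scores_t, scores_prev):
    """Proportion of pixels whose predicted label changed between frames.

    scores_* are per-class score maps of shape (num_classes, H, W); the clock
    fires a full update when this proportion exceeds a threshold such as 0.25,
    and otherwise cached deep features are reused.
    """
    labels_t = scores_t.argmax(axis=0)
    labels_prev = scores_prev.argmax(axis=0)
    return float((labels_t != labels_prev).mean())

# Example: fire the adaptive clock only on sufficient semantic change.
# fire = label_change(current_scores, previous_scores) > 0.25
```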
Results
  • The authors' base network is FCN-8s, the fully convolutional network of [1]. The architecture is adapted from the VGG16 architecture [12] and fine-tuned from ILSVRC pre-training.
  • The authors evaluate the clockwork FCN on four video semantic segmentation datasets.
  • Synthetic sequences of translated scenes: the authors first validate the method by evaluating on synthetic videos of moving crops of PASCAL VOC images [5] in order to score on a ground truth annotation at every frame.
  • The authors select the 736-image subset of the PASCAL VOC 2011 segmentation validation set used for FCN-8s validation in [1].
  • “Fast” and “slow” videos are made with 32-pixel and 16-pixel frame-to-frame displacements, respectively
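These translated sequences can be approximated by sliding a fixed crop window across a still image and its label map; a rough sketch under an assumed crop size and horizontal-only motion, not the authors' exact generation code:

```python
import numpy as np

def translated_sequence(image, num_frames, step, crop_hw=(256, 256)):
    """Synthesize a 'video' by translating a crop window across a still image.

    image: array of shape (H, W, C); step: per-frame displacement in pixels,
    e.g. 16 for a 'slow' and 32 for a 'fast' sequence. Applying the same
    offsets to the label map yields per-frame ground truth.
    """
    ch, cw = crop_hw
    frames = []
    for t in range(num_frames):
        x = min(t * step, image.shape[1] - cw)   # clamp the window to the image
        frames.append(image[:ch, x:x + cw])
    return frames
```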
Conclusion
  • Generalized clockwork architectures encompass many kinds of temporal networks, and incorporating execution into the architecture opens up many strategies for scheduling computation.
  • With data-driven clock rates, the network is scheduled online to segment dynamic and static scenes alike while maintaining accuracy.
  • In this way the adaptive clockwork network is a bridge between convnets and event-driven vision architectures.
  • The clockwork perspective on temporal networks suggests further architectural variations for spatiotemporal video processing
Tables
  • Table1: The average temporal difference over all YouTube-Objects videos of the respective pixelwise class score outputs from a spectrum of network layers. The deeper layers are more stable across frames – that is, we observe supervised convnet features to be “slow” features [33]. The temporal difference is measured as the proportion of label changes in the output. The layer depth counts the distance from the input in the number of parametric and non-linear layers. Semantic accuracy is the intersection-over-union metric on PASCAL VOC of our frame processing network fine-tuned for separate output predictions (Section 5)
  • Table2: The standard recurrent network (SRN), clockwork recurrent network (clock
  • Table3: Pipelined segmentation of translated PASCAL sequences. Synthesized video of translating PASCAL scenes allows for assessment of the pipeline at every frame. The pipelined FCN segments with higher accuracy in the same time envelope as the every-other-frame evaluation of the full FCN. Metrics are computed on the standard masks and a 10-pixel band at boundaries
  • Table4: Pipelined execution of semantic segmentation on three different datasets. Inference approaches include pipelines of different lengths and a full FCN frame oracle. We also show baselines with comparable latency to the pipeline architectures. Our pipelined network offers the best accuracy of computationally comparable approaches running near frame rate. The loss in accuracy relative to the frame oracle is less than the relative speed-up
  • Table5: Fixed-rate segmentation of translated PASCAL sequences. We evaluate the network on synthesized video of translating PASCAL scenes to assess the effect of persisting layer features across frames. Metrics are computed on the standard masks and a 10-pixel band at boundaries
  • Table6: Fixed-rate and adaptive clockwork FCN evaluation. We score our network on three datasets with an alternating schedule that executes the later stage every other frame and an adaptive schedule that executes according to a frame-by-frame threshold on the difference in output. The adaptive threshold is tuned to execute the full network on 50% of frames to equalize computation between the alternating and adaptive schedules
Related work
  • We extend fully convolutional networks for image semantic segmentation to video semantic segmentation. Convnets have been applied to video to learn spatiotemporal representations for classification and detection but rarely for dense pixelwise, frame-by-frame inference. Practicality requires network acceleration, but generic techniques do not exploit the structure of video. There is a large body of work on video segmentation, but the focus has not been on semantic segmentation, nor are methods computationally feasible beyond short video shots.

    Fully Convolutional Networks A fully convolutional network (FCN) is a model designed for pixelwise prediction [1]. Every layer in an FCN computes a local operation, such as convolution or pooling, on relative spatial coordinates. This locality makes the network capable of handling inputs of any size while producing output of corresponding dimensions. Efficiency is preserved by computing single, dense forward inference and backward learning passes. Current classification architectures – AlexNet [10], GoogLeNet [11], and VGG [12] – can be cast into corresponding fully convolutional forms. These networks are learned end-to-end, are fast at inference and learning time, and can be generalized with respect to different image-to-image tasks. FCNs yield state-of-the-art results for semantic segmentation [1], boundary prediction [13], and monocular depth estimation [2]. While these tasks process each image in isolation, FCNs extend to video. As more and more visual data is captured as video, the baseline efficiency of fully convolutional computation will not suffice.
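As a concrete illustration of that casting, the sketch below rewrites VGG-16's fully connected layers as convolutions so the net accepts arbitrary input sizes and emits a coarse score map. This is a hedged PyTorch sketch of the general convolutionalization idea, omitting FCN-8s's upsampling and skip-layer fusion, and is not the authors' Caffe implementation.

```python
import torch.nn as nn
from torchvision import models

def convolutionalize_vgg16(num_classes):
    """Cast VGG-16 into fully convolutional form: fc6/fc7/fc8 become 7x7 and
    1x1 convolutions, so the net maps an image of any size to a score map."""
    vgg = models.vgg16(weights=None)
    features = vgg.features                 # conv/pool layers kept as-is
    head = nn.Sequential(
        nn.Conv2d(512, 4096, kernel_size=7), nn.ReLU(inplace=True), nn.Dropout2d(),
        nn.Conv2d(4096, 4096, kernel_size=1), nn.ReLU(inplace=True), nn.Dropout2d(),
        nn.Conv2d(4096, num_classes, kernel_size=1),   # per-class, per-location scores
    )
    return nn.Sequential(features, head)
```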
Funding
  • Our 3-stage pipeline FCN reduces latency by 59%
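The latency saving comes from pipelining the stages across frames: at each time step only one stage's worth of new work is done on the current frame while later stages consume features from preceding frames, so the output at frame t fuses shallow features of frame t with deeper features propagated from frames t-1 and t-2. A minimal sequential sketch under that assumption (in deployment the stage calls would run in parallel); the stage functions and `fuse` are placeholders:

```python
def pipelined_segment(frames, stages, fuse):
    """3-stage pipelined inference: per-frame cost is roughly one stage, and the
    output at frame t combines stage-1 features of frame t with stage-2/3
    features propagated from frames t-1 and t-2."""
    s1_prev = s2_prev = None
    outputs = []
    for frame in frames:
        s1 = stages[0](frame)                                    # stage 1 on the newest frame
        s2 = stages[1](s1_prev if s1_prev is not None else s1)   # stage 2 on frame t-1 features
        s3 = stages[2](s2_prev if s2_prev is not None else s2)   # stage 3 on frame t-2 features
        outputs.append(fuse(s1, s2, s3))                         # skip-style fusion of all stages
        s1_prev, s2_prev = s1, s2
    return outputs
```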
Study subjects and analysis
video semantic segmentation datasets: 4
The IU metrics take the form n_ii / (t_i + Σ_j n_ji − n_ii) for n_ij the number of pixels of class i predicted to belong to class j, where there are n_cl different classes, and for t_i = Σ_j n_ij the total number of pixels of class i. We evaluate our clockwork FCN on four video semantic segmentation datasets. Synthetic sequences of translated scenes: we first validate our method by evaluating on synthetic videos of moving crops of PASCAL VOC images [5] in order to score on a ground truth annotation at every frame
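Given a confusion matrix N with N[i, j] = n_ij, the IU-based scores can be computed as in the short NumPy sketch below; the mean and frequency-weighted variants shown are an assumption about which two IU metrics are reported, and the helper name is hypothetical.

```python
import numpy as np

def iu_metrics(confusion):
    """confusion[i, j] = n_ij, pixels of class i predicted as class j."""
    n_ii = np.diag(confusion).astype(float)
    t_i = confusion.sum(axis=1).astype(float)        # total pixels of class i
    denom = t_i + confusion.sum(axis=0) - n_ii       # t_i + sum_j n_ji - n_ii
    iu = n_ii / np.maximum(denom, 1)                 # per-class intersection over union
    mean_iu = iu.mean()
    freq_weighted_iu = (t_i * iu).sum() / t_i.sum()  # weight classes by pixel frequency
    return mean_iu, freq_weighted_iu
```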

datasets with an alternating schedule that executes the later stage every other frame: 3
Fixed-rate segmentation of translated PASCAL sequences. We evaluate the network on synthesized video of translating PASCAL scenes to assess the effect of persisting layer features across frames. Metrics are computed on the standard masks and a 10-pixel band at boundaries. Fixed-rate and adaptive clockwork FCN evaluation. We score our network on three datasets with an alternating schedule that executes the later stage every other frame and an adaptive schedule that executes according to a frame-by-frame threshold on the difference in output. The adaptive threshold is tuned to execute the full network on 50% of frames to equalize computation between the alternating and adaptive schedules. Our adaptive clockwork method illustrated with the famous The Horse in Motion [9], captured by Eadweard Muybridge in 1878 at the Palo Alto racetrack. The clock controls network execution: past the first stage, computation is scheduled only at the time points indicated by the clock symbol. During static scenes cached representations persist, while during dynamic scenes new computations are scheduled and output is combined with cached representations

Reference
  • 1. Shelhamer, E., Long, J., Darrell, T.: Fully convolutional networks for semantic segmentation. PAMI (2016)
  • 2. Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: ICCV. (2015) 2650–2658
  • 3. Fischer, P., Dosovitskiy, A., Ilg, E., Hausser, P., Hazırbaş, C., Golkov, V., van der Smagt, P., Cremers, D., Brox, T.: Learning optical flow with convolutional networks. In: ICCV. (2015)
  • 4. Koutník, J., Greff, K., Gomez, F., Schmidhuber, J.: A Clockwork RNN. In: ICML. (2014)
  • 5. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL visual object classes (VOC) challenge. IJCV 88(2) (June 2010) 303–338
  • 6. Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. In: ECCV. (2012)
  • 7. Prest, A., Leistner, C., Civera, J., Schmid, C., Ferrari, V.: Learning object class detectors from weakly annotated video. In: CVPR. (2012) 3282–3289
  • 8. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: CVPR. (2016)
  • 9. Muybridge, E.: The horse in motion. Library of Congress Prints and Photographs Division (1882)
  • 10. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS. (2012)
  • 11. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: CVPR. (2015)
  • 12. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR. (2015)
  • 13. Xie, S., Tu, Z.: Holistically-nested edge detection. In: ICCV. (2015)
  • 14. Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. PAMI 35(1) (2013) 221–231
  • 15. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: CVPR. (2014) 1725–1732
  • 16. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8) (1997) 1735–1780
  • 17. Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. In: CVPR. (2015) 2625–2634
  • 18. Laptev, I.: On space-time interest points. IJCV 64(2-3) (2005) 107–123
  • 19. Yeung, S., Russakovsky, O., Mori, G., Fei-Fei, L.: End-to-end learning of action detection from frame glimpses in videos. In: CVPR. (2016)
  • 20. He, K., Sun, J.: Convolutional neural networks at constrained time cost. In: CVPR. (2015)
  • 21. Vanhoucke, V., Senior, A., Mao, M.Z.: Improving the speed of neural networks on CPUs. In: Deep Learning and Unsupervised Feature Learning NIPS Workshop. Volume 1. (2011)
  • 22. Denton, E.L., Zaremba, W., Bruna, J., LeCun, Y., Fergus, R.: Exploiting linear structure within convolutional networks for efficient evaluation. In: NIPS. (2014) 1269–1277
  • 23. Jaderberg, M., Vedaldi, A., Zisserman, A.: Speeding up convolutional neural networks with low rank expansions. In: BMVC. (2014)
  • 24. Yang, Z., Moczulski, M., Denil, M., de Freitas, N., Smola, A., Song, L., Wang, Z.: Deep fried convnets. In: ICCV. (2015)
  • 25. Grundmann, M., Kwatra, V., Han, M., Essa, I.: Efficient hierarchical graph-based video segmentation. In: CVPR. (2010) 2141–2148
  • 26. Xu, C., Corso, J.J.: Evaluation of super-voxel methods for early video processing. In: CVPR. (2012) 1202–1209
  • 27. (1998) 1154–1160
  • 28. (December 2013)
  • 29. Fragkiadaki, K., Arbelaez, P., Felsen, P., Malik, J.: Learning to segment moving objects in videos. In: CVPR. (June 2015)
  • 30. Hartmann, G., Grundmann, M., Hoffman, J., Tsai, D., Kwatra, V., Madani, O., Vijayanarasimhan, S., Essa, I., Rehg, J., Sukthankar, R.: Weakly supervised learning of object segmentations from web-scale video. In: ECCV Workshops. (2012) 198–208
  • 31. Tang, K., Sukthankar, R., Yagnik, J., Fei-Fei, L.: Discriminative segment annotation in weakly labeled video. In: CVPR. (2013) 2483–2490
  • 32. Liu, X., Tao, D., Song, M., Ruan, Y., Chen, C., Bu, J.: Weakly supervised multiclass video segmentation. In: CVPR. (2014) 57–64
  • 33. Wiskott, L., Sejnowski, T.J.: Slow feature analysis: Unsupervised learning of invariances. Neural Computation 14(4) (2002) 715–770
  • 34. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In: ICLR. (2016)