To deal with a varying number of output labels per image, we develop a predictive model in the space of fixed-sized convolutional features of the Mask R-CNN instance segmentation model.

Predicting Future Instance Segmentations by Forecasting Convolutional Features.

ECCV, (2018): 593-608

Cited by: 51 | Views: 157


Anticipating future events is an important prerequisite towards intelligent behavior. Video forecasting has been studied as a proxy task towards this goal. Recent work has shown that to predict semantic segmentation of future frames, forecasting at the semantic level is more effective than forecasting RGB frames and then segmenting these.



  • The ability to anticipate future events is a key factor towards developing intelligent behavior [2].
  • Predictive models have important applications in decision-making contexts, such as autonomous driving, where rapid control decisions can be of vital importance [7,8].
  • In such contexts, the goal is not to predict the raw RGB values of future video frames, but to make predictions about future video frames at a semantically meaningful level, e.g. in terms of presence and location of object categories in a scene.
  • Luc et al. [1] recently showed that for prediction of
  • Video prediction has been studied as a proxy task towards pursuing this ability, which can capitalize on the huge amount of available unlabeled video to learn visual representations that account for object interactions and interactions between objects and the environment [3]
  • Our work demonstrates the scalability of convolutional neural network (CNN) feature prediction, from 4K-dimensional to 32M-dimensional features, and yields results with a surprising level of accuracy and spatial detail
  • We introduced a new anticipated recognition task: predicting instance segmentation of future video frames
  • We predict the internal “backbone” features which are of fixed dimension, and apply the “detection head” on these features to produce a variable number of predictions
  • Our results show that future instance segmentation can be predicted much better than naively copying the segmentations from the last observed frame, and that our future feature prediction approach significantly outperforms two strong baselines, the first one relying on optical-flow-based warping and the second on repurposing and fine-tuning the Mask R-CNN architecture for the task
  • The authors use the Cityscapes dataset [25], which contains 2,975 training, 500 validation and 1,525 test video sequences of 1.8 seconds each, recorded from a car driving in urban environments.
  • Ground truth semantic and instance segmentation annotations are available for the 20th frame of each sequence.
  • The authors employ a Mask R-CNN model pre-trained on the MS-COCO dataset [26] and fine-tune it in an end-to-end fashion on the Cityscapes dataset, using a ResNet-50-FPN backbone.
  • The coarsest FPN level P5 has resolution 32×64, and the finest level P2 has resolution 256×512.
  • The Mask H2F baseline frequently predicts several masks around objects, especially for objects with ambiguous trajectories, like pedestrians, and less so for more predictable categories like cars
  • The authors speculate that this is due to the loss that the network is optimizing, which does not discourage this behavior, and due to which the network is learning to predict several plausible future positions, as long as they overlap sufficiently with the ground-truth position.
  • The predicted masks are much more precise than those of the S2S model, which is not instance-aware
  • Certain motions and shape transformations are hard to predict accurately due to the inherent ambiguity in the problem
  • This is, e.g., the case for the legs of pedestrians in Fig. 7(b), for which there is a high degree of uncertainty on the exact pose.
  • Since the model is deterministic, it predicts a rough mask due to averaging over several possibilities
  • This may be addressed by modeling the intrinsic variability using GANs, VAEs, or autoregressive models [6,32,33].
  • The authors introduced a new anticipated recognition task: predicting instance segmentation of future video frames.
  • When evaluated on the more basic task of semantic segmentation without instance-level detail, the approach yields performance quantitatively comparable to earlier approaches, while having qualitative advantages
  • Table 1: Ablation study of short-term prediction on the Cityscapes validation set.
  • Table 2: Instance segmentation accuracy on the Cityscapes validation set. * Separate models were trained for short-term and mid-term predictions.
  • Table 3: Short-term and mid-term semantic segmentation performance on moving objects (8 classes) on the Cityscapes validation set. * Separate models were trained for short-term and mid-term predictions.
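
The FPN resolutions quoted above (P5 at 32×64, P2 at 256×512) and the "32M-dimensional" feature size can be checked with a little arithmetic. The sketch below is only illustrative and rests on assumptions that are not stated in this summary: 1024×2048 input frames, the usual FPN strides of 4, 8, 16 and 32 for P2 to P5, and 256 channels per level. The function names are hypothetical helpers, not part of any Mask R-CNN code base.

```python
# Assumptions (not given in the summary above): 1024x2048 input frames,
# FPN strides 4/8/16/32 for levels P2..P5, and 256 channels per level.
IMAGE_HW = (1024, 2048)
FPN_STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}
FPN_CHANNELS = 256


def fpn_level_shape(level, image_hw=IMAGE_HW, channels=FPN_CHANNELS):
    """Return the (channels, height, width) shape of one FPN level."""
    stride = FPN_STRIDES[level]
    height, width = image_hw
    return (channels, height // stride, width // stride)


def feature_dim(shape):
    """Return the total number of values in a feature map of this shape."""
    channels, height, width = shape
    return channels * height * width


if __name__ == "__main__":
    for level in FPN_STRIDES:
        shape = fpn_level_shape(level)
        print(level, shape, feature_dim(shape))
```

Under these assumptions, P2 holds 256 × 256 × 512 = 33,554,432 values, which is consistent with the "32M-dimensional" figure, and every level has a fixed shape regardless of how many objects appear in the frame.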
Related work
  • Future video prediction. Predictive modeling of future RGB video frames has recently been studied using a variety of techniques, including autoregressive models [6], adversarial training [3], and recurrent networks [4,5,11]. Villegas et al. [12] predict future human poses as a proxy to guide the prediction of future RGB video frames. Instead of predicting RGB values, Walker et al. [13] predict future pixel trajectories from static images.

    Future prediction of more abstract representations has been considered in a variety of contexts in the past. Lan et al. [14] predict future human actions from automatically detected atomic actions. Kitani et al. [15] predict future trajectories of people from semantic segmentation of an observed video frame, modeling potential destinations and transitory areas that are preferred or avoided. Lee et al. [16] predict future object trajectories from past object tracks and object interactions. Dosovitskiy & Koltun [17] learn control models by predicting future high-level measurements in which the goal of an agent can be expressed from past video frames and measurements.
  • This work has been partially supported by the grant ANR-16-CE23-0006 “Deep in France” and LabEx PERSYVAL-Lab (ANR-11-LABX-0025-01).
  • 1. Luc, P., Neverova, N., Couprie, C., Verbeek, J., LeCun, Y.: Predicting deeper into the future of semantic segmentation. In: ICCV. (2017)
  • 2. Sutton, R., Barto, A.: Reinforcement learning: An introduction. MIT Press (1998)
  • 3. Mathieu, M., Couprie, C., LeCun, Y.: Deep multi-scale video prediction beyond mean square error. In: ICLR. (2016)
  • 4. Ranzato, M., Szlam, A., Bruna, J., Mathieu, M., Collobert, R., Chopra, S.: Video (language) modeling: a baseline for generative models of natural videos. arXiv 1412.6604 (2014)
  • 5. Srivastava, N., Mansimov, E., Salakhutdinov, R.: Unsupervised learning of video representations using LSTMs. In: ICML. (2015)
  • 6. Kalchbrenner, N., van den Oord, A., Simonyan, K., Danihelka, I., Vinyals, O., Graves, A., Kavukcuoglu, K.: Video pixel networks. In: ICML. (2017)
  • 7. Shalev-Shwartz, S., Ben-Zrihem, N., Cohen, A., Shashua, A.: Long-term planning by short-term prediction. arXiv 1602.01580 (2016)
  • 8. Shalev-Shwartz, S., Shashua, A.: On the sample complexity of end-to-end training vs. semantic abstraction training. arXiv 1604.06915 (2016)
  • 9. He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask R-CNN. In: ICCV. (2017)
  • 10. Kokkinos, I.: Ubernet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In: CVPR. (2017)
  • 11. Villegas, R., Yang, J., Hong, S., Lin, X., Lee, H.: Decomposing motion and content for natural video sequence prediction. In: ICLR. (2017)
  • 12. Villegas, R., Yang, J., Zou, Y., Sohn, S., Lin, X., Lee, H.: Learning to generate long-term future via hierarchical prediction. In: ICML. (2017)
  • 13. Walker, J., Doersch, C., Gupta, A., Hebert, M.: An uncertain future: Forecasting from static images using variational autoencoders. In: ECCV. (2016)
  • 14. Lan, T., Chen, T.C., Savarese, S.: A hierarchical representation for future action prediction. In: ECCV. (2014)
  • 15. Kitani, K., Ziebart, B., Bagnell, J., Hebert, M.: Activity forecasting. In: ECCV. (2012)
  • 16. Lee, N., Choi, W., Vernaza, P., Choy, C., Torr, P., Chandraker, M.: DESIRE: distant future prediction in dynamic scenes with interacting agents. In: CVPR. (2017)
  • 17. Dosovitskiy, A., Koltun, V.: Learning to act by predicting the future. In: ICLR. (2017)
  • 18. Vondrick, C., Pirsiavash, H., Torralba, A.: Anticipating the future by watching unlabeled video. In: CVPR. (2016)
  • 19. Jin, X., Xiao, H., Shen, X., Yang, J., Lin, Z., Chen, Y., Jie, Z., Feng, J., Yan, S.: Predicting scene parsing and motion dynamics in the future. In: NIPS. (2017)
  • 20. Romera-Paredes, B., Torr, P.: Recurrent instance segmentation. In: ECCV. (2016)
  • 21. Bai, M., Urtasun, R.: Deep watershed transform for instance segmentation. In: CVPR. (2017)
  • 22. Pinheiro, P., Lin, T.Y., Collobert, R., Dollar, P.: Learning to refine object segments. In: ECCV. (2016)
  • 23. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NIPS. (2015)
  • 24. Lin, T.Y., Dollar, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: CVPR. (2017)
  • 25. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: CVPR. (2016)
  • 26. Lin, T., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Dollar, P., Zitnick, C.: Microsoft COCO: common objects in context. In: ECCV. (2014)
  • 27. Yang, A., Wright, J., Ma, Y., Sastry, S.: Unsupervised segmentation of natural images via lossy data compression. CVIU 110(2) (2008) 212–225
  • 28. Pantofaru, C., Hebert, M.: A comparison of image segmentation algorithms. Technical Report CMU-RI-TR-05-40, Carnegie Mellon University (2005)
  • 29. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: ICCV. (2001)
  • 30. Meila, M.: Comparing clusterings: An axiomatic view. In: ICML. (2005)
  • 31. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In: ICLR. (2016)
  • 32. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: NIPS. (2014)
  • 33. Kingma, D., Welling, M.: Auto-encoding variational Bayes. In: ICLR. (2014)
  • 34. Gkioxari, G., Malik, J.: Finding action tubes. In: CVPR. (2015)