Predicting Future Instance Segmentations by Forecasting Convolutional Features
ECCV, pp. 593-608, 2018.
Abstract:
Anticipating future events is an important prerequisite towards intelligent behavior. Video forecasting has been studied as a proxy task towards this goal. Recent work has shown that to predict semantic segmentation of future frames, forecasting at the semantic level is more effective than forecasting RGB frames and then segmenting these. […]
Introduction
- The ability to anticipate future events is a key factor towards developing intelligent behavior [2].
- Predictive models have important applications in decision-making contexts, such as autonomous driving, where rapid control decisions can be of vital importance [7,8]
- In such contexts, the goal is not to predict the raw RGB values of future video frames, but to make predictions about future video frames at a semantically meaningful level, e.g. in terms of presence and location of object categories in a scene.
- Luc et al. [1] recently showed that for prediction of future semantic segmentation, forecasting at the semantic level is more effective than forecasting RGB frames and then segmenting these.
Highlights
- The ability to anticipate future events is a key factor towards developing intelligent behavior [2]
- Video prediction has been studied as a proxy task towards pursuing this ability, which can capitalize on the huge amount of available unlabeled video to learn visual representations that account for object interactions and interactions between objects and the environment [3]
- Our work demonstrates the scalability of convolutional neural network (CNN) feature prediction, from 4K-dimensional to 32M-dimensional features, and yields results with a surprising level of accuracy and spatial detail
- We introduced a new anticipated recognition task: predicting instance segmentation of future video frames
- We predict the internal “backbone” features, which are of fixed dimension, and apply the “detection head” on these features to produce a variable number of predictions (a minimal code sketch follows this list)
- Our results show that future instance segmentation can be predicted much better than naively copying the segmentations from the last observed frame, and that our future feature prediction approach significantly outperforms two strong baselines, the first one relying on optical-flow-based warping and the second on repurposing and fine-tuning the Mask R-CNN architecture for the task
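The two-step design described in the highlights above can be summarized in a few lines. The sketch below is a minimal illustration: the module and function names (FeatureForecaster, detection_head), the toy shapes, and the concatenate-and-convolve forecaster are assumptions for exposition, not the authors' released code; it only shows the "forecast fixed-size features, then detect a variable number of instances" pattern.

```python
# Minimal sketch of "forecast backbone features, then run the detection head".
# Module names and shapes are illustrative stand-ins, not the authors' code.
import torch
import torch.nn as nn

class FeatureForecaster(nn.Module):
    """Predicts the next frame's backbone features (fixed dimension)
    from the features of the T observed frames."""
    def __init__(self, channels=256, t_obs=4):
        super().__init__()
        # Concatenate the T past feature maps along channels and
        # regress the future feature map with a small conv net.
        self.net = nn.Sequential(
            nn.Conv2d(channels * t_obs, 256, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, channels, 3, padding=1),
        )

    def forward(self, past_feats):            # list of T tensors (B, C, H, W)
        return self.net(torch.cat(past_feats, dim=1))

def detection_head(features):
    # Placeholder for the Mask R-CNN RPN + RoI heads, which turn
    # fixed-size features into a *variable* number of instances.
    return [("box_0", "mask_0")]              # stands in for per-instance outputs

t_obs, B, C, H, W = 4, 1, 256, 32, 64        # H, W match the coarsest FPN level
past = [torch.randn(B, C, H, W) for _ in range(t_obs)]
future_feat = FeatureForecaster(C, t_obs)(past)   # fixed dimension
instances = detection_head(future_feat)           # variable count
print(future_feat.shape, len(instances))
```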
Methods
- The authors use the Cityscapes dataset [25], which contains 2,975 train, 500 validation, and 1,525 test video sequences of 1.8 seconds each, recorded from a car driving in urban environments.
- Ground-truth semantic and instance segmentation annotations are available for the 20th frame of each sequence.
- The authors employ a Mask R-CNN model pre-trained on the MS-COCO dataset [26] and fine-tune it in an end-to-end fashion on the Cityscapes dataset, using a ResNet-50-FPN backbone.
- The coarsest FPN level P5 has resolution 32×64, and the finest level P2 has resolution 256×512 (a shape check follows this list)
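To make the feature dimensions above concrete, here is a small shape check. The number of observed frames (t_obs = 4) and the loop structure are assumptions for illustration; the paper forecasts each FPN level with its own feature-to-feature subnetwork.

```python
# Shape check for the FPN levels quoted above (ResNet-50-FPN on
# 1024x2048 Cityscapes frames, 256 channels per level).
import torch

fpn_levels = {          # level -> (height, width); channels = 256 everywhere
    "P5": (32, 64),     # coarsest
    "P4": (64, 128),
    "P3": (128, 256),
    "P2": (256, 512),   # finest: 256*256*512 = 2**25 ~ 32M values per frame,
}                       # the "32M-dimensional features" of the highlights

t_obs, channels = 4, 256
for level, (h, w) in fpn_levels.items():
    past_feats = torch.randn(t_obs, channels, h, w)  # observed-frame features
    print(level, tuple(past_feats.shape), "->", channels * h * w, "values/frame")
```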
Results
- The Mask H2F baseline frequently predicts several masks around objects, especially for objects with ambiguous trajectories, like pedestrians, and less so for more predictable categories like cars
- The authors speculate that this behavior stems from the training loss, which does not penalize it: the network learns to predict several plausible future positions, as long as each overlaps sufficiently with the ground-truth position.
- The predicted masks are much more precise than those of the S2S model, which is not instance-aware
Conclusion
- Certain motions and shape transformations are hard to predict accurately due to the inherent ambiguity in the problem
- This is, e.g., the case for the legs of pedestrians in Fig. 7(b), for which there is a high degree of uncertainty on the exact pose.
- Since the model is deterministic, it predicts a rough mask due to averaging over several possibilities
- This may be addressed by modeling the intrinsic variability using GANs, VAEs, or autoregressive models [6,32,33].
- The authors introduced a new anticipated recognition task: predicting instance segmentation of future video frames.
- When evaluated on the more basic task of semantic segmentation without instance-level detail, the approach yields performance quantitatively comparable to earlier approaches, while having qualitative advantages (a conversion sketch follows this list)
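For the semantic segmentation comparison mentioned in the last point, predicted instances must be flattened into a per-pixel class map. The following is a hedged sketch of one such conversion; the confidence-ordered pasting rule and the Cityscapes train ids used in the example (person = 11, car = 13) are assumptions for illustration, not the authors' exact procedure.

```python
# Sketch: flatten per-instance masks into a semantic segmentation map,
# as needed for the moving-object evaluation of Table 3.
import numpy as np

def instances_to_semantic(masks, classes, scores, shape, background=255):
    """masks: list of HxW bool arrays; classes/scores: per-instance class
    ids and confidences; returns an HxW map of class ids."""
    semantic = np.full(shape, background, dtype=np.uint8)
    for i in np.argsort(scores):            # least confident pasted first,
        semantic[masks[i]] = classes[i]     # so confident instances win overlaps
    return semantic

h, w = 256, 512
masks = [np.zeros((h, w), bool) for _ in range(2)]
masks[0][50:100, 60:200] = True             # e.g. a car
masks[1][80:120, 150:260] = True            # an overlapping pedestrian
print(np.unique(instances_to_semantic(masks, classes=[13, 11],
                                      scores=[0.9, 0.6], shape=(h, w))))
```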
Summary
Objectives:
- The authors' goal is to predict instance-level object segmentations for one or more future frames, i.e. for frames whose RGB pixel values are not observed.
Tables
- Table 1: Ablation study: short-term prediction on the Cityscapes validation set
- Table 2: Instance segmentation accuracy on the Cityscapes validation set. *Separate models were trained for short-term and mid-term predictions
- Table 3: Short- and mid-term semantic segmentation performance for moving objects (8 classes) on the Cityscapes validation set. *Separate models were trained for short-term and mid-term predictions
Related work
- Future video prediction. Predictive modeling of future RGB video frames has recently been studied using a variety of techniques, including autoregressive models [6], adversarial training [3], and recurrent networks [4,5,11]. Villegas et al. [12] predict future human poses as a proxy to guide the prediction of future RGB video frames. Instead of predicting RGB values, Walker et al. [13] predict future pixel trajectories from static images.
Future prediction of more abstract representations has been considered in a variety of contexts in the past. Lan et al. [14] predict future human actions from automatically detected atomic actions. Kitani et al. [15] predict future trajectories of people from semantic segmentation of an observed video frame, modeling potential destinations and transitory areas that are preferred or avoided. Lee et al. predict future object trajectories from past object tracks and object interactions [16]. Dosovitskiy & Koltun [17] learn control models by predicting future high-level measurements in which the goal of an agent can be expressed from past video frames and measurements.
Funding
- This work has been partially supported by the grant ANR-16-CE23-0006 “Deep in France” and LabEx PERSYVAL-Lab (ANR-11-LABX-0025-01).
References
- 1. Luc, P., Neverova, N., Couprie, C., Verbeek, J., LeCun, Y.: Predicting deeper into the future of semantic segmentation. In: ICCV. (2017)
- 2. Sutton, R., Barto, A.: Reinforcement learning: An introduction. MIT Press (1998)
- 3. Mathieu, M., Couprie, C., LeCun, Y.: Deep multi-scale video prediction beyond mean square error. In: ICLR. (2016)
- 4. Ranzato, M., Szlam, A., Bruna, J., Mathieu, M., Collobert, R., Chopra, S.: Video (language) modeling: a baseline for generative models of natural videos. arXiv 1412.6604 (2014)
- 5. Srivastava, N., Mansimov, E., Salakhutdinov, R.: Unsupervised learning of video representations using LSTMs. In: ICML. (2015)
- 6. Kalchbrenner, N., van den Oord, A., Simonyan, K., Danihelka, I., Vinyals, O., Graves, A., Kavukcuoglu, K.: Video pixel networks. In: ICML. (2017)
- 7. Shalev-Shwartz, S., Ben-Zrihem, N., Cohen, A., Shashua, A.: Long-term planning by short-term prediction. arXiv 1602.01580 (2016)
- 8. Shalev-Shwartz, S., Shashua, A.: On the sample complexity of end-to-end training vs. semantic abstraction training. arXiv 1604.06915 (2016)
- 9. He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask R-CNN. In: ICCV. (2017)
- 10. Kokkinos, I.: Ubernet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In: CVPR. (2017)
- 11. Villegas, R., Yang, J., Hong, S., Lin, X., Lee, H.: Decomposing motion and content for natural video sequence prediction. In: ICLR. (2017)
- 12. Villegas, R., Yang, J., Zou, Y., Sohn, S., Lin, X., Lee, H.: Learning to generate long-term future via hierarchical prediction. In: ICML. (2017)
- 13. Walker, J., Doersch, C., Gupta, A., Hebert, M.: An uncertain future: Forecasting from static images using variational autoencoders. In: ECCV. (2016)
- 14. Lan, T., Chen, T.C., Savarese, S.: A hierarchical representation for future action prediction. In: ECCV. (2014)
- 15. Kitani, K., Ziebart, B., Bagnell, J., Hebert, M.: Activity forecasting. In: ECCV. (2012)
- 16. Lee, N., Choi, W., Vernaza, P., Choy, C., Torr, P., Chandraker, M.: DESIRE: distant future prediction in dynamic scenes with interacting agents. In: CVPR. (2017)
- 17. Dosovitskiy, A., Koltun, V.: Learning to act by predicting the future. In: ICLR. (2017)
- 18. Vondrick, C., Pirsiavash, H., Torralba, A.: Anticipating the future by watching unlabeled video. In: CVPR. (2016)
- 19. Jin, X., Xiao, H., Shen, X., Yang, J., Lin, Z., Chen, Y., Jie, Z., Feng, J., Yan, S.: Predicting scene parsing and motion dynamics in the future. In: NIPS. (2017)
- 20. Romera-Paredes, B., Torr, P.: Recurrent instance segmentation. In: ECCV. (2016)
- 21. Bai, M., Urtasun, R.: Deep watershed transform for instance segmentation. In: CVPR. (2017)
- 22. Pinheiro, P., Lin, T.Y., Collobert, R., Dollar, P.: Learning to refine object segments. In: ECCV. (2016)
- 23. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NIPS. (2015)
- 24. Lin, T.Y., Dollar, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: CVPR. (2017)
- 25. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: CVPR. (2016)
- 26. Lin, T., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Dollar, P., Zitnick, C.: Microsoft COCO: common objects in context. In: ECCV. (2014)
- 27. Yang, A., Wright, J., Ma, Y., Sastry, S.: Unsupervised segmentation of natural images via lossy data compression. CVIU 110(2) (2008) 212–225
- 28. Pantofaru, C., Hebert, M.: A comparison of image segmentation algorithms. Technical Report CMU-RI-TR-05-40, Carnegie Mellon University (2005)
- 29. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: ICCV. (2001)
- 30. Meila, M.: Comparing clusterings: An axiomatic view. In: ICML. (2005)
- 31. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In: ICLR. (2016)
- 32. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: NIPS. (2014)
- 33. Kingma, D., Welling, M.: Auto-encoding variational Bayes. In: ICLR. (2014)
- 34. Gkioxari, G., Malik, J.: Finding action tubes. In: CVPR. (2015)