# Video Prediction via Example Guidance

ICML, pp. 10628-10637, 2020.

Abstract:

In video prediction tasks, one major challenge is to capture the multi-modal nature of future contents and dynamics. In this work, we propose a simple yet effective framework that can efficiently predict plausible future states. The key insight is that the potential distribution of a sequence could be approximated with analogous ones in...

Introduction

- Video prediction involves accurately generating possible forthcoming frames in a pixel-wise manner given several preceding images as inputs.
- Srivastava et al. (2015) first proposed predicting simple digit motion with deep neural models.
- Variational methods (e.g., SVG (Denton & Fergus, 2018) and SAVP (Lee et al., 2018)) were naturally developed and achieve good performance on simple dynamics such as digit movement (Srivastava et al., 2015) and robot-arm manipulation (Finn et al., 2016).

Highlights

- Video prediction involves accurately generating possible forthcoming frames in a pixel-wise manner given several preceding images as inputs
- Our work bypasses the implicit optimization of latent variables that relies on variational inference; as shown in Fig. 1C, we introduce an explicit distribution target constructed from analogous examples, which is empirically shown to be critical for distribution modelling.
- Our model generalizes to motion classes unseen during testing, which suggests the effectiveness of example guidance.
- One concern is the retrieval time, which strongly affects the efficiency of the whole model.
- The average retrieval complexity is O(N log N), where N is the number of video sequences.
- We present a simple yet effective framework for multi-modal video prediction, which focuses on the capability of multi-modal distribution modelling.
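The retrieval step above can be sketched as a nearest-neighbour search over pooled per-sequence features. The function name and the cosine-similarity choice here are illustrative assumptions, not the paper's exact procedure; what the sketch does show is where the O(N log N) average cost comes from (sorting all N similarity scores).

```python
import numpy as np

def retrieve_examples(query_feat, bank_feats, k=5):
    """Indices of the k training sequences most similar to the query.

    query_feat: (D,) pooled feature of the observed sequence.
    bank_feats: (N, D) array, one feature per training sequence.
    Sorting all N similarity scores is the O(N log N) step noted above.
    """
    q = query_feat / np.linalg.norm(query_feat)
    b = bank_feats / np.linalg.norm(bank_feats, axis=1, keepdims=True)
    scores = b @ q                 # cosine similarity per training sequence
    order = np.argsort(-scores)    # O(N log N) sort over all N sequences
    return order[:k]
```

In practice the sort could be replaced by a partial selection (e.g., `np.argpartition`) to bring the cost down to O(N), but the sorted variant matches the complexity stated in the paper.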

Methods

- Given M consecutive frames as inputs, the goal is to predict the future N frames in a pixel-wise manner.
- Suppose the input sequence X is of length M, i.e., X = {x_t}_{t=1}^{M} ∈ R^{W×H×C×M}, where W, H, and C are the image width, height, and channel count respectively.
- The prediction output Y is of length N, i.e., Y = {y_t}_{t=1}^{N} ∈ R^{W×H×C×N}.
- The authors denote the whole training set as Ds.
- Details are presented in the following subsections.
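For concreteness, the input and output tensors defined above can be laid out as follows; the specific sizes (64×64 RGB, M = 5 context frames, N = 10 predicted frames) are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Illustrative sizes only: 64x64 RGB frames, M = 5 inputs, N = 10 predictions.
W, H, C, M, N = 64, 64, 3, 5, 10

X = np.zeros((W, H, C, M))  # input frames  {x_t}, t = 1..M
Y = np.zeros((W, H, C, N))  # output frames {y_t}, t = 1..N
```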

Conclusion

- Note that the retrieval module introduced in this work is an additional step compared to the majority of previous methods.
- One concern is the retrieval time, which strongly affects the efficiency of the whole model.
- Example-Guided Multi-modal Prediction: in this work, the authors present a simple yet effective framework for multi-modal video prediction, which focuses on the capability of multi-modal distribution modelling.
- The authors first retrieve similar examples from the training set and use these retrieved sequences to explicitly construct a distribution target.
- With the proposed optimization method based on a stochastic process, the model achieves promising performance in both prediction accuracy and visual quality.
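As a minimal sketch of the "explicit distribution target" idea, one can fit a diagonal Gaussian to the latent codes of the K retrieved sequences and use it as the matching target. The paper's actual construction is based on a stochastic process, so this function (and its name) is a simplified stand-in, not the authors' method.

```python
import numpy as np

def distribution_target(retrieved_latents):
    """Fit a diagonal Gaussian to the latents of the K retrieved sequences.

    retrieved_latents: (K, D) array, one latent code per retrieved future.
    Returns a (mean, variance) pair usable as an explicit target, in
    contrast to the implicit prior learned by variational methods.
    """
    mu = retrieved_latents.mean(axis=0)
    var = retrieved_latents.var(axis=0) + 1e-6  # keep variance non-degenerate
    return mu, var
```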

Summary


- Table 1: Prediction accuracy on the Moving MNIST dataset (Srivastava et al., 2015) in terms of PSNR. Mode refers to the experiment setting, i.e., stochastic (S) or deterministic (D). We compare our model with SVG-LP (Denton & Fergus, 2018) and DFN (Jia et al., 2016).
- Table 2: Quantitative evaluation of predicted sequences in terms of Frechet Video Distance (FVD) (Unterthiner et al., 2018) (lower is better) and action recognition accuracy (higher is better). Previous works [1]-[4] refer to (Li et al., 2018; Wichers et al., 2018; Villegas et al., 2017b) respectively. The experiment is conducted on the PennAction dataset (Zhang et al., 2013).
- Table 3: Influence of the example number K, evaluated in terms of PSNR (first row) and SSIM (second row), on the RobotPush dataset (Ebert et al., 2017). Note that each number reported in this table is averaged over the whole predicted sequence.

Related work

- Distribution Modelling with Stochastic Processes. In this field, one major direction is based on the Gaussian process (GP) (Rasmussen & Williams, 2006). Wang et al. (2005) extend the basic GP model with a dynamical formulation, which demonstrates an appealing ability to learn the diversity of human motion. Another promising branch is the determinantal point process (DPP) (Affandi et al., 2014; Elfeki et al., 2019), which promotes diversity in the modelled distribution by incorporating a penalty term during optimization. Recently, the combination of stochastic processes and deep neural networks, e.g., the neural process (Garnelo et al., 2018), has opened a new route toward applying stochastic processes to large-scale data. The neural process (Garnelo et al., 2018) combines the best of both worlds: the data-driven uncertainty modelling of stochastic processes and the end-to-end large-scale training of deep models. Our work, which treads a similar path, focuses on distribution modelling of real-world motion sequences.

Reference

- Affandi, R. H., Fox, E. B., Adams, R. P., and Taskar, B. Learning the parameters of determinantal point process kernels. In ICML, 2014.
- Babaeizadeh, M., Finn, C., Erhan, D., Campbell, R. H., and Levine, S. Stochastic variational video prediction. In ICLR, 2018.
- Byeon, W., Wang, Q., Srivastava, R. K., and Koumoutsakos, P. Contextvp: Fully context-aware video prediction. In ECCV, 2018.
- Denton, E. and Fergus, R. Stochastic video generation with a learned prior. In ICML, 2018.
- Denton, E. L. and Birodkar, V. Unsupervised learning of disentangled representations from video. In NeurIPS, 2017.
- Ebert, F., Finn, C., Lee, A. X., and Levine, S. Selfsupervised visual planning with temporal skip connections. In CoRL, 2017.
- Elfeki, M., Couprie, C., Riviere, M., and Elhoseiny, M. GDPP: learning diverse generations using determinantal point processes. In ICML, 2019.
- Finn, C., Goodfellow, I. J., and Levine, S. Unsupervised learning for physical interaction through video prediction. In NeurIPS, 2016.
- Gao, H., Xu, H., Cai, Q., Wang, R., Yu, F., and Darrell, T. Disentangling propagation and generation for video prediction. CoRR, 2018.
- Garnelo, M., Rosenbaum, D., Maddison, C., Ramalho, T., Saxton, D., Shanahan, M., Teh, Y. W., Rezende, D. J., and Eslami, S. M. A. Conditional neural processes. In ICML, 2018.
- Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. C., and Bengio, Y. Generative adversarial nets. In NeurIPS, 2014.
- Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 1997.
- Jia, X., Brabandere, B. D., Tuytelaars, T., and Gool, L. V. Dynamic filter networks. In NIPS, 2016.
- Kim, Y., Nam, S., Cho, I., and Kim, S. J. Unsupervised keypoint learning for guiding class-conditional video prediction. In NeurIPS, 2019.
- Kingma, D. P. and Welling, M. Auto-encoding variational bayes. In ICLR, 2014.
- Kullback, S. and Leibler, R. A. On information and sufficiency. Ann. Math. Statist., 1951.
- Kumar, M., Babaeizadeh, M., Erhan, D., Finn, C., Levine, S., Dinh, L., and Kingma, D. Videoflow: A conditional flow-based model for stochastic video generation. In ICLR, 2020.
- Kurutach, T., Tamar, A., Yang, G., Russell, S. J., and Abbeel, P. Learning plannable representations with causal infogan. In NeurIPS, 2018.
- Lee, A. X., Zhang, R., Ebert, F., Abbeel, P., Finn, C., and Levine, S. Stochastic adversarial video prediction. CoRR, 2018.
- Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., and Yang, M. Flow-grounded spatial-temporal video prediction from still images. In ECCV, 2018.
- Lotter, W., Kreiman, G., and Cox, D. D. Deep predictive coding networks for video prediction and unsupervised learning. In ICLR, 2017.
- Luc, P., Neverova, N., Couprie, C., Verbeek, J., and LeCun, Y. Predicting deeper into the future of semantic segmentation. In ICCV, 2017.
- Nair, A., Pong, V., Dalal, M., Bahl, S., Lin, S., and Levine, S. Visual reinforcement learning with imagined goals. In NeurIPS, 2018.
- Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-driven exploration by self-supervised prediction. In ICML, 2017.
- Rasmussen, C. E. and Williams, C. K. I. Gaussian processes for machine learning. 2006.
- Shi, X., Chen, Z., Wang, H., Yeung, D., Wong, W., and Woo, W. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NeurIPS, 2015.
- Srivastava, N., Mansimov, E., and Salakhutdinov, R. Unsupervised learning of video representations using lstms. In ICML, 2015.
- Tang, Y. C. and Salakhutdinov, R. Multiple futures prediction. In NIPS, 2019.
- Tulyakov, S., Liu, M., Yang, X., and Kautz, J. Mocogan: Decomposing motion and content for video generation. In CVPR, 2018.
- Unterthiner, T., van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., and Gelly, S. Towards accurate generative models of video: A new metric & challenges. CoRR, 2018.
- Villegas, R., Yang, J., Hong, S., Lin, X., and Lee, H. Decomposing motion and content for natural video sequence prediction. In ICLR, 2017a.
- Villegas, R., Yang, J., Zou, Y., Sohn, S., Lin, X., and Lee, H. Learning to generate long-term future via hierarchical prediction. In ICML, 2017b.
- Villegas, R., Pathak, A., Kannan, H., Erhan, D., Le, Q. V., and Lee, H. High fidelity video prediction with large stochastic recurrent neural networks. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d’Alche-Buc, F., Fox, E. B., and Garnett, R. (eds.), NeurIPS, 2019.
- Wang, J. M., Fleet, D. J., and Hertzmann, A. Gaussian process dynamical models. In NeurIPS, 2005.
- Wang, Y., Long, M., Wang, J., Gao, Z., and Yu, P. S. Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms. In NIPS, 2017.
- Wang, Y., Jiang, L., Yang, M., Li, L., Long, M., and Fei-Fei, L. Eidetic 3d LSTM: A model for video prediction and beyond. In ICLR, 2019.
- Wichers, N., Villegas, R., Erhan, D., and Lee, H. Hierarchical long-term video prediction without supervision. In ICML, 2018.
- Xu, J., Ni, B., Li, Z., Cheng, S., and Yang, X. Structure preserving video prediction. In CVPR, 2018a.
- Xu, J., Ni, B., and Yang, X. Video prediction via selective sampling. In NeurIPS, 2018b.
- Xu, J., Yu, Z., Ni, B., Yang, J., Yang, X., and Zhang, W. Deep kinematics analysis for monocular 3d human pose estimation. In CVPR, 2020.
- Yan, Y., Xu, J., Ni, B., Zhang, W., and Yang, X. Skeleton-aided articulated motion generation. In ACM MM, 2017.
- Ye, Y., Singh, M., Gupta, A., and Tulsiani, S. Compositional video prediction. In ICCV, 2019.
- Zhang, W., Zhu, M., and Derpanis, K. G. From actemes to action: A strongly-supervised representation for detailed action understanding. In ICCV, 2013.
