Video Prediction via Example Guidance

ICML, pp. 10628-10637, 2020.


Abstract:

In video prediction tasks, one major challenge is to capture the multi-modal nature of future contents and dynamics. In this work, we propose a simple yet effective framework that can efficiently predict plausible future states. The key insight is that the potential distribution of a sequence could be approximated with analogous ones in …

Introduction
  • Video prediction involves accurately generating possible forthcoming frames in a pixel-wise manner given several preceding images as inputs.
  • Srivastava et al. (2015) first proposed to predict simple digit motion with deep neural models.
  • Variational methods, e.g., SVG (Denton & Fergus, 2018) and SAVP (Lee et al., 2018), were subsequently developed and achieve good performance on simple dynamics such as moving digits (Srivastava et al., 2015) and robot arm manipulation (Finn et al., 2016); a minimal sketch of this variational family is given after this list.
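To make the variational baseline family concrete, here is a minimal, schematic sketch in the spirit of SVG-LP: a learned prior proposes a latent z_t per step, and a recurrent predictor decodes the next frame conditioned on z_t. All module names and sizes are illustrative assumptions, not the architecture of the cited papers or of this work.

```python
import torch
import torch.nn as nn

class TinySVGLP(nn.Module):
    """Schematic SVG-LP-style predictor (illustrative only)."""
    def __init__(self, frame_dim=128, hid=256, z_dim=32):
        super().__init__()
        self.enc = nn.Linear(frame_dim, hid)         # stand-in frame encoder
        self.prior = nn.LSTMCell(hid, hid)           # learned prior over z_t
        self.prior_head = nn.Linear(hid, 2 * z_dim)  # -> (mu, logvar)
        self.pred = nn.LSTMCell(hid + z_dim, hid)    # recurrent frame predictor
        self.dec = nn.Linear(hid, frame_dim)         # stand-in frame decoder

    def forward(self, frames, n_future):
        # frames: (T, B, frame_dim) flattened context frames
        T, B = frames.size(0), frames.size(1)
        hid = self.prior.hidden_size
        hp, cp = torch.zeros(B, hid), torch.zeros(B, hid)
        hd, cd = torch.zeros(B, hid), torch.zeros(B, hid)
        x, preds = frames[0], []
        for t in range(1, T + n_future):
            h = torch.relu(self.enc(x))
            hp, cp = self.prior(h, (hp, cp))
            mu, logvar = self.prior_head(hp).chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # sample z_t
            hd, cd = self.pred(torch.cat([h, z], dim=-1), (hd, cd))
            x_hat = self.dec(hd)
            x = frames[t] if t < T else x_hat  # teacher forcing, then free-run
            if t >= T:
                preds.append(x_hat)
        return torch.stack(preds)              # (n_future, B, frame_dim)

# Toy usage: 2 context frames, predict 3 future frames for a batch of 4.
model = TinySVGLP()
out = model(torch.randn(2, 4, 128), n_future=3)
print(out.shape)  # torch.Size([3, 4, 128])
```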
Highlights
  • Video prediction involves accurately generating possible forthcoming frames in a pixel-wise manner given several preceding images as inputs
  • Our work bypasses the implicit optimization of latent variables that underlies variational inference; as shown in Fig. 1C, we introduce an explicit distribution target constructed from analogous examples, which are empirically shown to be critical for distribution modelling
  • Our model generalizes to motion classes unseen during training, which suggests the effectiveness of example guidance
  • One concern is retrieval time, which directly affects the efficiency of the whole model
  • The average retrieval complexity is O(N log N), where N is the number of video sequences; a minimal retrieval sketch is given after this list
  • We present a simple yet effective framework for multi-modal video prediction, which mainly focuses on the capability of multi-modal distribution modelling
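To illustrate how retrieval at roughly O(N log N) average cost could be implemented, the sketch below encodes each training sequence into a fixed-length motion feature and queries a tree-based index for the K nearest examples. The feature extractor and the scikit-learn index are illustrative assumptions; the paper's actual retrieval module may differ.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def encode_motion(seq):
    """Illustrative feature: mean absolute frame difference, flattened."""
    seq = np.asarray(seq, dtype=np.float32)            # (T, H, W, C)
    return np.abs(np.diff(seq, axis=0)).mean(axis=0).ravel()

def build_index(train_sequences):
    """Encode the training pool once and build a tree-based index."""
    feats = np.stack([encode_motion(s) for s in train_sequences])
    return NearestNeighbors(n_neighbors=5, algorithm="kd_tree").fit(feats)

def retrieve(index, query_seq, k=5):
    """Return indices of the K training sequences most similar to the query."""
    q = encode_motion(query_seq)[None, :]
    _, idx = index.kneighbors(q, n_neighbors=k)
    return idx[0]

# Toy usage on random stand-in sequences of 6 frames each.
pool = [np.random.rand(6, 32, 32, 3) for _ in range(100)]
index = build_index(pool)
print(retrieve(index, np.random.rand(6, 32, 32, 3), k=3))
```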
Methods
  • Given M consecutive frames as inputs, the goal is to predict the future N frames in a pixel-wise manner.
  • Suppose the input frames X are of length M, i.e., X = {x_t}_{t=1}^{M} ∈ ℝ^{W×H×C×M}, where W, H, C are the image width, height and channel respectively.
  • The prediction output Y is of length N, i.e., Y = {y_t}_{t=1}^{N} ∈ ℝ^{W×H×C×N}.
  • The authors denote the whole training set as D_s.
  • Details are presented in the following subsections; a minimal sketch of the problem setup follows this list
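For concreteness, the shapes above can be pinned down with a tiny sketch: a predictor maps an input tensor of shape W×H×C×M to an output of shape W×H×C×N. The last-frame-copy predictor below is a placeholder assumption used only to illustrate the interface, not the authors' model.

```python
import numpy as np

W, H, C = 64, 64, 3   # image width, height, channels
M, N = 5, 10          # number of context frames / frames to predict

def predict(X):
    """Map context frames X (W, H, C, M) to future frames Y (W, H, C, N).

    Placeholder: copies the last observed frame N times. A real model would
    replace this with a learned, example-guided predictor.
    """
    assert X.shape == (W, H, C, M)
    last = X[..., -1:]                  # (W, H, C, 1)
    return np.repeat(last, N, axis=-1)  # (W, H, C, N)

X = np.random.rand(W, H, C, M).astype(np.float32)
print(predict(X).shape)  # (64, 64, 3, 10)
```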
Conclusion
  • Note that the retrieval module introduced in this work is an additional step compared to the majority of previous methods.
  • One concern is retrieval time, which directly affects the efficiency of the whole model.
  • Example-Guided Multi-modal Prediction: in this work, the authors present a simple yet effective framework for multi-modal video prediction, which mainly focuses on the capability of multi-modal distribution modelling.
  • The authors first retrieve similar examples from the training set and use these retrieved sequences to explicitly construct a distribution target (a sketch of this idea follows the list).
  • With the proposed optimization method based on a stochastic process, the model achieves promising performance in both prediction accuracy and visual quality
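The summary does not spell out the exact construction, so the following is only an assumed illustration of turning retrieved examples into an explicit distribution target: encode the retrieved futures, fit a diagonal Gaussian to them, and penalize the KL divergence from the predictor's output distribution to that target. The Gaussian choice and all function names are assumptions, not the paper's formulation.

```python
import torch

def gaussian_target(retrieved_feats, eps=1e-4):
    """Fit a diagonal Gaussian to encoded futures of retrieved examples.

    retrieved_feats: (K, D) features of the K retrieved future sequences.
    Returns mean and variance of an explicit, example-based target distribution.
    """
    mu = retrieved_feats.mean(dim=0)
    var = retrieved_feats.var(dim=0, unbiased=False) + eps
    return mu, var

def kl_to_target(pred_mu, pred_logvar, tgt_mu, tgt_var):
    """KL( N(pred_mu, exp(pred_logvar)) || N(tgt_mu, tgt_var) ), summed over dims."""
    pred_var = pred_logvar.exp()
    kl = 0.5 * (torch.log(tgt_var) - pred_logvar
                + (pred_var + (pred_mu - tgt_mu) ** 2) / tgt_var - 1.0)
    return kl.sum()

# Toy usage with random stand-in features (K retrieved examples, D dims).
K, D = 5, 32
tgt_mu, tgt_var = gaussian_target(torch.randn(K, D))
loss = kl_to_target(torch.zeros(D), torch.zeros(D), tgt_mu, tgt_var)
print(float(loss))
```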
Tables
  • Table1: Prediction accuracy on the MovingMNIST dataset (Srivastava et al., 2015) in terms of PSNR. Mode refers to the experiment setting, i.e., stochastic (S) or deterministic (D). We compare our model with SVG-LP (Denton & Fergus, 2018) and DFN (Jia et al., 2016)
  • Table2: Quantitative evaluation of predicted sequences in terms of Fréchet Video Distance (FVD) (Unterthiner et al., 2018) (lower is better) and action recognition accuracy (higher is better). Previous works [1]-[4] refer to (Li et al., 2018; Wichers et al., 2018; Villegas et al., 2017b) respectively. The experiment is conducted on the PennAction dataset (Zhang et al., 2013)
  • Table3: Influence of the example number K, evaluated in terms of PSNR (first row) and SSIM (second row) on the RobotPush dataset (Ebert et al., 2017). Note that each number reported in this table is averaged over the whole predicted sequence; a metric sketch is given after this list
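For reference, per-frame accuracy numbers like those in Tables 1 and 3 are typically computed as below: PSNR and SSIM per frame (here via scikit-image), averaged over the predicted sequence. This mirrors common practice and is not the authors' exact evaluation code.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def sequence_metrics(pred, gt):
    """Average PSNR/SSIM over a predicted sequence.

    pred, gt: arrays of shape (N, H, W, C) with values in [0, 1].
    """
    psnrs, ssims = [], []
    for p, g in zip(pred, gt):
        psnrs.append(peak_signal_noise_ratio(g, p, data_range=1.0))
        ssims.append(structural_similarity(g, p, channel_axis=-1, data_range=1.0))
    return float(np.mean(psnrs)), float(np.mean(ssims))

# Toy usage with random frames (10 predicted frames of size 64x64x3).
gt = np.random.rand(10, 64, 64, 3).astype(np.float32)
pred = np.clip(gt + 0.05 * np.random.randn(*gt.shape).astype(np.float32), 0, 1)
print(sequence_metrics(pred, gt))
```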
Related work
  • Distribution Modelling with Stochastic Processes. In this field, one major direction is based on the Gaussian process (GP) (Rasmussen & Williams, 2006). Wang et al. (2005) propose to extend the basic GP model with a dynamical formulation, which demonstrates an appealing ability to learn the diversity of human motion. Another promising branch is the determinantal point process (DPP) (Affandi et al., 2014; Elfeki et al., 2019), which promotes diversity of the modelled distribution by incorporating a penalty term during optimization. Recently, the combination of stochastic processes and deep neural networks, e.g., the neural process (Garnelo et al., 2018), has opened a new route towards applying stochastic processes to large-scale data: the neural process combines the best of both worlds, the data-driven uncertainty modelling of stochastic processes and the end-to-end training of deep models on large-scale data. Our work, which treads a similar path, focuses on distribution modelling of real-world motion sequences. A minimal GP sketch illustrating this kind of uncertainty modelling follows.
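For readers unfamiliar with GP-based distribution modelling, the toy sketch below fits a Gaussian process to a short 1-D trajectory and reads off a predictive mean and standard deviation, i.e., the data-driven uncertainty mentioned above. The kernel and the scikit-learn implementation are illustrative choices unrelated to the cited works.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy 1-D "motion" signal observed at a few past time steps.
t_obs = np.linspace(0, 1, 8)[:, None]
y_obs = np.sin(2 * np.pi * t_obs).ravel() + 0.05 * np.random.randn(8)

# Fit a GP: the RBF kernel encodes smoothness, the white term models noise.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2) + WhiteKernel(1e-3))
gp.fit(t_obs, y_obs)

# Predictive mean and per-point uncertainty at future time steps.
t_future = np.linspace(1, 1.5, 5)[:, None]
mean, std = gp.predict(t_future, return_std=True)
print(np.round(mean, 3), np.round(std, 3))
```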
Reference
  • Affandi, R. H., Fox, E. B., Adams, R. P., and Taskar, B. Learning the parameters of determinantal point process kernels. In ICML, 2014.
  • Babaeizadeh, M., Finn, C., Erhan, D., Campbell, R. H., and Levine, S. Stochastic variational video prediction. In ICLR, 2018.
  • Byeon, W., Wang, Q., Srivastava, R. K., and Koumoutsakos, P. ContextVP: Fully context-aware video prediction. In ECCV, 2018.
  • Denton, E. and Fergus, R. Stochastic video generation with a learned prior. In ICML, 2018.
  • Denton, E. L. and Birodkar, V. Unsupervised learning of disentangled representations from video. In NeurIPS, 2017.
  • Ebert, F., Finn, C., Lee, A. X., and Levine, S. Self-supervised visual planning with temporal skip connections. In CoRL, 2017.
  • Elfeki, M., Couprie, C., Riviere, M., and Elhoseiny, M. GDPP: Learning diverse generations using determinantal point processes. In ICML, 2019.
  • Finn, C., Goodfellow, I. J., and Levine, S. Unsupervised learning for physical interaction through video prediction. In NeurIPS, 2016.
  • Gao, H., Xu, H., Cai, Q., Wang, R., Yu, F., and Darrell, T. Disentangling propagation and generation for video prediction. CoRR, 2018.
  • Garnelo, M., Rosenbaum, D., Maddison, C., Ramalho, T., Saxton, D., Shanahan, M., Teh, Y. W., Rezende, D. J., and Eslami, S. M. A. Conditional neural processes. In ICML, 2018.
  • Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. C., and Bengio, Y. Generative adversarial nets. In NeurIPS, 2014.
  • Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 1997.
  • Jia, X., Brabandere, B. D., Tuytelaars, T., and Gool, L. V. Dynamic filter networks. In NeurIPS, 2016.
  • Kim, Y., Nam, S., Cho, I., and Kim, S. J. Unsupervised keypoint learning for guiding class-conditional video prediction. In NeurIPS, 2019.
  • Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In ICLR, 2014.
  • Kullback, S. and Leibler, R. A. On information and sufficiency. Annals of Mathematical Statistics, 1951.
  • Kumar, M., Babaeizadeh, M., Erhan, D., Finn, C., Levine, S., Dinh, L., and Kingma, D. VideoFlow: A conditional flow-based model for stochastic video generation. In ICLR, 2020.
  • Kurutach, T., Tamar, A., Yang, G., Russell, S. J., and Abbeel, P. Learning plannable representations with causal InfoGAN. In NeurIPS, 2018.
  • Lee, A. X., Zhang, R., Ebert, F., Abbeel, P., Finn, C., and Levine, S. Stochastic adversarial video prediction. CoRR, 2018.
  • Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., and Yang, M. Flow-grounded spatial-temporal video prediction from still images. In ECCV, 2018.
  • Lotter, W., Kreiman, G., and Cox, D. D. Deep predictive coding networks for video prediction and unsupervised learning. In ICLR, 2017.
  • Luc, P., Neverova, N., Couprie, C., Verbeek, J., and LeCun, Y. Predicting deeper into the future of semantic segmentation. In ICCV, 2017.
  • Nair, A., Pong, V., Dalal, M., Bahl, S., Lin, S., and Levine, S. Visual reinforcement learning with imagined goals. In NeurIPS, 2018.
  • Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-driven exploration by self-supervised prediction. In ICML, 2017.
  • Rasmussen, C. E. and Williams, C. K. I. Gaussian processes for machine learning. 2006.
  • Shi, X., Chen, Z., Wang, H., Yeung, D., Wong, W., and Woo, W. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NeurIPS, 2015.
  • Srivastava, N., Mansimov, E., and Salakhutdinov, R. Unsupervised learning of video representations using LSTMs. In ICML, 2015.
  • Tang, Y. C. and Salakhutdinov, R. Multiple futures prediction. In NeurIPS, 2019.
  • Tulyakov, S., Liu, M., Yang, X., and Kautz, J. MoCoGAN: Decomposing motion and content for video generation. In CVPR, 2018.
  • Unterthiner, T., van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., and Gelly, S. Towards accurate generative models of video: A new metric & challenges. CoRR, 2018.
  • Villegas, R., Yang, J., Hong, S., Lin, X., and Lee, H. Decomposing motion and content for natural video sequence prediction. In ICLR, 2017a.
  • Villegas, R., Yang, J., Zou, Y., Sohn, S., Lin, X., and Lee, H. Learning to generate long-term future via hierarchical prediction. In ICML, 2017b.
  • Villegas, R., Pathak, A., Kannan, H., Erhan, D., Le, Q. V., and Lee, H. High fidelity video prediction with large stochastic recurrent neural networks. In NeurIPS, 2019.
  • Wang, J. M., Fleet, D. J., and Hertzmann, A. Gaussian process dynamical models. In NeurIPS, 2005.
  • Wang, Y., Long, M., Wang, J., Gao, Z., and Yu, P. S. PredRNN: Recurrent neural networks for predictive learning using spatiotemporal LSTMs. In NeurIPS, 2017.
  • Wang, Y., Jiang, L., Yang, M., Li, L., Long, M., and Fei-Fei, L. Eidetic 3D LSTM: A model for video prediction and beyond. In ICLR, 2019.
  • Wichers, N., Villegas, R., Erhan, D., and Lee, H. Hierarchical long-term video prediction without supervision. In ICML, 2018.
  • Xu, J., Ni, B., Li, Z., Cheng, S., and Yang, X. Structure preserving video prediction. In CVPR, 2018a.
  • Xu, J., Ni, B., and Yang, X. Video prediction via selective sampling. In NeurIPS, 2018b.
  • Xu, J., Yu, Z., Ni, B., Yang, J., Yang, X., and Zhang, W. Deep kinematics analysis for monocular 3D human pose estimation. In CVPR, 2020.
  • Yan, Y., Xu, J., Ni, B., Zhang, W., and Yang, X. Skeleton-aided articulated motion generation. In ACM MM, 2017.
  • Ye, Y., Singh, M., Gupta, A., and Tulsiani, S. Compositional video prediction. In ICCV, 2019.
  • Zhang, W., Zhu, M., and Derpanis, K. G. From actemes to action: A strongly-supervised representation for detailed action understanding. In ICCV, 2013.