Eidetic 3D LSTM: A Model for Video Prediction and Beyond

ICLR, 2019.

Abstract:

Spatiotemporal predictive learning, though long considered to be a promising self-supervised feature learning method, seldom shows its effectiveness beyond future video prediction. The reason is that it is difficult to learn good representations for both short-term frame dependency and long-term high-level relations. We present a new model...

Introduction
  • A fundamental problem in spatiotemporal predictive learning is how to effectively learn good representations for video inference or reasoning.
  • Variants of 3D-CNNs, such as Inflated 3D-CNNs, have significantly increased action classification accuracy on the UCF-101 and Kinetics datasets.
  • These 3D-CNN architectures have no recurrent structures; instead, they employ 3D convolution (3D-Conv) and 3D pooling operations to preserve temporal information of the input sequences that would otherwise be discarded by classical 2D convolution operations.
Highlights
  • A fundamental problem in spatiotemporal predictive learning is how to effectively learn good representations for video inference or reasoning
  • Motivated by the recent success of 3D convolutional neural networks (3D-CNNs), in this paper we propose a new model for spatiotemporal predictive learning based on both recurrent modeling and feed-forward 3D convolution modeling
  • We propose a “deeper” integration of 3D convolution inside the LSTM unit in order to incorporate the convolutional features into the recurrent state transition over time (a minimal sketch of this idea follows this list)
  • We report mean squared error (MSE) at every time stamp in Table 4, where lower scores indicate better prediction results
  • We presented the E3D-LSTM model, based on 3D convolutional recurrent units, for spatiotemporal predictive learning
  • Experimental results demonstrate that the E3D-LSTM model performs favorably against the state-of-the-art methods on video prediction and early activity recognition tasks
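As a rough illustration of what a 3D-convolutional recurrent state transition looks like, below is a minimal sketch of a ConvLSTM-style cell whose gates are computed with 3D convolutions over short clips. It is written in PyTorch for concreteness; the class name Conv3DLSTMCell, the kernel shape, and the layer sizes are illustrative assumptions, and the sketch omits the eidetic memory attention and other components of the actual E3D-LSTM unit.

    # Illustrative sketch only (PyTorch assumed), not the paper's exact E3D-LSTM cell.
    import torch
    import torch.nn as nn

    class Conv3DLSTMCell(nn.Module):
        def __init__(self, in_channels, hidden_channels, kernel_size=(3, 5, 5)):
            super().__init__()
            padding = tuple(k // 2 for k in kernel_size)  # keep clip length and spatial size
            # A single 3D convolution produces all four gate pre-activations at once.
            self.gates = nn.Conv3d(in_channels + hidden_channels,
                                   4 * hidden_channels,
                                   kernel_size, padding=padding)

        def forward(self, x, h, c):
            # x, h, c: tensors of shape (batch, channels, clip_length, height, width)
            i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
            c_next = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
            h_next = torch.sigmoid(o) * torch.tanh(c_next)
            return h_next, c_next

A cell like this would consume the 3D-Conv encoding of a short frame window at each step, carrying hidden and cell states that preserve the temporal dimension instead of collapsing it as a 2D ConvLSTM would.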
Methods
  • The authors evaluate the proposed E3D-LSTM model on two tasks: future video prediction and early activity recognition.
  • These two tasks are of great importance with numerous applications that require effective spatiotemporal predictive models.
  • [Figure: qualitative results on Moving MNIST — inputs and predictions for two sequences, comparing ConvLSTM, the VPN baseline, PredRNN, PredRNN++, and the proposed model under (a) the 10 → 10 prediction setting and (b) the copy test]
Results
  • Consistent with the observations on the Moving MNIST dataset, the E3D-LSTM model performs favorably against the state-of-the-art methods across three settings: predicting the next 10 frames, predicting the next 20 frames, and the copy test.
  • These empirical results demonstrate the effectiveness of the E3D-LSTM model for modeling spatiotemporal data.
  • Table 5 shows the classification accuracy of the E3D-LSTM network against the state-of-the-art feed-forward 3D-CNNs. The E3D-LSTM model performs favorably against the other methods in two settings, using only the first 25% and 50% of frames, showing its effectiveness in learning high-level spatiotemporal representations.
  • Note that the two compared methods are trained using different backbone networks and different dataset splits.
Conclusion
  • Spatiotemporal predictive learning has shown significant improvements in a variety of applications, such as weather forecasting, traffic flow prediction, and physical interaction simulation.
  • The authors presented the E3D-LSTM model based on 3D convolutional recurrent units for this task.
  • In this model, the authors integrated 3D-Convs into state transitions to perceive short-term motions and designed a memory attentive module, controlled by recurrent gates, to capture long-term video frame interactions (a hedged sketch of such a recall mechanism follows this list).
  • Experimental results demonstrate that the E3D-LSTM model performs favorably against the state-of-the-art methods on video prediction and early activity recognition tasks
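The memory-attentive recall mentioned above can be pictured as the current encoded input attending over a window of past memory states, so that temporally distant frames can contribute to the new memory. The following is a hedged sketch under that reading; the flattened dot-product attention, tensor shapes, and the function name attentive_recall are illustrative assumptions rather than the paper's exact gating formulation.

    # Hedged sketch of an attention read-out over past memory states (PyTorch assumed).
    import torch

    def attentive_recall(query, past_memories):
        # query:         (B, C, T, H, W)        encoded current input
        # past_memories: (B, tau, C, T, H, W)   memory states from the last tau steps
        b, tau = past_memories.shape[:2]
        q = query.reshape(b, 1, -1)                           # (B, 1, C*T*H*W)
        m = past_memories.reshape(b, tau, -1)                 # (B, tau, C*T*H*W)
        attn = torch.softmax(q @ m.transpose(1, 2), dim=-1)   # (B, 1, tau) attention weights
        recalled = (attn @ m).reshape_as(query)               # weighted blend of past memories
        return recalled

In the full model this kind of recalled memory is combined with gated updates of the cell state; the sketch only shows the attention read-out itself.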
Tables
  • Table1: Results on the Moving MNIST dataset. All models, except DFN and VPN, are trained with a comparable number of parameters. Higher SSIM or lower MSE scores indicate better results (both metrics are recalled after this list)
  • Table2: Ablation study on the Moving MNIST dataset (10 → 10)
  • Table3: Quantitative evaluation of different methods on the KTH human action test set. The metrics are averaged over the predicted frames. Higher scores indicate better prediction results
  • Table4: Experimental results on the TaxiBJ dataset. We report MSE at every time stamp
  • Table5: Early activity recognition accuracy on the 41-category subset of Something-Something
  • Table6: Ablation study of early activity recognition on the Something-Something dataset
  • Table7: Accuracy comparisons of different training strategies on the Something-Something dataset
  • Table8: Online early recognition accuracy: the classifier is built on the last 5 recurrent output states
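For reference, the SSIM and MSE metrics used in Tables 1 and 3 are the standard definitions: for a ground-truth frame x and predicted frame \hat{x} with N pixels,

    \mathrm{MSE}(x, \hat{x}) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{x}_i)^2,
    \mathrm{SSIM}(x, \hat{x}) = \frac{(2\mu_x \mu_{\hat{x}} + c_1)(2\sigma_{x\hat{x}} + c_2)}{(\mu_x^2 + \mu_{\hat{x}}^2 + c_1)(\sigma_x^2 + \sigma_{\hat{x}}^2 + c_2)},

where \mu, \sigma^2, and \sigma_{x\hat{x}} denote the means, variances, and covariance of the two frames, and c_1, c_2 are small stabilizing constants (Wang et al., 2004).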
Related work

    Spatiotemporal Predictive Learning Models. In recent years, RNNs have been extensively used in sequence prediction and future frame prediction. Srivastava et al. (2015) extended the LSTM-based sequence-to-sequence model (Sutskever et al., 2014) from language modeling to learning video representations. Shi et al. (2015) proposed the convolutional LSTM by integrating convolutions into recurrent state transitions for high-dimensional sequence prediction. The convolutional LSTM model was extended by Finn et al. (2016) to predict future states of robotic environments. Villegas et al. (2017) leveraged optical flow to help capture short-term video dynamics for video prediction. Xu et al. (2018) proposed a two-stream RNN that deals with structural video content in separate streams. Kalchbrenner et al. (2017) introduced a sophisticated model that extends recurrent structures to estimate local dependencies between adjacent pixels. While this video pixel network (VPN) model is able to describe image sequences, its computational load is prohibitively high.
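    For reference, the convolutional LSTM of Shi et al. (2015) replaces the matrix multiplications of a standard LSTM with convolutions. Its gate equations (peephole terms omitted for brevity, with * denoting convolution and \circ the Hadamard product) are:

    i_t = \sigma(W_{xi} * X_t + W_{hi} * H_{t-1} + b_i)
    f_t = \sigma(W_{xf} * X_t + W_{hf} * H_{t-1} + b_f)
    C_t = f_t \circ C_{t-1} + i_t \circ \tanh(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c)
    o_t = \sigma(W_{xo} * X_t + W_{ho} * H_{t-1} + b_o)
    H_t = o_t \circ \tanh(C_t)

    The E3D-LSTM builds on this idea by making the inputs and recurrent state transitions 3D, operating on short clips rather than single frames.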
Funding
  • Mingsheng Long was supported by National Natural Science Foundation of China (61772299, 71690231)
Reference
  • Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
  • Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS, 2015.
  • Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In ICML, 2009.
  • Prateep Bhattacharjee and Sukhendu Das. Temporal coherency based criteria for predicting video frames using deep multi-stage generative adversarial networks. In NIPS, 2017.
  • Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
  • Bert De Brabandere, Xu Jia, Tinne Tuytelaars, and Luc Van Gool. Dynamic filter networks. In NIPS, 2016.
  • Emily Denton and Rob Fergus. Stochastic video generation with a learned prior. In ICML, 2018.
  • Ali Diba, Ali Mohammad Pazandeh, and Luc Van Gool. Efficient two-stream motion and appearance 3D CNNs for video classification. In ECCV Workshop, 2016.
  • Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In NIPS, 2016.
  • Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, 2017.
  • Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3D convolutional neural networks for human action recognition. TPAMI, 35(1):221–231, 2013.
  • Nal Kalchbrenner, Aaron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video pixel networks. In ICML, 2017.
  • Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. In ICLR, 2017.
  • Chaochao Lu, Michael Hirsch, and Bernhard Schölkopf. Flexible spatio-temporal networks for video prediction. In CVPR, 2017.
  • Shugao Ma, Leonid Sigal, and Stan Sclaroff. Learning activity progression in LSTMs for activity detection and early detection. In CVPR, 2016.
  • Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.
  • Marc Oliu, Javier Selva, and Sergio Escalera. Folded recurrent neural networks for future video prediction. In ECCV, 2018.
  • Zhaofan Qiu, Ting Yao, and Tao Mei. Learning spatio-temporal representation with pseudo-3D residual networks. In ICCV, 2017.
  • Christian Schuldt, Ivan Laptev, and Barbara Caputo. Recognizing human actions: A local SVM approach. In ICPR, 2004.
  • Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NIPS, 2015.
  • Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using LSTMs. In ICML, 2015.
  • Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
  • Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
  • Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 2018.
  • Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. MoCoGAN: Decomposing motion and content for video generation. In CVPR, 2018.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
  • Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, and Honglak Lee. Decomposing motion and content for natural video sequence prediction. In ICLR, 2017.
  • Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In NIPS, 2016.
  • Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018a.
  • Yunbo Wang, Mingsheng Long, Jianmin Wang, Zhifeng Gao, and Philip S Yu. PredRNN: Recurrent neural networks for predictive learning using spatiotemporal LSTMs. In NIPS, 2017.
  • Yunbo Wang, Zhifeng Gao, Mingsheng Long, Jianmin Wang, and Philip S Yu. PredRNN++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. In ICML, 2018b.
  • Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. TIP, 13(4):600–612, 2004.
  • Nevan Wichers, Ruben Villegas, Dumitru Erhan, and Honglak Lee. Hierarchical long-term video prediction without supervision. In ICML, 2018.
  • Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning for video understanding. In ECCV, 2018.
  • Jingwei Xu, Bingbing Ni, Zefan Li, Shuo Cheng, and Xiaokang Yang. Structure preserving video prediction. In CVPR, 2018.
  • Kuo-Hao Zeng, William B Shen, De-An Huang, Min Sun, and Juan Carlos Niebles. Visual forecasting by imitating dynamics in natural sequences. In ICCV, 2017.
  • Junbo Zhang, Yu Zheng, and Dekang Qi. Deep spatio-temporal residual networks for citywide crowd flows prediction. In AAAI, 2017.
  • Bolei Zhou, Alex Andonian, and Antonio Torralba. Temporal relational reasoning in videos. In ECCV, 2018.