Few-Shot Video Classification via Temporal Alignment

CVPR 2020.

Keywords:
temporal ordering information, compound memory network, JD.com American Technologies Corporation, multilayer perceptrons, previous work

Abstract:

There is a growing interest in learning models that can recognize novel classes with only a few labeled examples. In this paper, we propose the Temporal Alignment Module (TAM), a novel few-shot learning framework that can learn to classify a previously unseen video. While most previous works neglect long-term temporal ordering information...

Introduction
  • The emergence of deep learning has greatly advanced the frontiers of action recognition [41, 4].
  • In order to recognize novel classes that a pretrained network has not seen before, the authors typically need to manually collect hundreds of video samples for knowledge transfer.
  • Such a procedure is tedious and labor-intensive, especially for videos, where the difficulty and cost of labeling are much higher than for images.
  • Under the setup of meta-learning based few-shot learning, the model is explicitly trained to
Highlights
  • The emergence of deep learning has greatly advanced the frontiers of action recognition [41, 4]
  • Our main contributions are: (i) we are the first to explicitly address the issue of non-linear temporal variations in the few-shot video classification setting; (ii) we propose the Temporal Alignment Module (TAM), a data-efficient few-shot learning framework that can dynamically align two video sequences while preserving the temporal ordering, which is often neglected in previous works; (iii) we use continuous relaxation to make our model fully differentiable and show that it outperforms previous state-of-the-art methods by a large margin on two challenging datasets (a minimal sketch of such a relaxation follows these highlights)
  • We show qualitative results comparing the Compound Memory Network (CMN) and the Temporal Alignment Module (TAM) in Fig. 4
  • We propose the Temporal Alignment Module (TAM), a novel few-shot framework that can explicitly learn a distance measure and representation independent of non-linear temporal variations in videos using very little data
  • In contrast to previous works, the Temporal Alignment Module dynamically aligns two video sequences while preserving the temporal ordering, and it further uses continuous relaxation to directly optimize the few-shot learning objective in an end-to-end fashion
  • Our results and ablations show that our model significantly outperforms a wide range of competitive baselines and achieves state-of-the-art results on two challenging real-world datasets
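The continuous relaxation mentioned in the highlights can be illustrated with a small sketch. The paper does not include code here, so the snippet below is only a minimal, assumed illustration in the style of soft/differentiable dynamic time warping [22, 5]: the hard min in the DTW recursion is replaced by a smooth log-sum-exp so the cumulative, order-preserving alignment cost becomes differentiable. The names (`soft_min`, `soft_dtw_distance`), the smoothing parameter `gamma`, and the exact move set are assumptions, not the authors' implementation.

```python
import torch

def soft_min(values, gamma):
    # Smooth relaxation of min: -gamma * log(sum(exp(-v / gamma))).
    # As gamma -> 0 this approaches the hard minimum while staying differentiable.
    return -gamma * torch.logsumexp(-torch.stack(values) / gamma, dim=0)

def soft_dtw_distance(D, gamma=0.1):
    """Differentiable cumulative alignment cost for a frame-wise distance matrix.

    D: (T_q, T_s) tensor of distances between query and support frames.
    Only monotone moves (diagonal, down, right) are allowed, so the resulting
    score respects the temporal ordering of both videos.
    """
    T_q, T_s = D.shape
    inf = torch.full((), float("inf"), dtype=D.dtype, device=D.device)
    # R[i][j]: soft cost of aligning the first i query frames with the first j support frames.
    R = [[inf] * (T_s + 1) for _ in range(T_q + 1)]
    R[0][0] = torch.zeros((), dtype=D.dtype, device=D.device)
    for i in range(1, T_q + 1):
        for j in range(1, T_s + 1):
            R[i][j] = D[i - 1, j - 1] + soft_min(
                [R[i - 1][j], R[i][j - 1], R[i - 1][j - 1]], gamma
            )
    return R[T_q][T_s]
```

As `gamma` shrinks toward zero the score approaches the ordinary DTW cost, so the relaxation trades a little smoothing for end-to-end differentiability.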
Methods
  • The authors' goal is to learn a model which can classify novel classes of videos with only a few labeled examples.
  • The wide range of intra-class spatial-temporal variations in videos poses great challenges for few-shot video classification.
  • The authors address this challenge by proposing a few-shot learning framework with the Temporal Alignment Module (TAM), which is, to the best of the authors' knowledge, the first model that can explicitly learn a distance measure independent of non-linear temporal variations in videos.
  • The authors will first provide a problem formulation of the few-shot video classification task, then define the model and show how it is used at training and test time (a rough sketch of the episodic classification step follows)
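As a rough illustration of how such a model could be used at training and test time in the standard n-way episodic setup, the sketch below scores a query video against each support class by the alignment cost of their per-frame embeddings and converts the negative distances into class probabilities. The cosine frame distance, the 1-shot simplification, and the helper names are assumptions; it reuses the `soft_dtw_distance` sketch above.

```python
import torch
import torch.nn.functional as F

def frame_distance_matrix(query_feats, support_feats):
    """Frame-wise distance between two videos as 1 - cosine similarity.

    query_feats:   (T_q, C) per-frame embeddings of the query video.
    support_feats: (T_s, C) per-frame embeddings of one support video.
    """
    q = F.normalize(query_feats, dim=-1)
    s = F.normalize(support_feats, dim=-1)
    return 1.0 - q @ s.t()

def classify_episode(query_feats, support_videos, gamma=0.1):
    """Score one query against an n-way, 1-shot support set.

    support_videos: list of (T_s, C) tensors, one per class.
    Returns class probabilities from a softmax over negative alignment distances.
    """
    dists = torch.stack([
        soft_dtw_distance(frame_distance_matrix(query_feats, sv), gamma)
        for sv in support_videos
    ])
    return F.softmax(-dists, dim=0)
```

At meta-training time, the cross-entropy between these probabilities and the query label would be back-propagated through both the alignment and the frame encoder, which is the end-to-end property the highlights describe.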
Results
  • Qualitative Results and Visualizations

    The authors show qualitative results comparing CMN and TAM in Fig. 4.
  • The authors observe that CMN has difficulty differentiating two actions from different classes that share very similar visual cues across all frames, e.g., backgrounds.
  • As can be seen from the distance matrices in Fig. 4, although the method cannot change the fact that two visually similar action clips have a lower average frame-wise distance, it is able to find a temporal alignment that minimizes the cumulative distance between the query video and the video from the true support class, even when the per-frame visual cues alone are not discriminative enough (a generic sketch of recovering such an alignment path follows these results).
  • The authors have shown in Section 4.3 that explicitly modeling the temporal ordering improves few-shot video classification performance.
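For visualizations such as the distance matrices in Fig. 4, the minimum-cost alignment path itself can be recovered with standard (hard) DTW backtracking. The sketch below is a generic NumPy illustration of that idea, not the authors' visualization code; the move set and function name are assumptions.

```python
import numpy as np

def dtw_alignment_path(D):
    """Return the minimum-cost monotone alignment path through a distance matrix.

    D: (T_q, T_s) array of frame-wise distances. The returned list of
    (query_frame, support_frame) pairs is the path that minimizes the
    cumulative distance, i.e. the band typically overlaid on the matrix.
    """
    T_q, T_s = D.shape
    R = np.full((T_q + 1, T_s + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, T_q + 1):
        for j in range(1, T_s + 1):
            R[i, j] = D[i - 1, j - 1] + min(R[i - 1, j], R[i, j - 1], R[i - 1, j - 1])
    # Backtrack from the end, always stepping to the cheapest predecessor.
    path, i, j = [], T_q, T_s
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = {(i - 1, j): R[i - 1, j],
                 (i, j - 1): R[i, j - 1],
                 (i - 1, j - 1): R[i - 1, j - 1]}
        i, j = min(moves, key=moves.get)
    return path[::-1]
```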
Conclusion
  • The authors propose the Temporal Alignment Module (TAM), a novel few-shot framework that can explicitly learn a distance measure and representation independent of non-linear temporal variations in videos using very little data.
  • The authors' results and ablations show that the model significantly outperforms a wide range of competitive baselines and achieves state-of-the-art results on two challenging real-world datasets
Tables
  • Table1: Few-shot video classification results. We report 5-way video classification accuracy on the meta-testing set
  • Table2: Temporal matching ablation study. We compare our method to temporal-agnostic and temporal-aware baselines
Related work
  • Few-Shot Learning. To address few-shot learning, a direct approach is to train a model on the base training set and fine-tune it with the few labeled examples from the novel classes. Since the data in the novel classes are not sufficient to fine-tune the model with general learning techniques, methods have been proposed to learn a good initialization [9, 26, 32] or to develop a novel optimizer [30, 25]. These works aim to ease the difficulty of fine-tuning with limited samples, but they still suffer from overfitting when the training data in the novel classes are scarce yet highly varied. Another branch of work, which learns a common metric for both seen and novel classes, can avoid overfitting to some extent: the Convolutional Siamese Net [20] trains a Siamese network to compare two samples, Matching Networks [39] employ an attention kernel to measure distance, the Prototypical Network [35] uses the Euclidean distance to the class center, and Graph Neural Networks [10] construct a weighted graph over all the data to measure similarity between samples. Other methods rely on data augmentation, learning to augment labeled data for unseen classes for supervised training [13, 44]; however, video generation remains under-explored, especially generating videos conditioned on a specific category. Thus, in this paper, we adopt the metric-learning approach and design a temporally aligned video metric for few-shot video classification (a generic sketch of the metric-learning recipe follows this paragraph).
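For context on the metric-learning family described above, a prototypical-network style classifier [35] collapses each class's support examples into a mean embedding and scores queries by negative squared Euclidean distance. The sketch below is a generic illustration of that recipe (the function name and tensor shapes are assumptions, not code from the cited work); TAM keeps the same episodic setup but replaces the single pooled embedding per video with a sequence of frame embeddings compared through temporal alignment.

```python
import torch

def prototypical_logits(query_emb, support_embs, support_labels, n_way):
    """Generic prototypical-network scoring.

    query_emb:      (Q, C) embeddings of query examples.
    support_embs:   (N*K, C) embeddings of support examples.
    support_labels: (N*K,) integer class labels in [0, n_way).
    Returns (Q, n_way) logits = negative squared Euclidean distance to prototypes.
    """
    prototypes = torch.stack([
        support_embs[support_labels == c].mean(dim=0)   # class center, shape (C,)
        for c in range(n_way)
    ])
    return -torch.cdist(query_emb, prototypes) ** 2
```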
Funding
  • This work has been partially supported by JD.com American Technologies Corporation (JD) under the SAIL-JD AI Research Initiative
Reference
  • The 20bn-jester dataset. https://20bn.com/datasets/jester. 5
  • S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic acids research, 25(17):3389–3402, 1997. 3
  • L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010. 6
  • J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017. 1, 2, 4
  • C.-Y. Chang, D.-A. Huang, Y. Sui, L. Fei-Fei, and J. C. Niebles. D3TW: Discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation. arXiv preprint arXiv:1901.02598, 2019. 3, 5, 8
  • W.-Y. Chen, Y.-C. Liu, Z. Kira, Y.-C. Wang, and J.-B. Huang. A closer look at few-shot classification. In International Conference on Learning Representations, 2019. 1, 6
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. 2009. 6
  • P. Dogan, B. Li, L. Sigal, and M. Gross. A neural multi-sequence alignment technique (NeuMATCH). In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8749–8758, 2018.
  • C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1126–1135. JMLR.org, 2017. 2, 6
  • V. Garcia and J. Bruna. Few-shot learning with graph neural networks. In ICLR, 2017. 1, 2
  • S. Gidaris and N. Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4367–4375, 2018. 6
  • R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, volume 2, page 8, 2017. 2, 5, 6
  • B. Hariharan and R. Girshick. Low-shot visual recognition by shrinking and hallucinating features. In Proceedings of the IEEE International Conference on Computer Vision, pages 3018–3027, 2017. 2
  • K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 6
  • Ł. Kaiser, O. Nachum, A. Roy, and S. Bengio. Learning to remember rare events. arXiv preprint arXiv:1703.03129, 2017. 6
  • A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014. 5
  • W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017. 2, 5, 6
  • A. Klaser, M. Marszałek, and C. Schmid. A spatio-temporal descriptor based on 3d-gradients. In BMVC 2008-19th British Machine Vision Conference, pages 275–1. British Machine Vision Association, 2008. 2
  • O. Kliper-Gross, T. Hassner, and L. Wolf. One shot similarity metric learning for action recognition. In International Workshop on Similarity-Based Pattern Recognition, pages 31–45. Springer, 2011. 2
  • G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015. 2
  • J. Lin, C. Gan, and S. Han. Temporal shift module for efficient video understanding. arXiv preprint arXiv:1811.08383, 2018. 4
  • A. Mensch and M. Blondel. Differentiable dynamic programming for structured prediction and attention. ICML, 2018. 5, 8
  • A. Mishra, V. K. Verma, M. S. K. Reddy, S. Arulkumar, P. Rai, and A. Mittal. A generative approach to zero-shot and few-shot action recognition. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 372–380. IEEE, 2018. 2
  • M. Müller. Dynamic time warping. Information retrieval for music and motion, pages 69–84, 2007. 4
  • T. Munkhdalai and H. Yu. Meta networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2554–2563. JMLR.org, 2017. 2
  • A. Nichol and J. Schulman. Reptile: a scalable metalearning algorithm. arXiv preprint arXiv:1803.02999, 2018. 2
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017. 6
  • H. Qi, M. Brown, and D. G. Lowe. Low-shot learning with imprinted weights. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5822– 5830, 2018. 6
  • Z. Qiu, T. Yao, and T. Mei. Learning spatio-temporal representation with pseudo-3d residual networks. In proceedings of the IEEE International Conference on Computer Vision, pages 5533–5541, 2017. 3
  • S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. 2016. 2
  • A. Richard, H. Kuehne, A. Iqbal, and J. Gall. Neuralnetwork-viterbi: A framework for weakly supervised video learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7386–7395, 2018. 3
  • A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell. Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960, 2018. 2
  • P. Scovanner, S. Ali, and M. Shah. A 3-dimensional sift descriptor and its application to action recognition. In Proceedings of the 15th ACM international conference on Multimedia, pages 357–360. ACM, 2007. 2
  • G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In European Conference on Computer Vision, pages 510–526. Springer, 2016. 5
  • J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017. 2
  • K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012. 5
  • D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015. 2, 4
  • D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 6450– 6459, 2018. 3
  • O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In Advances in neural information processing systems, pages 3630–3638, 2016. 2, 6
  • H. Wang and C. Schmid. Action recognition with improved trajectories. In Proceedings of the IEEE international conference on computer vision, pages 3551–3558, 2013. 2
  • L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision, pages 20–36. Springer, 2016. 1, 2, 4, 6
  • X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794– 7803, 2018. 3, 4
  • X. Wang and A. Gupta. Videos as space-time region graphs. In Proceedings of the European Conference on Computer Vision (ECCV), pages 399–417, 2018. 3
  • Y.-X. Wang, R. Girshick, M. Hebert, and B. Hariharan. Lowshot learning from imaginary data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7278–7286, 2018. 2
  • S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 305–321, 2018. 4, 5
  • B. Zhou, A. Andonian, A. Oliva, and A. Torralba. Temporal relational reasoning in videos. In Proceedings of the European Conference on Computer Vision (ECCV), pages 803– 818, 2018. 3, 4, 5, 7
  • L. Zhu and Y. Yang. Compound memory networks for fewshot video classification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 751–766, 2018. 2, 4, 6
  • M. Zolfaghari, K. Singh, and T. Brox. Eco: Efficient convolutional network for online video understanding. In Proceedings of the European Conference on Computer Vision (ECCV), pages 695–712, 2018. 4