A Multigrid Method for Efficiently Training Video Models

CVPR, pp. 150-159, 2019.

Keywords
image model, batch size, deep network, training time, video model
TL;DR
We propose a multigrid method for fast training of video models.

Abstract

Training competitive deep video models is an order of magnitude slower than training their counterpart image models. Slow training causes long research cycles, which hinders progress in video understanding research. Following standard practice for training image models, video model training assumes a fixed mini-batch shape: a specific number of clips, frames, and spatial size.

Introduction
  • Training deep networks (CNNs [27]) on video is more computationally intensive than training 2D CNN image models, potentially by an order of magnitude.
  • A variety of considerations go into selecting this mini-batch shape (B clips of T frames at H×W spatial resolution), but a common heuristic is to make the T×H×W dimensions large in order to improve accuracy, e.g., as observed in [9, 45, 47].
  • This heuristic is only one possible choice, and in general there are trade-offs.
  • One may use a smaller number of frames and/or spatial size while simultaneously increasing the mini-batch size B.
  • With such an exchange, it is possible to process the same number of epochs with lower wall-clock time because each iteration processes more examples.
  • The resulting trade-off is faster training with lower accuracy (the sketch below illustrates the underlying constant-cost exchange).
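To make the exchange concrete, below is a minimal sketch; the baseline shape and scaling factors are hypothetical placeholders, not values from the paper. It shows that shrinking the frames and spatial crop while enlarging the mini-batch keeps the per-iteration volume B×T×H×W, and hence roughly the per-iteration compute, unchanged:

```python
# Minimal sketch: the per-iteration cost of a 3D CNN scales roughly with the
# mini-batch volume B*T*H*W, so a coarser sampling grid allows a larger
# mini-batch at approximately the same cost per iteration.

def volume(shape):
    B, T, H, W = shape
    return B * T * H * W

base   = (8, 16, 224, 224)                      # hypothetical baseline: 8 clips, 16 frames, 224x224 crops
coarse = (8 * 8, 16 // 2, 224 // 2, 224 // 2)   # halve T, H, W; enlarge the batch 8x

print(volume(base), volume(coarse))             # equal volumes -> similar FLOPs per iteration
# Each coarse iteration processes 8x more clips, so covering one epoch needs ~8x
# fewer iterations; multigrid training varies the shape over training instead of
# committing to a single point on this trade-off.
```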
Highlights
  • Training deep networks (CNNs [27]) on video is more computationally intensive than training 2D CNN image models, potentially by an order of magnitude.
  • When the mini-batch size changes due to the long cycle, we apply the linear scaling rule [13] to adjust the learning rate by the mini-batch size scaling factor. We found that this adjustment is harmful if applied to mini-batch size changes due to the short cycle, so we only adjust the learning rate when the long-cycle base shape changes (a schedule sketch follows this list).
  • Multigrid training always achieves a better trade-off than baseline training. Multigrid training with both the long and short cycles can iterate through 1.5× more epochs than the baseline method, while requiring only 1/3.4× the number of iterations and 1/4.5× the training time, and achieving higher accuracy (75.6% → 76.4%).
  • The baseline recipe trains for 100k iterations using 128 GPUs, with a mini-batch size of 2 clips per GPU (∼106 epochs) and a learning rate of 0.04, which is decreased by a factor of 10 at iterations 37.5k and 75k.
  • We propose a multigrid method for fast training of video models.
  • Our method varies the sampling grid and the mini-batch size during training, and can process the same number of epochs using a small fraction of the computation of the baseline trainer.
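As an illustration of how the long and short cycles might be combined with the linear scaling rule, here is a minimal sketch; the base shapes, batch multipliers, and spatial factors below are hypothetical placeholders, not the paper's actual schedule:

```python
# Minimal sketch (hypothetical shapes/factors): the long cycle changes the base
# mini-batch shape a few times over training, and the learning rate follows the
# long-cycle batch-size factor via the linear scaling rule. The short cycle
# changes the shape much more frequently but leaves the learning rate alone.

base_lr, base_batch = 0.04, 2                    # per-GPU values from the baseline recipe above

# Hypothetical long-cycle base shapes: (batch multiplier, frames T, height H, width W).
long_cycle = [(8, 4, 112, 112), (4, 8, 112, 112), (2, 8, 160, 160), (1, 16, 224, 224)]

# Hypothetical short-cycle steps: (extra batch multiplier, spatial scale factor).
short_cycle = [(4, 0.5), (2, 0.71), (1, 1.0)]

for long_mult, t, h, w in long_cycle:
    lr = base_lr * long_mult                     # linear scaling rule: LR tracks long-cycle batch size
    for short_mult, s in short_cycle:
        shape = (base_batch * long_mult * short_mult, t, int(h * s), int(w * s))
        print(shape, "lr =", lr)                 # LR is *not* rescaled for short-cycle changes
```

The design choice mirrored here is the one stated above: the learning rate is adjusted only when the long-cycle base shape (and thus the base mini-batch size) changes.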
Methods
  • The authors conduct ablation studies on the Kinetics-400 dataset [23], which requires classifying each video into one of 400 categories.
  • It contains ∼240k training videos and ∼20k validation videos on which the authors report results.
  • Performance is evaluated by top-1 and top-5 accuracy (a small example of computing these metrics follows this list).
  • Something-Something V2 is known to require more ‘temporal modeling’ to solve than Kinetics [49].
  • The authors use the same R50-SlowFast model [9, 18], with the same Kinetics pre-training as the Something-Something experiments.
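Since results are reported as top-1 and top-5 accuracy, the following small sketch shows how these metrics are typically computed; the random scores and labels are placeholders:

```python
import numpy as np

def topk_accuracy(scores, labels, k):
    """scores: (N, num_classes) per-video class scores; labels: (N,) ground-truth class ids."""
    topk = np.argsort(-scores, axis=1)[:, :k]        # indices of the k highest-scoring classes
    hits = (topk == labels[:, None]).any(axis=1)     # correct if the true class is among them
    return hits.mean()

scores = np.random.rand(4, 400)                      # 4 videos, 400 Kinetics classes (dummy scores)
labels = np.array([3, 17, 250, 399])                 # dummy ground-truth labels
print(topk_accuracy(scores, labels, 1), topk_accuracy(scores, labels, 5))
```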
Results
  • The authors compare multigrid training to baseline training in Fig. 3.
  • Multigrid training always achieves a better trade-off than baseline training.
  • Multigrid training with both the long and short cycles can iterate through 1.5× more epochs than the baseline method, while requiring only 1/3.4× the number of iterations and 1/4.5× the training time, and achieving higher accuracy (75.6% → 76.4%).
  • Similar to what the authors observe on Kinetics, multigrid training obtains a better trade-off than baseline training on Something-Something V2 (Table 5).
  • The default multigrid training is 5.7× faster, while achieving slightly better mAP.
  • The authors see that even for the smaller Charades dataset, with strong large-scale pre-training, multigrid training is beneficial.
Conclusion
  • The authors propose a multigrid method for fast training of video models.
  • With a single out-of-the-box setting, it works on multiple datasets and models, and consistently brings a ∼3-6× speedup with comparable or higher accuracy.
  • It works across a spectrum of hardware settings from 128 GPU distributed training to single GPU training.
Tables
  • Table 1: Ablation Study. We perform ablations on Kinetics-400 using an R50-SlowFast network. We analyze the impact of the long cycle (Table 1a) and short cycle (Table 1b) designs. All variants of multigrid training use the same number of training iterations as our default 1.5× epoch schedule; this roughly preserves the total training FLOPs. We report wall-clock speedup relative to the baseline trained for 1.0× epochs.
  • Table 2: Generalization Analysis. We study how multigrid training generalizes to models both with and without ImageNet pre-training (Table 2a) and models of different temporal (Table 2b) and spatial (Table 2c) shapes. All experiments use R50-SlowFast with results on Kinetics-400. We use the default setting for multigrid training (1.5× more epochs, corresponding to 3.4× fewer iterations than baseline) in all settings. We observe that the default choice brings consistent speedup and performance gain in all cases.
  • Table 3: Kinetics-400 accuracy with I3D and I3D-NL. While developed on SlowFast [9], multigrid training provides a consistent speedup and performance gain with I3D [3] and I3D-NL [47].
  • Table 4: Case study: 1-GPU training on Kinetics-400. Multigrid training reduces the training time from nearly 1 week to 2 days on a single GPU. We hope the reduced training time will make video understanding research more accessible and economical.
  • Table 5: Results on Something-Something V2. Multigrid training achieves a better trade-off than baseline training. Results are the mean and standard deviation over 5 runs.
  • Table 6: Results on Charades. Multigrid training shows consistent speedups compared with the other datasets. Results are the mean and standard deviation over 5 runs.
Related Work
  • Efficient training can also be advanced through, e.g., optimization methods (e.g., [8, 24, 33, 39]), pre-training [3, 11], distributed training [13, 50], or advances in hardware [22] or software/framework design [4, 6]. In this paper, we propose a complementary direction that exploits variable mini-batch shapes for fast training. Related to our method, Wang et al. [47] and Feichtenhofer et al. [10] initialize larger models with smaller fully-trained ones. These methods can potentially speed up training as well, and (as can be seen later) are a special case of multigrid training.

    Multi-scale training in segmentation [16] and classification [18, 38] uses multiple image crop sizes. However, the mini-batch shape remains fixed [16, 18, 38]. Multigrid training, on the other hand, uses variable mini-batch shapes. He et al. [17] change the input shapes but fix the mini-batch size. These methods show that training with variable scales can be beneficial. Multigrid training enjoys the same property.
Contributions
  • Proposes to use variable mini-batch shapes with different spatial-temporal resolutions that are varied according to a schedule.
  • Demonstrates a general and robust grid schedule that yields a significant out-of-the-box training speedup without a loss in accuracy for different models, datasets, and training settings.
  • Shows that the training time of state-of-the-art efficient models can still be reduced significantly.
  • Proposes a complementary direction that exploits variable mini-batch shapes for fast training.
References
  • [1] WF Briggs, VE Henson, and Stephen F McCormick. A Multigrid Tutorial, 2nd Edition. SIAM, 2000.
  • [2] Joao Carreira, Viorica Patraucean, Laurent Mazare, Andrew Zisserman, and Simon Osindero. Massively parallel video networks. In ECCV, 2018.
  • [3] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
  • [4] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. TVM: An automated end-to-end optimizing compiler for deep learning. In OSDI, 2018.
  • [5] Yunpeng Chen, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, and Jiashi Feng. Multi-fiber networks for video recognition. In ECCV, 2018.
  • [6] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.
  • [7] Dan C Ciresan, Ueli Meier, Jonathan Masci, Luca M Gambardella, and Jurgen Schmidhuber. High-performance neural networks for visual object classification. arXiv preprint arXiv:1102.0183, 2011.
  • [8] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 2011.
  • [9] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. SlowFast networks for video recognition. In ICCV, 2019.
  • [10] Christoph Feichtenhofer, Axel Pinz, and Richard Wildes. Spatiotemporal residual networks for video action recognition. In NIPS, 2016.
  • [11] Deepti Ghadiyaram, Du Tran, and Dhruv Mahajan. Large-scale weakly-supervised pre-training for video action recognition. In CVPR, 2019.
  • [12] Rohit Girdhar, Joao Carreira, Carl Doersch, and Andrew Zisserman. Video action transformer network. In CVPR, 2019.
  • [13] Priya Goyal, Piotr Dollar, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
  • [14] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The “Something Something” video database for learning and evaluating visual common sense. In ICCV, 2017.
  • [15] Juncai He and Jinchao Xu. MgNet: A unified framework of multigrid and convolutional neural network. Science China Mathematics, 2019.
  • [16] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
  • [17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. PAMI, 2015.
  • [18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [19] Noureldien Hussein, Efstratios Gavves, and Arnold WM Smeulders. Timeception for complex action recognition. In CVPR, 2019.
  • [20] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • [21] Boyuan Jiang, MengMeng Wang, Weihao Gan, Wei Wu, and Junjie Yan. STM: Spatiotemporal and motion encoding for action recognition. In ICCV, 2019.
  • [22] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In ISCA, 2017.
  • [23] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
  • [24] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [25] Bruno Korbar, Du Tran, and Lorenzo Torresani. SCSampler: Sampling salient clips from video for efficient action recognition. In ICCV, 2019.
  • [26] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • [27] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
  • [28] Myunggi Lee, Seungeui Lee, Sungjoon Son, Gyutae Park, and Nojun Kwak. Motion feature network: Fixed motion filter for action recognition. In ECCV, 2018.
  • [29] Ji Lin, Chuang Gan, and Song Han. Temporal shift module for efficient video understanding. In ICCV, 2019.
  • [30] Chenxu Luo and Alan L Yuille. Grouped spatial-temporal aggregation for efficient action recognition. In ICCV, 2019.
  • [31] Brais Martinez, Davide Modolo, Yuanjun Xiong, and Joseph Tighe. Action recognition with spatial-temporal discriminative filter banks. In ICCV, 2019.
  • [32] AJ Piergiovanni, Anelia Angelova, Alexander Toshev, and Michael S Ryoo. Evolving space-time neural architectures for videos. In ICCV, 2019.
  • [33] Siyuan Qiao, Zhe Lin, Jianming Zhang, and Alan L Yuille. Neural rejuvenation: Improving deep network training by enhancing computational resource utilization. In CVPR, 2019.
  • [34] Zhaofan Qiu, Ting Yao, and Tao Mei. Learning spatiotemporal representation with pseudo-3D residual networks. In ICCV, 2017.
  • [35] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
  • [36] Gunnar A Sigurdsson, Gul Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In ECCV, 2016.
  • [37] Patrice Y Simard, David Steinkraus, and John C Platt. Best practices for convolutional neural networks applied to visual document analysis. In ICDAR, 2003.
  • [38] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [39] Samuel L Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V Le. Don’t decay the learning rate, increase the batch size. In ICLR, 2018.
  • [40] Lin Sun, Kui Jia, Dit-Yan Yeung, and Bertram E Shi. Human action recognition using factorized spatio-temporal convolutional networks. In ICCV, 2015.
  • [41] Mingxing Tan and Quoc V Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.
  • [42] Hugo Touvron, Andrea Vedaldi, Matthijs Douze, and Herve Jegou. Fixing the train-test resolution discrepancy. In NeurIPS, 2019.
  • [43] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
  • [44] Du Tran, Heng Wang, Lorenzo Torresani, and Matt Feiszli. Video classification with channel-separated convolutional networks. In ICCV, 2019.
  • [45] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 2018.
  • [46] Limin Wang, Wei Li, Wen Li, and Luc Van Gool. Appearance-and-relation networks for video classification. In CVPR, 2018.
  • [47] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.
  • [48] Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krahenbuhl, and Ross Girshick. Long-term feature banks for detailed video understanding. In CVPR, 2019.
  • [49] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In ECCV, 2018.
  • [50] Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. ImageNet training in minutes. In ICPP, 2018.
  • [51] Mohammadreza Zolfaghari, Kamaljeet Singh, and Thomas Brox. ECO: Efficient convolutional network for online video understanding. In ECCV, 2018.