Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning

AAAI, pp. 4278-4284, 2017.

Keywords:
inception network; hybrid inception version; convolutional network; inception architecture; deep convolutional network

Abstract:

Very deep convolutional networks have been central to the largest advances in image recognition performance in recent years. One example is the Inception architecture that has been shown to achieve very good performance at relatively low computational cost. Recently, the introduction of residual connections in conjunction with a more trad…

Introduction
Highlights
  • Object recognition is a central task for computer vision and artificial intelligence in general
  • Before 2012, specialized solutions were required for each specific application domain
  • In this work we study the combination of two of the most recent ideas: Residual connections (He et al 2015) and the latest revised version of the Inception architecture (Szegedy et al 2015b)
  • We studied how the introduction of residual connections leads to dramatically improved training speed for the Inception architecture
  • Our latest models outperform all our previous networks, just by virtue of the increased model size, while keeping the overall number of parameters and computational cost in check compared to competing approaches
Methods
  • The authors have trained the networks with stochastic gradient descent, using the TensorFlow (Abadi et al 2015) distributed machine learning system with 20 replicas, each running on an NVidia Kepler GPU.
  • Gradient clipping (Pascanu, Mikolov, and Bengio 2012) was found to be useful to stabilize the training.
  • Model evaluations are performed using a running average of the parameters computed over time
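The two training tricks listed above can be sketched in a few lines. The following is an illustrative numpy sketch, not the authors' code: the function and class names are ours, and it assumes the standard formulations of global-norm gradient clipping and an exponential moving average of the parameters for evaluation.

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    # Scale all gradients down together if their joint L2 norm exceeds
    # clip_norm (Pascanu, Mikolov, and Bengio 2012); otherwise leave them.
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, clip_norm / (global_norm + 1e-12))
    return [g * scale for g in grads]

class ParameterAverage:
    # Running (exponential moving) average of the parameters, used at
    # evaluation time in place of the raw weights.
    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.avg = [p.copy() for p in params]

    def update(self, params):
        for a, p in zip(self.avg, params):
            a *= self.decay
            a += (1.0 - self.decay) * p
```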
Results
  • First, the authors observe the evolution of the top-1 and top-5 validation errors of the four variants during training.
  • The difference is about 0.3% for top-1 error and about 0.15% for the top-5 error.
  • Since the differences are consistent, the authors think the comparison between the curves is a fair one.
  • The authors have rerun the multi-crop and ensemble results on the complete validation set consisting of 50,000 images.
  • The final ensemble was evaluated on the test set and the result submitted to the ILSVRC test server
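Multi-crop and ensemble evaluation of this kind amounts to averaging the per-crop class probabilities within each model and then across models before taking the argmax. A minimal numpy sketch under that assumption (not the authors' code; names are illustrative):

```python
import numpy as np

def ensemble_predict(per_model_probs):
    # per_model_probs: list of (num_crops, num_classes) softmax outputs,
    # one array per model in the ensemble.
    per_model_means = [p.mean(axis=0) for p in per_model_probs]  # average over crops
    mean_probs = np.mean(per_model_means, axis=0)                # average over models
    return int(np.argmax(mean_probs))
```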
Conclusion

  • The authors studied how the introduction of residual connections leads to dramatically improved training speed for the Inception architecture.
  • The authors' latest models outperform all the previous networks, just by virtue of the increased model size, while keeping the overall number of parameters and computational cost in check compared to competing approaches
Summary
  • Introduction:

    Object recognition is a central task for computer vision and artificial intelligence in general.
  • Convolutional neural networks go back to the 1980s (Fukushima 1980; LeCun et al 1989), but recent strong results (Krizhevsky, Sutskever, and Hinton 2012) on the large-scale ImageNet image-recognition benchmark ILSVRC (Russakovsky et al 2014) have led to renewed interest in their use.
  • The same neural network architecture “AlexNet” (Krizhevsky, Sutskever, and Hinton 2012) has been applied to a large number of application domains with good results
Tables
  • Table1: Single-crop – single-model experimental results. Reported on the non-blacklisted subset of the validation set of ILSVRC 2012
  • Table2: Table 2
  • Table3: Table 3
  • Table4: Ensemble results with 144-crop/dense evaluation. Reported on all 50,000 images of the validation set of ILSVRC 2012. The second column (N) denotes how many models were ensembled. For Inception-v4(+Residual), the ensemble consists of one pure Inception-v4 and three Inception-ResNet-v2 models, evaluated on both the validation and the test set. The test-set performance was 3.08% top-5 error, verifying that the models do not over-fit the validation set
Related work
  • Convolutional networks have become popular in large scale image recognition tasks after (Krizhevsky, Sutskever, and Hinton 2012). Some of the next important milestones were Network-in-network by (Lin, Chen, and Yan 2013), VGGNet by (Simonyan and Zisserman 2014) and GoogLeNet (Inception-v1) by (Szegedy et al 2015a).

    Residual connections were introduced in (He et al 2015), which gives convincing theoretical and practical evidence for the advantages of additive merging of signals, both for image recognition and especially for object detection. The authors argued that residual connections are inherently necessary for training very deep convolutional models. Our findings do not seem to support this view, at least for image recognition. However, it might require more experiments with even deeper networks to fully understand the true benefits of residual connections. In the experimental section we demonstrate that it is not very difficult to train competitive very deep networks without residual connections. However, the use of residual connections seems to improve training speed greatly, which alone is a strong argument for their use.
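The additive merging discussed above reduces to adding the block's input back onto the output of a learned transformation. A schematic numpy sketch, where the callable `f` stands in for an arbitrary learned branch (an Inception module, a stack of convolutions, etc.):

```python
import numpy as np

def residual_block(x, f):
    # Additive merging of signals (He et al 2015): the identity path is
    # added to a learned transformation of the input, so gradients can
    # flow unchanged through the shortcut in very deep stacks.
    return x + f(x)
```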
Contributions
  • Presents several new streamlined architectures for both residual and non-residual Inception networks
  • Demonstrates how proper activation scaling stabilizes the training of very wide residual Inception networks
  • Achieves 3.08% top-5 error on the test set of the ImageNet classification challenge
  • Studies the combination of two of the most recent ideas: Residual connections and the latest revised version of the Inception architecture
  • Has studied whether Inception without residual connections can be made more efficient by making it deeper and wider
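The activation scaling mentioned above amounts to multiplying the residual branch by a small constant before the addition; the paper reports that constants roughly between 0.1 and 0.3 stabilized training of the very wide variants. An illustrative sketch (the function name and toy branch are ours, not the paper's code):

```python
import numpy as np

def scaled_residual_block(x, f, scale=0.1):
    # Scale down the residual branch before adding it to the identity
    # path; this keeps activations from blowing up early in training
    # of very wide residual Inception variants.
    return x + scale * f(x)
```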
Reference
  • Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G. S.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Goodfellow, I.; Harp, A.; Irving, G.; Isard, M.; Jia, Y.; Jozefowicz, R.; Kaiser, L.; Kudlur, M.; Levenberg, J.; Mane, D.; Monga, R.; Moore, S.; Murray, D.; Olah, C.; Schuster, M.; Shlens, J.; Steiner, B.; Sutskever, I.; Talwar, K.; Tucker, P.; Vanhoucke, V.; Vasudevan, V.; Viegas, F.; Vinyals, O.; Warden, P.; Wattenberg, M.; Wicke, M.; Yu, Y.; and Zheng, X. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
  • Dean, J.; Corrado, G.; Monga, R.; Chen, K.; Devin, M.; Mao, M.; Senior, A.; Tucker, P.; Yang, K.; Le, Q. V.; et al. 2012. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, 1223–1231.
  • Dong, C.; Loy, C. C.; He, K.; and Tang, X. 2014. Learning a deep convolutional network for image super-resolution. In Computer Vision–ECCV 2014. Springer. 184–199.
  • Fukushima, K. 1980. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics 36(4):193–202.
  • Girshick, R.; Donahue, J.; Darrell, T.; and Malik, J. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.
  • Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, 448–456.
  • Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; and Fei-Fei, L. 2014. Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, 1725–1732. IEEE.
  • Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097–1105.
  • LeCun, Y.; Boser, B.; Denker, J. S.; Henderson, D.; Howard, R. E.; Hubbard, W.; and Jackel, L. D. 1989. Backpropagation applied to handwritten zip code recognition. Neural Computation 1(4):541–551.
  • Lin, M.; Chen, Q.; and Yan, S. 2013. Network in network. arXiv preprint arXiv:1312.4400.
  • Long, J.; Shelhamer, E.; and Darrell, T. 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3431–3440.
  • Pascanu, R.; Mikolov, T.; and Bengio, Y. 2012. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063.
  • Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2014. ImageNet large scale visual recognition challenge.
  • Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • Sutskever, I.; Martens, J.; Dahl, G.; and Hinton, G. 2013. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, 1139–1147. JMLR Workshop and Conference Proceedings.
  • Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2015a. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1–9.
  • Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2015b. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567.
  • Tieleman, T., and Hinton, G. 2012. Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4. Accessed: 2015-11-05.
  • Toshev, A., and Szegedy, C. 2014. DeepPose: Human pose estimation via deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, 1653–1660. IEEE.
  • Wang, N., and Yeung, D.-Y. 2013. Learning a deep compact image representation for visual tracking. In Advances in Neural Information Processing Systems, 809–817.