Rethinking the Inception Architecture for Computer Vision

CVPR, 2016.

Keywords:
computer vision, deep convolutional network, high performance, classification challenge, quality gain

Abstract:

Convolutional networks are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014 very deep convolutional networks started to become mainstream, yielding substantial gains in various benchmarks. Although increased model size and computational cost tend to translate to immediate quality gains for most tasks (as long as enough labeled data is provided for training), computational efficiency and low parameter count are still enabling factors for various use cases such as mobile vision and big-data scenarios. Here we explore ways to scale up networks that aim at utilizing the added computation as efficiently as possible through suitably factorized convolutions and aggressive regularization. We benchmark our methods on the ILSVRC 2012 classification challenge validation set and demonstrate substantial gains over the state of the art: 21.2% top-1 and 5.6% top-5 error for single-frame evaluation using a network with a computational cost of 5 billion multiply-adds per inference and less than 25 million parameters. With an ensemble of 4 models and multi-crop evaluation, we report 3.5% top-5 error and 17.3% top-1 error.
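
The factorized convolutions mentioned in the abstract can be made concrete with a short sketch. Below is a minimal Keras illustration (function names, channel counts, and the input grid size are illustrative assumptions, not the paper's exact configuration) of the two factorizations the paper studies: replacing a 5×5 convolution with two stacked 3×3 convolutions, and replacing an n×n convolution with a 1×n convolution followed by an n×1 convolution:

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv5x5_factorized(x, filters):
    """Replace one 5x5 convolution with two stacked 3x3 convolutions:
    same 5x5 receptive field at roughly 28% lower cost when the
    channel counts match (18 vs. 25 weights per input/output pair)."""
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

def conv_nxn_factorized(x, filters, n=7):
    """Replace an nxn convolution with a 1xn followed by an nx1
    convolution, cutting cost from n^2 to 2n weights per position."""
    x = layers.Conv2D(filters, (1, n), padding="same", activation="relu")(x)
    return layers.Conv2D(filters, (n, 1), padding="same", activation="relu")(x)

# Example usage on a feature map (grid size and channels are illustrative).
inputs = tf.keras.Input(shape=(35, 35, 288))
x = conv5x5_factorized(inputs, 64)
x = conv_nxn_factorized(x, 64, n=7)
model = tf.keras.Model(inputs, x)
```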

Introduction
  • Since the 2012 ImageNet competition [16] winning entry by Krizhevsky et al. [9], their network “AlexNet” has been successfully applied to a wide variety of computer vision tasks, for example object detection [5], segmentation [12], human pose estimation [22], video classification [8], object tracking [23], and super-resolution [3].
  • These successes spurred a new line of research that focused on finding higher performing convolutional neural networks.
  • Improvements in network quality opened new application domains for convolutional networks in cases where AlexNet features could not compete with hand-engineered solutions, e.g. proposal generation in detection [4].
Highlights
  • Since the 2012 ImageNet competition [16] winning entry by Krizhevsky et al. [9], their network “AlexNet” has been successfully applied to a wide variety of computer vision tasks, for example object detection [5], segmentation [12], human pose estimation [22], video classification [8], object tracking [23], and super-resolution [3]. These successes spurred a new line of research focused on finding higher performing convolutional neural networks
  • One interesting observation was that gains in the classification performance tend to transfer to significant quality gains in a wide variety of application domains
  • This means that architectural improvements in deep convolutional networks can be utilized to improve performance in most other computer vision tasks that are increasingly reliant on high-quality, learned visual features
  • We have provided several design principles to scale up convolutional networks and studied them in the context of the Inception architecture
  • This guidance can lead to high performance vision networks that have a relatively modest computation cost compared to simpler, more monolithic architectures
Methods
  • The authors trained the networks with stochastic gradient descent using the TensorFlow [1] distributed machine learning system, with 50 replicas each running on an NVIDIA Kepler GPU, a batch size of 32, and 100 epochs.
  • The authors used a learning rate of 0.045, decayed every two epochs with an exponential decay rate of 0.94.
  • Gradient clipping [14] with threshold 2.0 was found useful for stabilizing the training.
  • Model evaluations are performed using a running average of the parameters computed over time (see the configuration sketch below).
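
A minimal sketch of this training configuration in TensorFlow's modern Keras API (the original work used an early TensorFlow version; the EMA decay value and the use of clip-by-norm rather than clip-by-value are assumptions here):

```python
import tensorflow as tf

# Batch size from the Methods section; the ILSVRC 2012 training-set
# size (1,281,167 images) is an assumption used to derive epoch length.
BATCH_SIZE = 32
STEPS_PER_EPOCH = 1281167 // BATCH_SIZE

# Learning rate 0.045, decayed every two epochs by a factor of 0.94.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.045,
    decay_steps=2 * STEPS_PER_EPOCH,
    decay_rate=0.94,
    staircase=True,
)

# Stochastic gradient descent with gradient clipping at threshold 2.0;
# the weight EMA mirrors the "running average of the parameters" used
# for evaluation (the 0.9999 decay is an assumption, not from the text).
optimizer = tf.keras.optimizers.SGD(
    learning_rate=lr_schedule,
    clipnorm=2.0,
    use_ema=True,
    ema_momentum=0.9999,
)
```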
Results
  • Table 3 shows the experimental results on the recognition performance of the proposed architecture (Inception-v2), as described in Section 6.
  • Each Inception-v2 line shows the result of the cumulative changes, including the highlighted new modification plus all the earlier ones.
  • Label Smoothing refers to the method described in Section 7 (see the sketch after this list).
  • The authors refer to the model in the last row of Table 3 as Inception-v3 and evaluate its performance in the multi-crop and ensemble settings.
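
Label smoothing replaces the one-hot ground-truth distribution with a mixture of the original label and a uniform distribution over the K classes, q'(k) = (1 − ε)·δ(k, y) + ε/K; the paper uses ε = 0.1 with K = 1000. A minimal TensorFlow sketch (the function name is illustrative):

```python
import tensorflow as tf

def smooth_labels(labels, num_classes=1000, epsilon=0.1):
    """Label-smoothing regularization: q'(k) = (1 - eps) * one_hot + eps / K."""
    one_hot = tf.one_hot(labels, depth=num_classes)
    return one_hot * (1.0 - epsilon) + epsilon / num_classes

# Keras' built-in cross-entropy loss also supports this directly:
loss_fn = tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1)
```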
Conclusion
  • The authors have provided several design principles to scale up convolutional networks and studied them in the context of the Inception architecture.
  • The authors' highest quality version of Inception-v2 reaches 21.2% top-1 and 5.6% top-5 error for single-crop evaluation on the ILSVRC 2012 classification benchmark, setting a new state of the art.
  • This is achieved with a relatively modest (2.5×) increase in computational cost compared to the network described in Ioffe et al. [7].
  • The combination of lower parameter count and additional regularization with batch-normalized auxiliary classifiers and label smoothing allows for training high-quality networks on relatively modest-sized training sets (a sketch of such an auxiliary head follows below).
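
As a sketch of the batch-normalized auxiliary classifier idea: a small side branch attached to an intermediate feature map contributes an extra softmax loss during training and acts as a regularizer. The pooling and filter sizes below follow the GoogLeNet-style head and are assumptions, not the paper's exact dimensions:

```python
import tensorflow as tf
from tensorflow.keras import layers

def aux_classifier(features, num_classes=1000):
    """Auxiliary classifier branch with batch normalization, attached to
    an intermediate feature map and trained with its own softmax loss."""
    x = layers.AveragePooling2D(pool_size=5, strides=3)(features)
    x = layers.Conv2D(128, 1, use_bias=False)(x)
    x = layers.BatchNormalization()(x)  # the "batch-normalized" part
    x = layers.Activation("relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(1024, use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    return layers.Dense(num_classes, activation="softmax")(x)
```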
Tables
  • Table 1: The outline of the proposed network architecture. The output size of each module is the input size of the next one. We use variations of the reduction technique depicted in Figure 10 of the paper to reduce the grid size between the Inception blocks whenever applicable
  • Table 2: Comparison of recognition performance when the size of the receptive field varies, but the computational cost is constant
  • Table 3: Single-crop experimental results comparing the cumulative effects of the various contributing factors. We compare our numbers with the best published single-crop inference for Ioffe et al. [7]. For the “Inception-v2” lines, the changes are cumulative and each subsequent line includes the new change in addition to the previous ones. The last line, including all the changes, is what we refer to as “Inception-v3” below. Unfortunately, He et al. [6] report only 10-crop evaluation results, not single-crop results, which are reported in Table 4 below
  • Table 4: Single-model, multi-crop experimental results comparing the cumulative effects of the various contributing factors. We compare our numbers with the best published single-model inference results on the ILSVRC 2012 classification benchmark
  • Table 5: Ensemble evaluation results comparing multi-model, multi-crop reported results. Our numbers are compared with the best published ensemble inference results on the ILSVRC 2012 classification benchmark. ∗All reported results except the top-5 ensemble result are on the validation set. The ensemble yielded 3.46% top-5 error on the validation set
Funding
  • We evaluated all 50,000 examples as well, and the results were roughly 0.1% worse in top-5 error and around 0.2% worse in top-1 error
Reference
  • M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
  • W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
  • C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In Computer Vision–ECCV 2014, pages 184–199. Springer, 2014.
  • D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 2155–2162. IEEE, 2014.
  • R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  • K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. arXiv preprint arXiv:1502.01852, 2015.
  • S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, pages 448–456, 2015.
  • A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1725–1732. IEEE, 2014.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
  • A. Lavin. Fast algorithms for convolutional neural networks. arXiv preprint arXiv:1509.09308, 2015.
  • C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. arXiv preprint arXiv:1409.5185, 2014.
  • J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
  • Y. Movshovitz-Attias, Q. Yu, M. C. Stumpe, V. Shet, S. Arnoud, and L. Yatziv. Ontological supervision for fine grained classification of street view storefronts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1693–1702, 2015.
  • R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063, 2012.
  • D. C. Psichogios and L. H. Ungar. SVD-NET: An algorithm that automatically selects network structure. IEEE Transactions on Neural Networks, 5(3):513–515, 1993.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. 2014.
  • F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. arXiv preprint arXiv:1503.03832, 2015.
  • K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 1139–1147. JMLR Workshop and Conference Proceedings, May 2013.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
  • T. Tieleman and G. Hinton. Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012. Accessed: 2015-11-05.
  • A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1653–1660. IEEE, 2014.
  • N. Wang and D.-Y. Yeung. Learning a deep compact image representation for visual tracking. In Advances in Neural Information Processing Systems, pages 809–817, 2013.