Aggregated Residual Transformations for Deep Neural Networks

CVPR, 2017.


Abstract:

We present a simple, highly modularized network architecture for image classification. Our network is constructed by repeating a building block that aggregates a set of transformations with the same topology. Our simple design results in a homogeneous, multi-branch architecture that has only a few hyper-parameters to set. This strategy exposes a new dimension, which we call "cardinality" (the size of the set of transformations), as an essential factor in addition to the dimensions of depth and width. On the ImageNet-1K dataset, we empirically show that even under the restricted condition of maintaining complexity, increasing cardinality is able to improve classification accuracy. Moreover, increasing cardinality is more effective than going deeper or wider when we increase the capacity. Our models, named ResNeXt, are the foundations of our entry to the ILSVRC 2016 classification task in which we secured 2nd place. We further investigate ResNeXt on an ImageNet-5K set and the COCO detection set, also showing better results than its ResNet counterpart. The code and models are publicly available online.


Introduction
  • Research on visual recognition is undergoing a transition from “feature engineering” to “network engineering” [25, 24, 44, 34, 36, 38, 14].
  • The VGG-nets [36] exhibit a simple yet effective strategy of constructing very deep networks: stacking building blocks of the same shape.
  • This strategy is inherited by ResNets [14] which stack modules of the same topology.
  • This simple rule reduces the free choices of hyperparameters, and depth is exposed as an essential dimension in neural networks.
  • The robustness of VGGnets and ResNets has been proven by various visual recognition tasks [7, 10, 9, 28, 31, 14] and by non-visual tasks involving speech [42, 30] and language [4, 41, 20]
Highlights
  • Research on visual recognition is undergoing a transition from “feature engineering” to “network engineering” [25, 24, 44, 34, 36, 38, 14]
  • In contrast to traditional hand-designed features (e.g., SIFT [29] and HOG [5]), features learned by neural networks from large-scale data [33] require minimal human involvement during training, and can be transferred to a variety of recognition tasks [7, 10, 28]
  • The VGG-nets [36] exhibit a simple yet effective strategy of constructing very deep networks: stacking building blocks of the same shape. This strategy is inherited by ResNets [14] which stack modules of the same topology. This simple rule reduces the free choices of hyperparameters, and depth is exposed as an essential dimension in neural networks
  • The robustness of VGGnets and ResNets has been proven by various visual recognition tasks [7, 10, 9, 28, 31, 14] and by non-visual tasks involving speech [42, 30] and language [4, 41, 20]
  • This paper further evaluates ResNeXt on a larger ImageNet-5K set and the COCO object detection dataset [27], showing consistently better accuracy than its ResNet counterparts
  • We note that while we present reformulations that exhibit concatenation (Fig. 3(b)) or grouped convolutions (Fig. 3(c)), such reformulations are not always applicable to the general form of Eqn. (3), e.g., if the transformation Ti takes arbitrary forms and is heterogeneous
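
For reference, the aggregated-transformation form the bullets above call Eqn. (3) builds a residual function by summing C transformations of the same topology, where C is the cardinality. A minimal LaTeX rendering, following the paper's notation, is:

    y = x + \sum_{i=1}^{C} \mathcal{T}_i(x)   % Eqn. (3): each \mathcal{T}_i embeds x into a low-dimensional space and transforms it

When every T_i has the same bottleneck-shaped topology, this sum can be rewritten as the concatenation of Fig. 3(b) or the grouped convolution of Fig. 3(c); when the T_i are heterogeneous, only the general summed form applies.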
Methods
  • The authors adopt a highly modularized design following VGG/ResNets.
  • The authors' network consists of a stack of residual blocks that share the same topology, specified by a stage-by-stage template (Table 1) for ResNet-50 and ResNeXt-50 (32×4d).
  • In both templates, conv1 is a 7×7, 64 convolution with stride 2 (112×112 output), followed by a 3×3 max pool with stride 2; stages conv2–conv5 then stack bottleneck blocks, e.g. [1×1, 64; 3×3, 64; 1×1, 256] ×3 for the ResNet-50 conv2 stage. A sketch of the corresponding ResNeXt block follows this list.
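
To make the 32×4d template concrete, here is a minimal PyTorch-style sketch of one ResNeXt bottleneck block using the grouped-convolution reformulation of Fig. 3(c). The layer widths follow the conv2 stage of Table 1 (256-d input/output, 32 groups of width 4, i.e. 128 grouped channels); the class name and defaults are illustrative assumptions, not the authors' released Torch implementation.

    # Illustrative sketch only: a ResNeXt bottleneck block (32x4d) via grouped convolution.
    import torch
    import torch.nn as nn

    class ResNeXtBottleneck(nn.Module):
        def __init__(self, in_ch=256, out_ch=256, cardinality=32, bottleneck_width=4, stride=1):
            super().__init__()
            group_width = cardinality * bottleneck_width          # 32 * 4 = 128
            self.conv1 = nn.Conv2d(in_ch, group_width, 1, bias=False)
            self.bn1 = nn.BatchNorm2d(group_width)
            # 3x3 grouped convolution: each of the 32 groups convolves only its own 4 channels
            self.conv2 = nn.Conv2d(group_width, group_width, 3, stride=stride,
                                   padding=1, groups=cardinality, bias=False)
            self.bn2 = nn.BatchNorm2d(group_width)
            self.conv3 = nn.Conv2d(group_width, out_ch, 1, bias=False)
            self.bn3 = nn.BatchNorm2d(out_ch)
            self.relu = nn.ReLU(inplace=True)
            # Projection shortcut when the shape changes, identity otherwise
            if stride != 1 or in_ch != out_ch:
                self.shortcut = nn.Sequential(
                    nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                    nn.BatchNorm2d(out_ch))
            else:
                self.shortcut = nn.Identity()

        def forward(self, x):
            out = self.relu(self.bn1(self.conv1(x)))
            out = self.relu(self.bn2(self.conv2(out)))
            out = self.bn3(self.conv3(out))
            return self.relu(out + self.shortcut(x))

    # Example: a conv2-stage block on a 56x56 feature map keeps the 256-d output shape.
    y = ResNeXtBottleneck()(torch.randn(1, 256, 56, 56))   # torch.Size([1, 256, 56, 56])

Stacking such blocks per the Table 1 template (3, 4, 6, 3 blocks for conv2–conv5, doubling the grouped width and output channels at each stage) yields a ResNeXt-50-style network.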
Results
  • The authors emphasize that while it is relatively easy to increase accuracy by increasing capacity, methods that increase accuracy while maintaining complexity are rare in the literature.
  • The authors' experiments show that their models improve accuracy while maintaining model complexity and the number of parameters.
  • Comparing with ResNet-50 (Table 3 top and Fig. 5 left), the 32×4d ResNeXt-50 has a validation error of 22.2%, which is 1.7% lower than the ResNet baseline’s 23.9%.
  • The authors' larger method achieves 3.58% test error on CIFAR-10 and 17.31% on CIFAR-100
Conclusion
  • The authors note that while they present reformulations that exhibit concatenation (Fig. 3(b)) or grouped convolutions (Fig. 3(c)), such reformulations are not always applicable to the general form of Eqn. (3), e.g., if the transformation Ti takes arbitrary forms and is heterogeneous.
  • The authors choose to use homogeneous forms in this paper because they are simpler and extensible.
  • Under this simplified case, grouped convolutions in the form of Fig. 3(c) are helpful for easing implementation.
  • The authors choose to adjust the width of the bottleneck (e.g., 4-d in Fig. 1) when changing the cardinality C, because the bottleneck width can be isolated from the input and output of the block.
  • This strategy introduces no change to other hyper-parameters, so it is helpful for focusing on the impact of cardinality
Tables
  • Table1: (Left) ResNet-50. (Right) ResNeXt-50 with a 32×4d template (using the reformulation in Fig. 3(c)). Inside the brackets are the shape of a residual block, and outside the brackets is the number of stacked blocks on a stage. “C=32” suggests grouped convolutions [24] with 32 groups. The numbers of parameters and FLOPs are similar between these two models
  • Table2: Relations between cardinality and width (for the template of conv2), with roughly preserved complexity on a residual block. The number of parameters is ∼70k for the template of conv2. The number of FLOPs is ∼0.22 billion (≈ # params×56×56 for conv2); a worked parameter count is sketched after this list of captions
  • Table3: Ablation experiments on ImageNet-1K. (Top): ResNet50 with preserved complexity (∼4.1 billion FLOPs); (Bottom): ResNet-101 with preserved complexity (∼7.8 billion FLOPs). The error rate is evaluated on the single crop of 224×224 pixels
  • Table4: Comparisons on ImageNet-1K when the number of FLOPs is increased to 2× of ResNet-101’s. The error rate is evaluated on the single crop of 224×224 pixels. The highlighted factors are the factors that increase complexity
  • Table5: State-of-the-art models on the ImageNet-1K validation set (single-crop testing). The test size of ResNet/ResNeXt is 224×224 and 320×320 as in [15], and of the Inception models is 299×299
  • Table6: Error (%) on ImageNet-5K. The models are trained on ImageNet-5K and tested on the ImageNet-1K val set, treated as a 5K-way classification task or a 1K-way classification task at test time. ResNeXt and its ResNet counterpart have similar complexity. The error is evaluated on the single crop of 224×224 pixels
  • Table7: Test error (%) and model size on CIFAR. Our results are the average of 10 runs
  • Table8: Object detection results on the COCO minival set. ResNeXt and its ResNet counterpart have similar complexity
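
As a quick sanity check of Table 2's "roughly preserved complexity", the per-block parameter count the paper uses for the conv2 template is approximately C·(256·d + 3·3·d·d + d·256), for cardinality C and bottleneck width d (256-d block input/output, biases and BatchNorm ignored). The short Python sketch below, an illustration rather than the authors' code, evaluates it for the cardinality/width pairs listed in Table 2.

    # Rough parameter count for one conv2-stage block with cardinality C and bottleneck width d.
    def conv2_block_params(C, d):
        return C * (256 * d + 3 * 3 * d * d + d * 256)

    for C, d in [(1, 64), (2, 40), (4, 24), (8, 14), (32, 4)]:
        print(f"C={C:2d}, d={d:2d}: ~{conv2_block_params(C, d):,} parameters")
    # Every setting lands near ~70k parameters, i.e. ~0.22 billion FLOPs at 56x56
    # resolution (params * 56 * 56), which is what Table 2 means by preserved complexity.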
Related work
  • Multi-branch convolutional networks. The Inception models [38, 17, 39, 37] are successful multi-branch architectures where each branch is carefully customized. ResNets [14] can be thought of as two-branch networks where one branch is the identity mapping. Deep neural decision forests [22] are tree-patterned multi-branch networks with learned splitting functions.

    Grouped convolutions. The use of grouped convolutions dates back to the AlexNet paper [24], if not earlier. The motivation given by Krizhevsky et al. [24] is for distributing the model over two GPUs. Grouped convolutions are supported by Caffe [19], Torch [3], and other libraries, mainly for compatibility with AlexNet. To the best of our knowledge, there has been little evidence of exploiting grouped convolutions to improve accuracy. A special case of grouped convolutions is channel-wise convolutions, in which the number of groups is equal to the number of channels. Channel-wise convolutions are part of the separable convolutions in [35].
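
To illustrate the grouped and channel-wise cases mentioned above (an assumption-level sketch, not code from any of the cited papers), the groups argument of a standard convolution layer covers all three regimes:

    # Parameter counts for a 3x3, 128->128 convolution under different group settings.
    import torch.nn as nn

    def n_weights(m):
        return sum(p.numel() for p in m.parameters())

    dense     = nn.Conv2d(128, 128, 3, padding=1, groups=1,   bias=False)   # ordinary convolution
    grouped   = nn.Conv2d(128, 128, 3, padding=1, groups=32,  bias=False)   # ResNeXt-style grouped conv
    depthwise = nn.Conv2d(128, 128, 3, padding=1, groups=128, bias=False)   # channel-wise (depthwise) case
    print(n_weights(dense), n_weights(grouped), n_weights(depthwise))       # 147456, 4608, 1152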
Funding
  • S.X. and Z.T.’s research was partly supported by NSF IIS-1618477
Reference
  • [1] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In CVPR, 2016.
  • [2] G. Cantor. Über unendliche, lineare Punktmannichfaltigkeiten, Arbeiten zur Mengenlehre aus den Jahren 1872-1884. 1884.
  • [3] R. Collobert, S. Bengio, and J. Mariethoz. Torch: a modular machine learning software library. Technical report, Idiap, 2002.
  • [4] A. Conneau, H. Schwenk, L. Barrault, and Y. Lecun. Very deep convolutional networks for natural language processing. arXiv:1606.01781, 2016.
  • [5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
  • [6] E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 2014.
  • [7] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
  • [8] D. Eigen, J. Rolfe, R. Fergus, and Y. LeCun. Understanding deep architectures using a recursive convolutional network. arXiv:1312.1847, 2013.
  • [9] R. Girshick. Fast R-CNN. In ICCV, 2015.
  • [10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  • [11] S. Gross and M. Wilber. Training and investigating Residual Nets. https://github.com/facebook/fb.resnet.torch, 2016.
  • [12] K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask R-CNN. arXiv:1703.06870, 2017.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.
  • [16] Y. Ioannou, D. Robertson, R. Cipolla, and A. Criminisi. Deep roots: Improving CNN efficiency with hierarchical filter groups. arXiv:1605.06489, 2016.
  • [17] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • [18] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. In BMVC, 2014.
  • [19] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.
  • [20] N. Kalchbrenner, L. Espeholt, K. Simonyan, A. v. d. Oord, A. Graves, and K. Kavukcuoglu. Neural machine translation in linear time. arXiv:1610.10099, 2016.
  • [21] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. In ICLR, 2016.
  • [22] P. Kontschieder, M. Fiterau, A. Criminisi, and S. R. Bulo. Deep convolutional neural decision forests. In ICCV, 2015.
  • [23] A. Krizhevsky. Learning multiple layers of features from tiny images. Tech Report, 2009.
  • [24] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • [25] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
  • [26] M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR, 2014.
  • [27] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • [28] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  • [29] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
  • [30] A. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv:1609.03499, 2016.
  • [31] P. O. Pinheiro, R. Collobert, and P. Dollar. Learning to segment object candidates. In NIPS, 2015.
  • [32] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • [33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
  • [34] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
  • [35] L. Sifre and S. Mallat. Rigid-motion scattering for texture classification. arXiv:1403.1687, 2014.
  • [36] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [37] C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In ICLR Workshop, 2016.
  • [38] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  • [39] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. In CVPR, 2016.
  • [40] A. Veit, M. Wilber, and S. Belongie. Residual networks behave like ensembles of relatively shallow networks. In NIPS, 2016.
  • [41] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144, 2016.
  • [42] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig. The Microsoft 2016 Conversational Speech Recognition System. arXiv:1609.03528, 2016.
  • [43] S. Zagoruyko and N. Komodakis. Wide residual networks. In BMVC, 2016.
  • [44] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.