# Aggregated Residual Transformations for Deep Neural Networks

CVPR, 2017.

Abstract:

We present a simple, highly modularized network architecture for image classification. Our network is constructed by repeating a building block that aggregates a set of transformations with the same topology. Our simple design results in a homogeneous, multi-branch architecture that has only a few hyper-parameters to set. This strategy exposes a new dimension, which we call "cardinality" (the size of the set of transformations), as an essential factor in addition to the dimensions of depth and width.


Introduction

- Research on visual recognition is undergoing a transition from “feature engineering” to “network engineering” [25, 24, 44, 34, 36, 38, 14].
- The VGG-nets [36] exhibit a simple yet effective strategy of constructing very deep networks: stacking building blocks of the same shape.
- This strategy is inherited by ResNets [14] which stack modules of the same topology.
- This simple rule reduces the free choices of hyperparameters, and depth is exposed as an essential dimension in neural networks.
- The robustness of VGGnets and ResNets has been proven by various visual recognition tasks [7, 10, 9, 28, 31, 14] and by non-visual tasks involving speech [42, 30] and language [4, 41, 20]

Highlights

- In contrast to traditional hand-designed features (e.g., SIFT [29] and HOG [5]), features learned by neural networks from large-scale data [33] require minimal human involvement during training, and can be transferred to a variety of recognition tasks [7, 10, 28]
- This paper further evaluates ResNeXt on a larger ImageNet-5K set and the COCO object detection dataset [27], showing consistently better accuracy than its ResNet counterparts
- We note that while we present reformulations that exhibit concatenation (Fig. 3(b)) or grouped convolutions (Fig. 3(c)), such reformulations are not always applicable to the general form of Eqn. (3), e.g., if the transformations Ti take arbitrary forms and are heterogeneous
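The aggregated transformation of Eqn. (3), y = x + Σ_{i=1}^C Ti(x), and its concatenation reformulation (Fig. 3(b)) can be checked numerically. Below is a minimal NumPy sketch in which random 1×1-convolution weight matrices stand in for each branch Ti (all variable names are illustrative, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
C, d, D = 32, 4, 256          # cardinality, bottleneck width, block width

x = rng.standard_normal(D)

# Each branch T_i: reduce 256-d -> 4-d, then expand 4-d -> 256-d.
W_in = rng.standard_normal((C, d, D))    # per-branch 1x1 reduction
W_out = rng.standard_normal((C, D, d))   # per-branch 1x1 expansion

# Form (a): split, transform each branch, sum, add the identity shortcut.
y_sum = x + sum(W_out[i] @ (W_in[i] @ x) for i in range(C))

# Form (b): concatenate the C low-dim branch outputs (C*d = 128 dims)
# and apply one wide 1x1 expansion -- the concatenation reformulation.
z = np.concatenate([W_in[i] @ x for i in range(C)])           # shape (C*d,)
W_cat = np.concatenate([W_out[i] for i in range(C)], axis=1)  # shape (D, C*d)
y_cat = x + W_cat @ z

assert np.allclose(y_sum, y_cat)  # the two forms compute the same output
```

The equivalence holds here because each Ti is linear and homogeneous in shape; as the bullet above notes, it would not hold for arbitrary heterogeneous Ti.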

Methods

- The authors adopt a highly modularized design following VGG/ResNets.
- The authors' network consists of a stack of residual blocks; blocks of the same shape are grouped into stages (conv1 through conv5), following the ResNet-50 template.
- Both ResNet-50 and ResNeXt-50 (32×4d) begin with conv1: 7×7, 64 filters, stride 2, producing a 112×112 output.
- conv2 starts with a 3×3 max pool, stride 2; the ResNet-50 conv2 residual block is [1×1, 64; 3×3, 64; 1×1, 256], stacked ×3.
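As a sanity check on the "similar complexity" claim for the two conv2 templates, the parameter counts can be computed directly (a sketch that ignores batch-norm and bias parameters):

```python
def resnet_block_params(width=64, in_out=256):
    # ResNet conv2 template: 1x1 reduce, 3x3, 1x1 expand.
    return in_out * width + 3 * 3 * width * width + width * in_out

def resnext_block_params(C=32, d=4, in_out=256):
    # ResNeXt conv2 template: C parallel (1x1 reduce, 3x3, 1x1 expand) paths.
    return C * (in_out * d + 3 * 3 * d * d + d * in_out)

print(resnet_block_params())   # 69632  (~70k)
print(resnext_block_params())  # 70144  (~70k)
```

Both templates land near the ~70k figure quoted in the Table 2 caption, which is the sense in which cardinality is varied "with preserved complexity".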

Results

- The authors emphasize that while it is relatively easy to increase accuracy by increasing capacity, methods that increase accuracy while maintaining complexity are rare in the literature.
- The authors' experiments will show that the models improve accuracy while maintaining model complexity and the number of parameters.
- Compared with ResNet-50 (Table 3 top and Fig. 5 left), the 32×4d ResNeXt-50 has a validation error of 22.2%, which is 1.7% lower than the ResNet baseline’s 23.9%.
- The authors' larger method achieves 3.58% test error on CIFAR-10 and 17.31% on CIFAR-100

Conclusion

- The authors note that while they present reformulations that exhibit concatenation (Fig. 3(b)) or grouped convolutions (Fig. 3(c)), such reformulations are not always applicable to the general form of Eqn. (3), e.g., if the transformations Ti take arbitrary forms and are heterogeneous.
- The authors choose to use homogeneous forms in this paper because they are simpler and extensible
- Under this simplified case, grouped convolutions in the form of Fig. 3(c) are helpful for easing implementation.
- The authors choose to adjust the width of the bottleneck (e.g., 4-d in Fig. 1) when varying the cardinality C, because the bottleneck width can be isolated from the input and output of the block
- This strategy introduces no change to other hyper-parameters, so is helpful for them to focus on the impact of cardinality
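The cardinality–width trade-off described above can be made concrete. Holding the conv2 template at roughly 70k parameters, the per-path bottleneck width d for a given cardinality C follows from the quadratic 9·C·d² + 512·C·d ≈ 70k. The sketch below (assumed formula, derived from the parameter count C·(256·d + 3·3·d·d + d·256)) reproduces the (C, d) pairs in Table 2:

```python
import math

def width_for_cardinality(C, budget=70_000, in_out=256, k=3):
    # params(C, d) = C * (in_out*d + k*k*d*d + d*in_out)
    #             = (C*k*k) * d^2 + (2*C*in_out) * d
    a, b = C * k * k, C * 2 * in_out
    d = (-b + math.sqrt(b * b + 4 * a * budget)) / (2 * a)  # positive root
    return round(d)

for C in (1, 2, 4, 8, 32):
    print(C, width_for_cardinality(C))
# -> (1, 64), (2, 40), (4, 24), (8, 14), (32, 4), matching Table 2
```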

Tables

- Table1: (Left) ResNet-50. (Right) ResNeXt-50 with a 32×4d template (using the reformulation in Fig. 3(c)). Inside the brackets are the shape of a residual block, and outside the brackets is the number of stacked blocks on a stage. “C=32” suggests grouped convolutions [24] with 32 groups. The numbers of parameters and FLOPs are similar between these two models
- Table2: Relations between cardinality and width (for the template of conv2), with roughly preserved complexity on a residual block. The number of parameters is ∼70k for the template of conv2. The number of FLOPs is ∼0.22 billion (# params×56×56 for conv2)
- Table3: Ablation experiments on ImageNet-1K. (Top): ResNet50 with preserved complexity (∼4.1 billion FLOPs); (Bottom): ResNet-101 with preserved complexity (∼7.8 billion FLOPs). The error rate is evaluated on the single crop of 224×224 pixels
- Table4: Comparisons on ImageNet-1K when the number of FLOPs is increased to 2× of ResNet-101’s. The error rate is evaluated on the single crop of 224×224 pixels. The highlighted factors are the factors that increase complexity
- Table5: State-of-the-art models on the ImageNet-1K validation set (single-crop testing). The test size of ResNet/ResNeXt is 224×224 and 320×320 as in [15] and of the Inception models is 299×299
- Table6: Error (%) on ImageNet-5K. The models are trained on ImageNet-5K and tested on the ImageNet-1K val set, treated as a 5K-way classification task or a 1K-way classification task at test time. ResNeXt and its ResNet counterpart have similar complexity. The error is evaluated on the single crop of 224×224 pixels
- Table7: Test error (%) and model size on CIFAR. Our results are the average of 10 runs
- Table8: Object detection results on the COCO minival set. ResNeXt and its ResNet counterpart have similar complexity
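The ~0.22 billion FLOPs quoted in the Table 2 caption follows from the caption's own rule of thumb, #params × 56 × 56 (the conv2 spatial size). A quick check using the 32×4d conv2 parameter count:

```python
# conv2 template of ResNeXt-50 (32x4d): C=32 paths, bottleneck width d=4.
params = 32 * (256 * 4 + 3 * 3 * 4 * 4 + 4 * 256)
flops = params * 56 * 56   # caption's rule: #params x H x W for conv2

print(params)              # 70144 parameters (~70k)
print(flops)               # ~0.22 billion FLOPs
```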

Related work

- Multi-branch convolutional networks. The Inception models [38, 17, 39, 37] are successful multi-branch architectures where each branch is carefully customized. ResNets [14] can be thought of as two-branch networks where one branch is the identity mapping. Deep neural decision forests [22] are tree-patterned multi-branch networks with learned splitting functions.

- Grouped convolutions. The use of grouped convolutions dates back to the AlexNet paper [24], if not earlier. The motivation given by Krizhevsky et al. [24] is distributing the model over two GPUs. Grouped convolutions are supported by Caffe [19], Torch [3], and other libraries, mainly for compatibility with AlexNet. To the best of our knowledge, there has been little evidence on exploiting grouped convolutions to improve accuracy. A special case of grouped convolutions is channel-wise convolutions, in which the number of groups is equal to the number of channels. Channel-wise convolutions are part of the separable convolutions in [35].
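The group structure described above is easiest to see in the 1×1 case: a grouped 1×1 convolution is a block-diagonal linear map, and setting the number of groups equal to the number of channels recovers channel-wise (depthwise) convolution. A minimal NumPy sketch (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def grouped_1x1(x, weights):
    """Grouped 1x1 convolution on a channel vector x.

    weights has shape (groups, out_per_group, in_per_group)."""
    xs = np.split(x, len(weights))                  # split channels into groups
    return np.concatenate([W @ xg for W, xg in zip(weights, xs)])

x = rng.standard_normal(8)

# 4 groups: each maps its 2 input channels to 2 output channels.
W = rng.standard_normal((4, 2, 2))
y = grouped_1x1(x, W)

# Equivalent dense form: one block-diagonal weight matrix.
dense = np.zeros((8, 8))
for g in range(4):
    dense[2 * g:2 * g + 2, 2 * g:2 * g + 2] = W[g]
assert np.allclose(y, dense @ x)

# Special case groups == channels: channel-wise convolution,
# i.e., an independent scale per channel.
scales = rng.standard_normal((8, 1, 1))
assert np.allclose(grouped_1x1(x, scales), scales[:, 0, 0] * x)
```

The block-diagonal view is also why grouped convolutions cut parameters by a factor of the group count relative to a dense layer of the same width.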

Funding

- S.X. and Z.T.’s research was partly supported by NSF IIS-1618477

References

- S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. Insideoutside net: Detecting objects in context with skip pooling and recurrent neural networks. In CVPR, 2016.
- G. Cantor. Über unendliche, lineare Punktmannichfaltigkeiten. Arbeiten zur Mengenlehre aus den Jahren 1872–1884, 1884.
- R. Collobert, S. Bengio, and J. Mariethoz. Torch: a modular machine learning software library. Technical report, Idiap, 2002.
- A. Conneau, H. Schwenk, L. Barrault, and Y. Lecun. Very deep convolutional networks for natural language processing. arXiv:1606.01781, 2016.
- N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
- E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 2014.
- J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
- D. Eigen, J. Rolfe, R. Fergus, and Y. LeCun. Understanding deep architectures using a recursive convolutional network. arXiv:1312.1847, 2013.
- R. Girshick. Fast R-CNN. In ICCV, 2015.
- R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
- S. Gross and M. Wilber. Training and investigating Residual Nets. https://github.com/facebook/fb.resnet.torch, 2016.
- K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask R-CNN. arXiv:1703.06870, 2017.
- K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
- K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
- K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.
- Y. Ioannou, D. Robertson, R. Cipolla, and A. Criminisi. Deep roots: Improving cnn efficiency with hierarchical filter groups. arXiv:1605.06489, 2016.
- S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
- M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. In BMVC, 2014.
- Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.
- N. Kalchbrenner, L. Espeholt, K. Simonyan, A. v. d. Oord, A. Graves, and K. Kavukcuoglu. Neural machine translation in linear time. arXiv:1610.10099, 2016.
- Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. In ICLR, 2016.
- P. Kontschieder, M. Fiterau, A. Criminisi, and S. R. Bulo. Deep convolutional neural decision forests. In ICCV, 2015.
- A. Krizhevsky. Learning multiple layers of features from tiny images. Tech Report, 2009.
- A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
- Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1989.
- M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR, 2014.
- T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV. 2014.
- J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
- D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
- A. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv:1609.03499, 2016.
- P. O. Pinheiro, R. Collobert, and P. Dollar. Learning to segment object candidates. In NIPS, 2015.
- S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
- P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
- L. Sifre and S. Mallat. Rigid-motion scattering for texture classification. arXiv:1403.1687, 2014.
- K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
- C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, inception-resnet and the impact of residual connections on learning. In ICLR Workshop, 2016.
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
- C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
- A. Veit, M. Wilber, and S. Belongie. Residual networks behave like ensembles of relatively shallow networks. In NIPS, 2016.
- Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144, 2016.
- W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig. The Microsoft 2016 Conversational Speech Recognition System. arXiv:1609.03528, 2016.
- S. Zagoruyko and N. Komodakis. Wide residual networks. In BMVC, 2016.
- M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
