Group Normalization
International Journal of Computer Vision, pp. 742-755, 2020.
Abstract:
Batch Normalization (BN) is a milestone technique in the development of deep learning, enabling various networks to train. However, normalizing along the batch dimension introduces problems: BN's error increases rapidly when the batch size becomes smaller, caused by inaccurate batch statistics estimation. This limits BN's usage for training larger models and transferring features to computer vision tasks including detection, segmentation, and video, which require small batches constrained by memory consumption. The paper presents Group Normalization (GN) as a simple alternative to BN: GN divides the channels into groups and computes within each group the mean and variance for normalization. GN's computation is independent of batch sizes, and its accuracy is stable over a wide range of batch sizes.
Introduction
- Batch Normalization (Batch Norm or BN) [1] has been established as a very effective component in deep learning, largely helping push the frontier in computer vision [2,3] and beyond [4].
- BN normalizes the features by the mean and variance computed within a batch.
- This has been shown by many practices to ease optimization and enable very deep networks to converge.
- The stochastic uncertainty of the batch statistics acts as a regularizer that can benefit generalization.
- BN is required to work with a sufficiently large batch size (e.g., 32 images per worker [1,2,3]).
- A small batch leads to inaccurate estimation of the batch statistics, and reducing BN’s batch size increases the model error dramatically (Figure 1)
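A minimal NumPy sketch (not the authors' code) of the batch statistics described above; the function name and shapes are illustrative. The statistics are pooled over the batch and spatial axes for each channel, which is why a small batch leaves few samples behind each estimate.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (N, C, H, W); gamma, beta: (1, C, 1, 1) learned scale and shift.
    # Statistics are pooled over the batch (N) and spatial (H, W) axes,
    # so a small batch size N gives a noisy per-channel estimate.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```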
Highlights
- Batch Normalization (Batch Norm or BN) [1] has been established as a very effective component in deep learning, largely helping push the frontier in computer vision [2,3] and beyond [4]
- To study Group Norm/Batch Norm compared to no normalization, we consider VGG-16 [57] that can be healthily trained without normalization layers
- Group Norm improves over Batch Norm* by 1.1 box Average Precision and 0.8 mask Average Precision
- On the contrary, applying Batch Norm to the box head does not give a satisfactory result and is ∼9 Average Precision worse; in detection, the RoIs in a batch are sampled from the same image, so their distribution is not i.i.d., and this non-i.i.d. distribution degrades Batch Norm's batch statistics estimation [35]
- We have presented Group Norm as an effective normalization layer without exploiting the batch dimension (a minimal sketch of the computation follows this list)
- On ResNet-50 trained in ImageNet, Group Norm has 10.6% lower error than its Batch Norm counterpart when using a batch size of 2; when using typical batch sizes, Group Norm is comparably good with Batch Norm and outperforms other normalization variants
- Batch Norm has been so influential that many state-of-the-art systems and their hyper-parameters have been designed for it, and these may not be optimal for Group Norm-based models
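The grouping referred to in the highlights can be sketched in a few lines of NumPy (the default of 32 groups follows the paper; everything else here is illustrative, not the authors' reference implementation). The channels are divided into G groups, and the mean and variance are computed per sample within each group, so nothing depends on the batch size.

```python
import numpy as np

def group_norm(x, gamma, beta, G=32, eps=1e-5):
    # x: (N, C, H, W); gamma, beta: (1, C, 1, 1); C must be divisible by G.
    N, C, H, W = x.shape
    x = x.reshape(N, G, C // G, H, W)
    # Per-sample, per-group statistics: independent of the batch size N.
    mean = x.mean(axis=(2, 3, 4), keepdims=True)
    var = x.var(axis=(2, 3, 4), keepdims=True)
    x = (x - mean) / np.sqrt(var + eps)
    x = x.reshape(N, C, H, W)
    return gamma * x + beta  # per-channel learned scale and shift
```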
Methods
- 4.1 Image Classification in ImageNet. Implementation details.
- As standard practice [3,52], the authors use 8 GPUs to train all models, and the batch mean and variance of BN are computed within each GPU.
- The authors initialize all γ parameters to 1, except for each residual block’s last normalization layer, where γ is initialized to 0 following [54].
- The authors train 100 epochs for all models, and decrease the learning rate by 10× at 30, 60, and 90 epochs.
- Other implementation details follow [52]
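A hedged PyTorch sketch of the recipe above (the γ initialization and the step learning-rate schedule). The model constructor, the attribute names, and the base learning rate / momentum / weight decay are assumptions for illustration, not the authors' code.

```python
import torch
import torchvision

# Build ResNet-50; torchvision's attribute names (e.g. bn3 for the last
# normalization layer in a bottleneck block) are an assumption of this sketch.
model = torchvision.models.resnet50()

# gamma (the norm layer's weight) defaults to 1 everywhere; zero-init the
# last normalization layer of each residual block, following [54].
for m in model.modules():
    if isinstance(m, torchvision.models.resnet.Bottleneck):
        torch.nn.init.zeros_(m.bn3.weight)

# Base LR / momentum / weight decay are not stated in this summary; the
# values below are the common ImageNet recipe from [52], used only to make
# the schedule concrete.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)

# Train for 100 epochs, dividing the learning rate by 10 at epochs 30, 60, 90.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 60, 90], gamma=0.1)
```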
Results
- Results and analysis of VGG models
- To study GN/BN compared to no normalization, the authors consider VGG-16 [57] that can be healthily trained without normalization layers.
- Table 4 shows the comparison of GN vs BN* on Mask R-CNN using a conv4 backbone (“C4” [10])
- This C4 variant uses ResNet’s layers of up to conv4 to extract feature maps, and ResNet’s conv5 layers as the Region-of-Interest (RoI) heads for classification and regression.
- As they are inherited from the pre-trained model, the backbone and head both involve normalization layers
- On this baseline, GN improves over BN* by 1.1 box AP and 0.8 mask AP.
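For reference, a sketch of what the frozen BN baseline (BN*, see Tables 4-5) amounts to: each BN layer becomes a fixed affine transform using the statistics inherited from ImageNet pre-training, so no batch statistics are computed during fine-tuning. The class below is illustrative, not the exact Detectron implementation.

```python
import torch
import torch.nn as nn

class FrozenBatchNorm2d(nn.Module):
    """y = (x - running_mean) / sqrt(running_var + eps) * gamma + beta,
    with all four per-channel tensors kept constant (stored as buffers)."""
    def __init__(self, num_channels, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.register_buffer("weight", torch.ones(num_channels))
        self.register_buffer("bias", torch.zeros(num_channels))
        self.register_buffer("running_mean", torch.zeros(num_channels))
        self.register_buffer("running_var", torch.ones(num_channels))

    def forward(self, x):
        # Fold the frozen statistics into a single scale and shift.
        scale = self.weight * (self.running_var + self.eps).rsqrt()
        shift = self.bias - self.running_mean * scale
        return x * scale.view(1, -1, 1, 1) + shift.view(1, -1, 1, 1)
```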
Conclusion
- The authors have presented GN as an effective normalization layer without exploiting the batch dimension.
- The authors have evaluated GN’s behaviors in a variety of applications.
- BN has been so influential that many state-of-the-art systems and their hyper-parameters have been designed for it, and these may not be optimal for GN-based models.
- It is possible that re-designing the systems or searching new hyper-parameters for GN will give better results
Tables
- Table1: Comparison of error rates with a batch size of 32 images/GPU, on ResNet-50 in the ImageNet validation set. The error curves are in Figure 4
- Table2: Sensitivity to batch sizes. We show ResNet-50’s validation error (%) in ImageNet. The last row shows the differences between BN and GN. The error curves are in Figure 5. This table is visualized in Figure 1
- Table3: Group division. We show ResNet-50’s validation error (%) in ImageNet, trained with 32 images/GPU. (Left): a given number of groups. (Right): a given number of channels per group. The last rows show the differences with the best number
- Table4: Detection and segmentation results in COCO, using Mask R-CNN with the ResNet-50 C4 backbone. BN* means BN is frozen
- Table5: Detection and segmentation results in COCO, using Mask R-CNN with ResNet-50 FPN and a 4conv1fc bounding box head. BN* means BN is frozen
- Table6: Detection and segmentation results in COCO using Mask R-CNN and FPN. Here BN* is the default Detectron baseline [59], and GN is applied to the backbone, box head, and mask head. “long” means training with more iterations
- Table7: COCO models trained from scratch using Mask R-CNN and FPN
- Table8: Video classification in Kinetics: ResNet-50 I3D’s top-1/5 accuracy (%)
Related work
- Normalization. Normalization layers in deep networks had been widely used before the development of BN. Local Response Normalization (LRN) [26,27,28] was a component in AlexNet [28] and following models [29,30,31]. LRN computes the statistics in a small neighborhood for each pixel.
Batch Normalization [1] performs more global normalization along the batch dimension (and as importantly, it suggests to do this for all layers). But the concept of “batch” is not always present, or it may change from time to time. For example, batch-wise normalization is not legitimate at inference time, so the mean and variance are pre-computed from the training set [1], often by running average; consequently, there is no normalization performed when testing. The pre-computed statistics may also change when the target data distribution changes [32]. These issues lead to inconsistency at training, transferring, and testing time. In addition, as aforementioned, reducing the batch size can have dramatic impact on the estimated batch statistics.
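A minimal sketch of the pre-computed statistics mentioned above: during training the batch mean and variance are folded into an exponential running average, and at test time those stored values replace the batch statistics, so the layer becomes a fixed per-channel transform. The momentum value and class name are illustrative assumptions.

```python
import numpy as np

class BNRunningStats:
    """Running estimates of the per-channel mean/variance used at test time."""
    def __init__(self, num_channels, momentum=0.1):
        self.mean = np.zeros(num_channels)
        self.var = np.ones(num_channels)
        self.momentum = momentum

    def update(self, batch_mean, batch_var):
        # Exponential moving average over training batches.
        self.mean = (1 - self.momentum) * self.mean + self.momentum * batch_mean
        self.var = (1 - self.momentum) * self.var + self.momentum * batch_var

    def normalize(self, x, eps=1e-5):
        # Test-time path for x of shape (N, C, H, W): no batch statistics,
        # only the stored running averages are used.
        m = self.mean.reshape(1, -1, 1, 1)
        v = self.var.reshape(1, -1, 1, 1)
        return (x - m) / np.sqrt(v + eps)
```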
Reference
- Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: ICML. (2015)
- Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: CVPR. (2016)
- He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. (2016)
- Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., Hassabis, D.: Mastering the game of go without human knowledge. Nature (2017)
- Szegedy, C., Ioffe, S., Vanhoucke, V.: Inception-v4, inception-resnet and the impact of residual connections on learning. In: ICLR Workshop. (2016)
- Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: CVPR. (2017)
- Xie, S., Girshick, R., Dollar, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: CVPR. (2017)
- Girshick, R.: Fast R-CNN. In: ICCV. (2015)
- Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: NIPS. (2015)
- He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask R-CNN. In: ICCV. (2017)
- Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR. (2015)
- Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: ICCV. (2015)
- Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: CVPR. (2017)
- Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV (2004)
- Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR. (2005)
- Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. IJCV (2015)
- Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv:1607.06450 (2016)
- Ulyanov, D., Vedaldi, A., Lempitsky, V.: Instance normalization: The missing ingredient for fast stylization. arXiv:1607.08022 (2016)
- Salimans, T., Kingma, D.P.: Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In: NIPS. (2016)
- Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV. (2014)
- Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al.: The Kinetics human action video dataset. arXiv:1705.06950 (2017)
- Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by backpropagating errors. Nature (1986)
- Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation (1997)
- Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: NIPS. (2014)
- Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: CVPR. (2017)
- Lyu, S., Simoncelli, E.P.: Nonlinear image representation using divisive normalization. In: CVPR. (2008)
- Jarrett, K., Kavukcuoglu, K., LeCun, Y., et al.: What is the best multi-stage architecture for object recognition? In: ICCV. (2009)
- Krizhevsky, A., Sutskever, I., Hinton, G.: Imagenet classification with deep convolutional neural networks. In: NIPS. (2012)
- Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional neural networks. In: ECCV. (2014)
- Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: Overfeat: Integrated recognition, localization and detection using convolutional networks. In: ICLR. (2014)
- Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: CVPR. (2015)
- Rebuffi, S.A., Bilen, H., Vedaldi, A.: Learning multiple visual domains with residual adapters. In: NIPS. (2017)
- Arpit, D., Zhou, Y., Kota, B., Govindaraju, V.: Normalization propagation: A parametric technique for removing internal covariate shift in deep networks. In: ICML. (2016)
- Ren, M., Liao, R., Urtasun, R., Sinz, F.H., Zemel, R.S.: Normalizing the normalizers: Comparing and extending network normalization schemes. In: ICLR. (2017)
- Ioffe, S.: Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In: NIPS. (2017)
- Peng, C., Xiao, T., Li, Z., Jiang, Y., Zhang, X., Jia, K., Yu, G., Sun, J.: MegDet: A large mini-batch object detector. In: CVPR. (2018)
- Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q.V., et al.: Large scale distributed deep networks. In: NIPS. (2012)
- Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861 (2017)
- Chollet, F.: Xception: Deep learning with depthwise separable convolutions. In: CVPR. (2017)
- Zhang, X., Zhou, X., Lin, M., Sun, J.: ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In: CVPR. (2018)
- Oliva, A., Torralba, A.: Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV (2001)
- Jegou, H., Douze, M., Schmid, C., Perez, P.: Aggregating local descriptors into a compact image representation. In: CVPR. (2010)
- Perronnin, F., Dance, C.: Fisher kernels on visual vocabularies for image categorization. In: CVPR. (2007)
- Dieleman, S., De Fauw, J., Kavukcuoglu, K.: Exploiting cyclic symmetry in convolutional neural networks. In: ICML. (2016)
- Cohen, T., Welling, M.: Group equivariant convolutional networks. In: ICML. (2016)
- Heeger, D.J.: Normalization of cell responses in cat striate cortex. Visual neuroscience (1992)
- Schwartz, O., Simoncelli, E.P.: Natural signal statistics and sensory gain control. Nature neuroscience (2001)
- Simoncelli, E.P., Olshausen, B.A.: Natural image statistics and neural representation. Annual review of neuroscience (2001)
- Carandini, M., Heeger, D.J.: Normalization as a canonical neural computation. Nature Reviews Neuroscience (2012)
- Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch. (2017)
- Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: Tensorflow: A system for large-scale machine learning. In: Operating Systems Design and Implementation (OSDI). (2016)
- Gross, S., Wilber, M.: Training and investigating Residual Nets. https://github.com/facebook/fb.resnet.torch (2016)
- He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: ICCV. (2015)
- Goyal, P., Dollar, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., He, K.: Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677 (2017)
- Krizhevsky, A.: One weird trick for parallelizing convolutional neural networks. arXiv:1404.5997 (2014)
- Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. arXiv:1606.04838 (2016)
- Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR. (2015)
- Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollar, P.: Focal loss for dense object detection. In: ICCV. (2017)
- Girshick, R., Radosavovic, I., Gkioxari, G., Dollar, P., He, K.: Detectron. https://github.com/facebookresearch/detectron (2018)
- Lin, T.Y., Dollar, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: CVPR. (2017)
- Ren, S., He, K., Girshick, R., Zhang, X., Sun, J.: Object detection networks on convolutional feature maps. TPAMI (2017)
- Li, Z., Peng, C., Yu, G., Zhang, X., Deng, Y., Sun, J.: DetNet: A backbone network for object detection. arXiv:1804.06215 (2018)
- Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: CVPR. (2018)