Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models

Advances in Neural Information Processing Systems 30 (NIPS 2017), pp. 1945-1953, 2017.

Keywords:
individual example, deep network, batch normalization, batch renormalization, entire minibatch

Abstract:

Batch Normalization is quite effective at accelerating and improving the training of deep models. However, its effectiveness diminishes when the training minibatches are small, or do not consist of independent samples. We hypothesize that this is due to the dependence of model layer inputs on all the examples in the minibatch, and different activations being produced between training and inference. We propose Batch Renormalization, a simple and effective extension to ensure that the training and inference models generate the same outputs, which depend on individual examples rather than the entire minibatch. Models trained with Batch Renormalization perform substantially better than batchnorm on small or non-i.i.d. minibatches, while retaining the benefits of batchnorm such as insensitivity to initialization and training efficiency.

Introduction
  • Batch Normalization (“batchnorm” [6]) has recently become a part of the standard toolkit for training deep networks.
  • Batch normalization makes it possible to use significantly higher learning rates, and reduces the sensitivity to initialization.
  • These effects help accelerate the training, sometimes dramatically so.
  • Consider a particular node in the deep network, producing a scalar value for each input example; during training, batchnorm normalizes this value using the minibatch mean and variance, whereas inference normalizes it using moving averages (a minimal sketch of this mismatch follows the list).
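To make that train/inference mismatch concrete, here is a minimal NumPy sketch of plain batchnorm applied to one node's values; the moving-average variables and the epsilon value are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

# One node's scalar outputs over a minibatch (illustrative values).
values = np.random.randn(32)
moving_mean, moving_var = 0.0, 1.0   # accumulated over many past minibatches
eps = 1e-3

# Training: normalize with this minibatch's statistics, so each output
# depends on every example in the minibatch.
train_out = (values - values.mean()) / np.sqrt(values.var() + eps)

# Inference: normalize with the moving averages, so each output depends
# only on its own example. With small or non-i.i.d. minibatches the two
# normalizations can differ noticeably.
infer_out = (values - moving_mean) / np.sqrt(moving_var + eps)
```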
Highlights
  • Batch Normalization (“batchnorm” [6]) has recently become a part of the standard toolkit for training deep networks
  • We have demonstrated that Batch Normalization, while effective, is not well suited to small or non-i.i.d. training minibatches
  • We hypothesized that these drawbacks are due to the fact that the activations in the model, which are in turn used by other layers as inputs, are computed differently during training than during inference. We address this with Batch Renormalization, which replaces batchnorm and ensures that the outputs computed by the model are dependent only on the individual examples and not the entire minibatch, during both training and inference
  • Batch Renormalization extends batchnorm with a per-dimension correction to ensure that the activations match between the training and inference networks
  • Unlike batchnorm, where the means and variances used during inference do not need to be computed until the training has completed, Batch Renormalization benefits from having these statistics directly participate in the training
  • Batch Renormalization is as easy to implement as batchnorm itself, runs at the same speed during both training and inference, and significantly improves training on small or non-i.i.d. minibatches; a minimal sketch of the transform follows this list.
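Below is a minimal NumPy sketch of the Batch Renormalization transform described in these highlights (forward pass only). The function names and the default values of r_max, d_max, and momentum are illustrative assumptions rather than the authors' implementation, and in a real framework the correction factors r and d would be wrapped in a stop-gradient.

```python
import numpy as np

def batch_renorm_train(x, gamma, beta, moving_mean, moving_std,
                       r_max=3.0, d_max=5.0, momentum=0.01, eps=1e-3):
    """x: (batch, features). Returns the output and updated moving statistics."""
    batch_mean = x.mean(axis=0)
    batch_std = np.sqrt(x.var(axis=0) + eps)

    # Per-dimension correction factors; backprop should treat these as constants.
    r = np.clip(batch_std / moving_std, 1.0 / r_max, r_max)
    d = np.clip((batch_mean - moving_mean) / moving_std, -d_max, d_max)

    # Normalize with minibatch statistics, then correct toward the moving
    # statistics so training and inference activations match.
    x_hat = (x - batch_mean) / batch_std * r + d
    y = gamma * x_hat + beta

    # The moving averages participate directly in training, unlike in batchnorm.
    moving_mean = moving_mean + momentum * (batch_mean - moving_mean)
    moving_std = moving_std + momentum * (batch_std - moving_std)
    return y, moving_mean, moving_std

def batch_renorm_infer(x, gamma, beta, moving_mean, moving_std):
    # Inference uses only the moving statistics, exactly as in batchnorm.
    return gamma * (x - moving_mean) / moving_std + beta
```

With r fixed at 1 and d fixed at 0 this reduces to plain batchnorm; the paper starts training in that regime and gradually relaxes the clipping bounds.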
Results
  • To evaluate Batch Renormalization, the authors applied it to the problem of image classification.
  • The authors' baseline model is Inception v3 [13], trained on the 1000 classes of the ImageNet training set [9] and evaluated on the ImageNet validation data.
  • Batchnorm was used after convolution and before the ReLU [8].
  • To apply Batch Renorm, the authors swapped it into the model in place of batchnorm.
  • Both methods normalize each feature map over examples as well as over spatial locations, as sketched below.
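As a brief illustration of normalizing each feature map over examples and spatial locations, here is a sketch assuming an NHWC layout; the tensor shape and epsilon are illustrative.

```python
import numpy as np

x = np.random.randn(32, 35, 35, 288)     # (batch, height, width, channels)
eps = 1e-3

# One mean and one variance per feature map (channel), reduced over the
# batch and both spatial dimensions; batchnorm and Batch Renorm both
# reduce over these axes.
mean = x.mean(axis=(0, 1, 2))
var = x.var(axis=(0, 1, 2))
x_hat = (x - mean) / np.sqrt(var + eps)  # broadcasts over N, H, W
```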
Conclusion
  • The authors have demonstrated that Batch Normalization, while effective, is not well suited to small or non-i.i.d. training minibatches
  • The authors hypothesized that these drawbacks are due to the fact that the activations in the model, which are in turn used by other layers as inputs, are computed differently during training than during inference.
  • Batch Renormalization extends batchnorm with a per-dimension correction to ensure that the activations match between the training and inference networks.
  • A more extensive investigation of the effect of these parameters (the clipping bounds on the correction and the moving-average update rate) is part of future work.
Reference
  • [1] Devansh Arpit, Yingbo Zhou, Bhargava U. Kota, and Venu Govindaraju. Normalization propagation: A parametric technique for removing internal covariate shift in deep networks. arXiv preprint arXiv:1603.01431, 2016.
  • [2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • [3] Jianmin Chen, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981, 2016.
  • [4] Jacob Goldberger, Sam Roweis, Geoff Hinton, and Ruslan Salakhutdinov. Neighbourhood components analysis. In Advances in Neural Information Processing Systems 17, 2004.
  • [5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
  • [6] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 448-456, 2015.
  • [7] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
  • [8] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, pages 807-814. Omnipress, 2010.
  • [9] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge, 2014.
  • [10] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2226-2234, 2016.
  • [11] Tim Salimans and Diederik P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, 2016.
  • [12] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. CoRR, abs/1503.03832, 2015.
  • [13] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818-2826, 2016.
  • [14] T. Tieleman and G. Hinton. Lecture 6.5 - RMSProp. COURSERA: Neural Networks for Machine Learning, 2012.