# Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

International Conference on Machine Learning, 2015.

Keywords:

covariate shift, stochastic optimization, internal covariate shift, mini batch, stochastic gradient descent

Abstract:

Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating n...

Introduction

- Deep learning has dramatically advanced the state of the art in vision, speech, and many other areas.
- Stochastic gradient descent (SGD) has proved to be an effective way of training deep networks, and SGD variants such as momentum (Sutskever et al, 2013) and Adagrad (Duchi et al, 2011) have been used to achieve state-of-the-art performance.
- SGD optimizes the parameters Θ of the network, so as to minimize the loss.
- The gradient of the loss over a mini-batch
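
The mini-batch gradient mentioned above can be illustrated with a minimal sketch. It assumes a toy one-parameter least-squares model rather than a deep network, and the names `minibatch_gradient` and `sgd`, the learning rate, and the batch size are all illustrative choices, not values from the paper.

```python
import random

def minibatch_gradient(theta, batch):
    """Average gradient of the loss 0.5 * (theta * x - y)**2 over a mini-batch."""
    m = len(batch)
    return sum((theta * x - y) * x for x, y in batch) / m

def sgd(data, theta=0.0, lr=0.5, batch_size=4, steps=200, seed=0):
    """Plain SGD: each step uses the gradient estimated from one mini-batch."""
    rng = random.Random(seed)
    for _ in range(steps):
        batch = rng.sample(data, batch_size)  # draw a mini-batch without replacement
        theta -= lr * minibatch_gradient(theta, batch)
    return theta

# Toy data generated from y = 3x, so the loss is minimized at theta = 3.
data = [(x / 10.0, 3.0 * x / 10.0) for x in range(1, 11)]
theta_hat = sgd(data)
```

Each step touches only `batch_size` examples, yet the averaged gradient is an estimate of the gradient over the full training set.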

Highlights

- Deep learning has dramatically advanced the state of the art in vision, speech, and many other areas
- We propose a new mechanism, which we call Batch Normalization, that takes a step towards reducing internal covariate shift, and in doing so dramatically accelerates the training of deep neural nets
- To verify the effects of internal covariate shift on training, and the ability of Batch Normalization to combat it, we considered the problem of predicting the digit class on the MNIST dataset (LeCun et al, 1998a)
- We evaluated the following networks, all trained on the LSVRC2012 training data, and tested on the validation data: Inception: the network described at the beginning of Section 4.2, trained with the initial learning rate of 0.0015
- Batch Normalization is based on the premise that covariate shift, which is known to complicate the training of machine learning systems, also applies to sub-networks and layers, and that removing it from the internal activations of the network may aid in training
- To enable stochastic optimization methods commonly used in deep network training, we perform the normalization for each mini-batch, and backpropagate the gradients through the normalization parameters
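
The per-mini-batch normalization described in the last highlight can be sketched for a single scalar activation as follows. This is a minimal illustration, not the authors' implementation: the parameter names `gamma` (scale), `beta` (shift), and `eps` (a small constant for numerical stability) are assumed here, and their default values are arbitrary.

```python
import math

def batch_norm_forward(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one activation over a mini-batch, then scale and shift it."""
    m = len(xs)
    mean = sum(xs) / m
    var = sum((x - mean) ** 2 for x in xs) / m  # biased mini-batch variance
    x_hat = [(x - mean) / math.sqrt(var + eps) for x in xs]
    return [gamma * xh + beta for xh in x_hat]

ys = batch_norm_forward([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=0.5)
```

Whatever the scale of the inputs, the outputs have a mean close to `beta` and a standard deviation close to `gamma`, which is what keeps the distribution seen by the next layer stable.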

Methods

- To verify the effects of internal covariate shift on training, and the ability of Batch Normalization to combat it, the authors considered the problem of predicting the digit class on the MNIST dataset (LeCun et al, 1998a).
- The authors used a very simple network, with a 28x28 binary image as input, and 3 fully-connected hidden layers with 100 activations each.
- The last hidden layer was followed by a fully-connected layer with 10 activations and cross-entropy loss.
- The authors were interested in the comparison between the baseline and batch-normalized networks, rather than in achieving state-of-the-art performance on MNIST.
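
The layer sizes above can be made concrete with a rough sketch of the baseline network's forward pass. The sigmoid nonlinearity, the Gaussian initialization scale of 0.01, the random stand-in for a binary image, and all function names are assumptions made for illustration; the summary above specifies only the layer sizes and the cross-entropy loss.

```python
import math
import random

# 28x28 binary input, three hidden layers of 100 units, 10 output classes.
LAYER_SIZES = [28 * 28, 100, 100, 100, 10]

def init_params(sizes, rng, scale=0.01):
    """One (weights, biases) pair per layer; small Gaussian weights (assumed)."""
    return [
        ([[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Return the 10 output logits; hidden layers use a sigmoid (assumed)."""
    h = x
    for i, (W, b) in enumerate(params):
        z = [sum(w * v for w, v in zip(row, h)) + bias for row, bias in zip(W, b)]
        h = z if i == len(params) - 1 else [sigmoid(v) for v in z]
    return h

def cross_entropy(logits, label):
    """Softmax cross-entropy, computed with the log-sum-exp trick."""
    mx = max(logits)
    log_sum = mx + math.log(sum(math.exp(v - mx) for v in logits))
    return log_sum - logits[label]

rng = random.Random(0)
params = init_params(LAYER_SIZES, rng)
image = [float(rng.random() < 0.5) for _ in range(28 * 28)]  # random stand-in image
loss = cross_entropy(forward(params, image), label=3)
n_params = sum(len(W) * len(W[0]) + len(b) for W, b in params)
```

With these sizes the network has 99,710 weights and biases. The batch-normalized variant would normalize the activations of each hidden layer over the mini-batch; that step is not shown here.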

Results

- While in Inception an L2 loss on the model parameters controls overfitting, in modified BN-Inception the weight of this loss is reduced by a factor of 5.
- The authors find that this improves the accuracy on the held-out validation data
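
The change in regularization strength amounts to simple arithmetic, sketched below. The baseline coefficient `BASE_L2_WEIGHT` is a hypothetical placeholder, since the summary gives only the reduction factor of 5, and `total_loss` is an illustrative helper.

```python
def total_loss(data_loss, weights, l2_weight):
    """Data loss plus an L2 penalty on the model parameters."""
    return data_loss + l2_weight * sum(w * w for w in weights)

BASE_L2_WEIGHT = 1e-4                  # hypothetical baseline coefficient
BN_L2_WEIGHT = BASE_L2_WEIGHT / 5.0    # reduced by a factor of 5 for BN-Inception

weights = [1.0, 2.0]
baseline_penalty = total_loss(0.0, weights, BASE_L2_WEIGHT)
bn_penalty = total_loss(0.0, weights, BN_L2_WEIGHT)
```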

Conclusion

- The authors have presented a novel mechanism for dramatically accelerating the training of deep networks.
- The authors' proposed method draws its power from normalizing activations, and from incorporating this normalization in the network architecture itself.
- This ensures that the normalization is appropriately handled by any optimization method that is being used to train the network.
- To enable stochastic optimization methods commonly used in deep network training, the authors perform the normalization for each mini-batch, and backpropagate the gradients through the normalization parameters.
- The resulting networks can be trained with saturating nonlinearities, are more tolerant to increased learning rates, and often do not require Dropout for regularization.
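
Backpropagating through the normalization means that the mini-batch mean and variance are treated as functions of the inputs, so gradients flow through them as well. The sketch below applies the chain rule to a single scalar activation. It is an illustration under assumed names (`bn_forward`, `bn_backward`, `gamma`, `beta`, `eps`), not the authors' code.

```python
import math

def bn_forward(xs, gamma, beta, eps=1e-5):
    """Normalize over the mini-batch, scale and shift; cache values for backward."""
    m = len(xs)
    mean = sum(xs) / m
    var = sum((x - mean) ** 2 for x in xs) / m
    inv_std = 1.0 / math.sqrt(var + eps)
    x_hat = [(x - mean) * inv_std for x in xs]
    ys = [gamma * xh + beta for xh in x_hat]
    return ys, (xs, mean, inv_std, x_hat, gamma)

def bn_backward(dys, cache):
    """Given dLoss/dy for each example, return dLoss/dx, dLoss/dgamma, dLoss/dbeta."""
    xs, mean, inv_std, x_hat, gamma = cache
    m = len(xs)
    dx_hat = [dy * gamma for dy in dys]
    # The variance and mean depend on every input, so their gradients are sums.
    dvar = sum(d * (x - mean) for d, x in zip(dx_hat, xs)) * -0.5 * inv_std ** 3
    dmean = -inv_std * sum(dx_hat) + dvar * sum(-2.0 * (x - mean) for x in xs) / m
    dxs = [d * inv_std + dvar * 2.0 * (x - mean) / m + dmean / m
           for d, x in zip(dx_hat, xs)]
    dgamma = sum(dy * xh for dy, xh in zip(dys, x_hat))
    dbeta = sum(dys)
    return dxs, dgamma, dbeta

xs = [0.5, -1.2, 2.0, 0.3]
dys = [1.0, -2.0, 0.5, 3.0]  # example upstream gradients
gamma, beta = 1.5, 0.1
_, cache = bn_forward(xs, gamma, beta)
dxs, dgamma, dbeta = bn_backward(dys, cache)
```

One consequence of including the statistics in the gradient is that the entries of `dxs` sum to zero: shifting every input in the batch by the same constant leaves the normalized output unchanged.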

Reference

- Bengio, Yoshua and Glorot, Xavier. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of AISTATS 2010, volume 9, pp. 249–256, May 2010.
- Dean, Jeffrey, Corrado, Greg S., Monga, Rajat, Chen, Kai, Devin, Matthieu, Le, Quoc V., Mao, Mark Z., Ranzato, Marc’Aurelio, Senior, Andrew, Tucker, Paul, Yang, Ke, and Ng, Andrew Y. Large scale distributed deep networks. In NIPS, 2012.
- Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, July 2011. ISSN 1532-4435.
- Gulcehre, Caglar and Bengio, Yoshua. Knowledge matters: Importance of prior information for optimization. CoRR, abs/1301.4083, 2013.
- He, K., Zhang, X., Ren, S., and Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ArXiv e-prints, February 2015.
- Hyvarinen, A. and Oja, E. Independent component analysis: Algorithms and applications. Neural Netw., 13(4-5): 411–430, May 2000.
- Jiang, Jing. A literature survey on domain adaptation of statistical classifiers, 2008.
- LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998a.
- LeCun, Y., Bottou, L., Orr, G., and Muller, K. Efficient backprop. In Orr, G. and Muller, K. (eds.), Neural Networks: Tricks of the trade. Springer, 1998b.
- Lyu, S and Simoncelli, E P. Nonlinear image representation using divisive normalization. In Proc. Computer Vision and Pattern Recognition, pp. 1–8. IEEE Computer Society, Jun 23-28 2008. doi: 10.1109/CVPR.2008.4587821.
- Omnipress, 2010.
- Pascanu, Razvan, Mikolov, Tomas, and Bengio, Yoshua. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pp. 1310–1318, 2013.
- Povey, Daniel, Zhang, Xiaohui, and Khudanpur, Sanjeev. Parallel training of deep neural networks with natural gradient and parameter averaging. CoRR, abs/1410.7455, 2014.
- Raiko, Tapani, Valpola, Harri, and LeCun, Yann. Deep learning made easier by linear transformations in perceptrons. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 924–932, 2012.
- Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., and Fei-Fei, Li. ImageNet Large Scale Visual Recognition Challenge, 2014.
- Saxe, Andrew M., McClelland, James L., and Ganguli, Surya. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. CoRR, abs/1312.6120, 2013.
- Shimodaira, Hidetoshi. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90 (2):227–244, October 2000.
- Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, January 2014.
- Sutskever, Ilya, Martens, James, Dahl, George E., and Hinton, Geoffrey E. On the importance of initialization and momentum in deep learning. In ICML (3), volume 28 of JMLR Proceedings, pp. 1139–1147. JMLR.org, 2013.
- Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
- Wiesler, Simon and Ney, Hermann. A convergence analysis of log-linear training. In Shawe-Taylor, J., Zemel, R.S., Bartlett, P., Pereira, F.C.N., and Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 657–665, Granada, Spain, December 2011.
- Wiesler, Simon, Richard, Alexander, Schluter, Ralf, and Ney, Hermann. Mean-normalized stochastic gradient for large-scale deep learning. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 180–184, Florence, Italy, May 2014.
- Wu, Ren, Yan, Shengen, Shan, Yi, Dang, Qingqing, and Sun, Gang. Deep image: Scaling up image recognition, 2015.
