This paper proposes Stochastic Normalization to improve the widely used batch normalization module in a dropout-like way to battle against over-fitting
NeurIPS 2020
Fine-tuning pre-trained deep networks on a small dataset is an important component in the deep learning pipeline. A critical problem in fine-tuning is how to avoid over-fitting when data are limited. Existing efforts work from two aspects: (1) impose regularization on parameters or features; (2) transfer prior knowledge to fine-tuning by ...
- Training deep networks (Szegedy et al, 2015; He et al, 2016b; Huang et al, 2017) from scratch requires large amounts of data.
- For each new task at hand, it is unrealistic to collect a new dataset at the scale of ImageNet. Thanks to the release of pre-trained deep networks in PyTorch (Benoit et al, 2019) and TensorFlow (Abadi et al, 2016), practitioners can benefit from deep learning (LeCun et al, 2015) even with a small amount of data.
- The practice of transferring pre-trained parameters, a.k.a. fine-tuning, is prevalent in both computer vision (Jung et al, 2015) and natural language processing (Devlin et al, 2019).
- With the rapid development of transfer learning, medical image analysis can benefit from deep learning even with a small amount of data
- We design experiments with 5% samples, 10% samples, and 15% samples. The task in this dataset is multi-label binary classification and the evaluation metric is the average AUC for the fourteen diseases
- This paper proposes a new network module called StochNorm as the basic building block of deep neural networks
- Experimental results indicate that our method can outperform state-of-the-art fine-tuning methods over four datasets when there are limited data
- Its broader impact depends on the usage scenario of fine-tuning in deep learning applications
- StochNorm is compared with several fine-tuning methods: vanilla fine-tuning; L2-SP (Li et al, 2018), which regularizes the weight parameters around pre-trained parameters to alleviate catastrophic forgetting; DELTA (Li et al, 2019b), which selects features with a supervised attention mechanism; and BSS (Chen et al, 2019), which penalizes small eigenvalues of feature representations to protect training from negative transfer.
- Hyper-parameters for each method are selected on validation data.
- The authors follow the train/validation/test partition of each dataset.
- Medical image analysis.
- How to alleviate over-fitting in fine-tuning with small datasets is an important problem.
- It can greatly improve fine-tuning of pre-trained models in the small data regime.
- It may inspire further investigation of regularization techniques.
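Concretely, the dropout-like idea can be sketched in pure Python: during training, each channel is normalized either by mini-batch statistics (ordinary BN behavior) or by the moving statistics inherited from the pre-trained network, with the branch chosen at random. This is a minimal sketch based on the summary above; the function name, the selection probability `p`, the `eps` value, and the omission of affine parameters and moving-statistics updates are simplifying assumptions, not the paper's exact implementation.

```python
import random
import statistics

def stochnorm_forward(x, moving_mean, moving_var, p=0.5, eps=1e-5):
    """Simplified StochNorm forward pass.

    x: list of channels, each a list of activations over the mini-batch.
    moving_mean / moving_var: per-channel statistics from the pre-trained net.
    """
    out = []
    for c, channel in enumerate(x):
        if random.random() < p:
            # Branch selected with probability p: pre-trained moving statistics.
            mu, var = moving_mean[c], moving_var[c]
        else:
            # Otherwise: ordinary BN behavior with mini-batch statistics.
            mu, var = statistics.fmean(channel), statistics.pvariance(channel)
        out.append([(v - mu) / (var + eps) ** 0.5 for v in channel])
    return out
```

Setting p = 0 recovers plain batch normalization, while larger p injects more of the pre-trained prior, which is the dropout-like knob the summary describes for regularizing fine-tuning.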
- Table 1: Comparing StochNorm and existing methods on regularization type and knowledge transfer. The left part lists primary regularization techniques: Li et al (2018) regularize the parameters near their pre-trained values, Li et al (2019b) regularize the features near features computed by pre-trained networks, and Chen et al (2019) penalize small eigenvalues of feature representations. As an alternative to parameter regularization and feature regularization, the proposed StochNorm regularizes fine-tuning by module design, which we interpret as "architecture regularization." The right part lists whether each type of knowledge is transferred during fine-tuning, with a focus on commonly used ConvNets. Knowledge-free layers such as max-pooling and the ReLU function are omitted from the table. ConvNets are usually constructed by stacking Conv-BN-ReLU blocks, followed by a task-specific fully-connected layer. It is a common belief that the knowledge in fully-connected layers is task-specific and cannot be transferred. Transferring learnable parameters (weight and bias) is as easy as reusing them. Nevertheless, the moving statistics in BN layers are simply discarded due to the characteristic behavior of BN (see Section 4.2). The proposed StochNorm also transfers the moving statistics of pre-trained networks to better exploit their prior knowledge.
- Table 2: Average performance (AUC) of diagnosing different pathologies on NIH Chest X-ray.
- Table 3: Top-1 Accuracy (%) of StochNorm and different methods (Backbone: ResNet-50).
- Table 4: Accuracy of StochNorm integrated with different methods (Backbone: ResNet-50).
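The compared regularizers differ mainly in what they pull the model toward. A minimal sketch contrasting vanilla weight decay with the L2-SP penalty of Li et al (2018); the function names and flat weight lists are illustrative assumptions, not the authors' code:

```python
def l2_penalty(weights, alpha=0.01):
    # Vanilla weight decay: shrink weights toward zero.
    return alpha * sum(w * w for w in weights)

def l2_sp_penalty(weights, pretrained, alpha=0.01):
    # L2-SP (Li et al, 2018): shrink weights toward their pre-trained
    # starting point instead, alleviating catastrophic forgetting.
    return alpha * sum((w - w0) ** 2 for w, w0 in zip(weights, pretrained))
```

With weight decay, fine-tuned weights drift toward zero; with L2-SP, they stay near the pre-trained starting point, which is what alleviates catastrophic forgetting.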
- Our work is related to regularization and normalization techniques used in deep learning, which are reviewed in turn below.
2.1 Normalization Techniques
Normalizing input features helps optimization because widely used first-order optimization algorithms such as SGD work better on a more isotropic loss landscape (Boyd & Vandenberghe, 2004). Later, Ioffe & Szegedy (2015) propose Batch Normalization (BN) to normalize intermediate feature maps with statistics computed over mini-batch samples, and find that it greatly helps the training of deep networks.
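The core BN step can be sketched in a few lines. This toy version standardizes a single feature with mini-batch statistics only (no affine parameters, no moving-average updates), and the small eps guard is an assumption for numerical safety:

```python
import statistics

def batch_standardize(feature):
    # Core BN step: center and scale one feature by the mean and
    # standard deviation computed over the current mini-batch.
    mu = statistics.fmean(feature)
    sigma = statistics.pstdev(feature)
    return [(v - mu) / (sigma + 1e-5) for v in feature]
```

The output has (approximately) zero mean and unit variance over the mini-batch, which makes the optimization landscape more isotropic in the sense discussed above.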
Inspired by Ioffe & Szegedy (2015), many normalization techniques have been introduced for different learning scenarios. Layer Normalization (Ba et al, 2016) and Recurrent Batch Normalization (Cooijmans et al, 2017) are effective in recurrent neural networks, Group Normalization (Wu & He, 2018) is designed for object detection, Instance Normalization (Ulyanov et al, 2016) speeds up neural stylization, Weight Normalization (Salimans & Kingma, 2016) accelerates the convergence of SGD through a simple re-parameterization, and Spectral Normalization (Miyato et al, 2018) addresses the mode collapse problem in generative adversarial networks. Shekhovtsov & Flach (2018) interpret BN as Bayesian learning and propose how to incorporate Bayesian learning into other normalization modules. These normalization modules are tailored to specific optimization problems but are not related to fine-tuning. Among them, BN is the most widely used normalization module in deep learning; thus, this paper focuses on ConvNets normalized by BN layers.
- Acknowledgments and Disclosure of Funding This work was supported by the National Natural Science Foundation of China (61772299, 71690231), Beijing Nova Program (Z201100006820041), University S&T Innovation Plan by the Ministry of Education of China
Study subjects and analysis
To evaluate StochNorm, we apply it to four visual recognition tasks. Experimental results indicate that our method can outperform state-of-the-art fine-tuning methods over four datasets when there are limited data. We also conduct insight analyses and an ablation study to better understand StochNorm.
Datasets. The evaluation is conducted on four standard datasets. CUB-200-2011 (Welinder et al, 2010) is a dataset for fine-grained bird recognition with 200 bird species and 11,788 images; it is an extended version of the CUB-200 dataset. FGVC Aircraft (Maji et al, 2013) contains 10,000 aircraft images, with 100 images for each of the 100 categories. NIH Chest X-ray (Wang et al, 2017) consists of 112,120 frontal-view X-ray images of 30,805 patients with fourteen disease labels (each image can have multiple labels).
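The evaluation metric for this dataset, the average AUC over the fourteen disease labels, can be sketched in pure Python using the rank-statistic formulation of ROC AUC; the function names are illustrative, not from the paper:

```python
def auc(scores, labels):
    """ROC AUC via the rank statistic: the probability that a random
    positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_auc(score_matrix, label_matrix):
    """Macro-average AUC across labels (columns), as in the 14-label
    multi-label evaluation on NIH Chest X-ray."""
    n_labels = len(score_matrix[0])
    aucs = [auc([row[j] for row in score_matrix],
                [row[j] for row in label_matrix]) for j in range(n_labels)]
    return sum(aucs) / n_labels
```

Each label's AUC is computed independently and the fourteen per-disease values are then averaged.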
In this case, StochNorm achieves an average 4.1% increase over vanilla fine-tuning, demonstrating its regularization effect. Compared with state-of-the-art fine-tuning methods, StochNorm is superior across a wide spectrum of sampling rates on these three datasets. It is worth noting that we work on avoiding over-fitting in fine-tuning.
- Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., and Isard, M. Tensorflow: a system for large-scale machine learning. In OSDI, 2016.
- Ba, L. J., Kiros, J. R., and Hinton, G. E. Layer normalization. In NeurIPS, 2016.
- Benoit, S., Zachary, D., Soumith, C., Sam, G., Adam, P., Francisco, M., Adam, L., Gregory, C., Zeming, L., Edward, Y., Alban, D., Alykhan, T., Andreas, K., James, B., Luca, A., Martin, R., Natalia, G., Sasank, C., Trevor, K., Lu, F., and Junjie, B. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In NeurIPS, 2019.
- Bishop, C. M. Pattern recognition and machine learning. 2006.
- Bjorck, N., Gomes, C. P., Selman, B., and Weinberger, K. Q. Understanding Batch Normalization. In NeurIPS, 2018.
- Boyd, S. and Vandenberghe, L. Convex optimization. 2004.
- Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. arXiv preprint arXiv:2002.05709, 2020.
- Chen, X., Wang, S., Fu, B., Long, M., and Wang, J. Catastrophic forgetting meets negative transfer: Batch spectral shrinkage for safe transfer learning. In Advances in Neural Information Processing Systems, pp. 1906–1916, 2019.
- Cooijmans, T., Ballas, N., Laurent, C., Gülçehre, Ç., and Courville, A. Recurrent batch normalization. In ICLR, 2017.
- Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
- Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL, 2019.
- Guo, Y., Wu, Q., Deng, C., Chen, J., and Tan, M. Double forward propagation for memorized batch normalization. In AAAI, 2018.
- He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In IEEE conference on computer vision and pattern recognition (CVPR), pp. 770–778, 2016a.
- He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016b.
- He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
- Huang, G., Liu, Z., Weinberger, K. Q., and van der Maaten, L. Densely connected convolutional networks. In CVPR, 2017.
- Ioffe, S. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In Advances in neural information processing systems, pp. 1945–1953, 2017.
- Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
- Jung, H., Lee, S., Yim, J., Park, S., and Kim, J. Joint fine-tuning in deep neural networks for facial expression recognition. In ICCV, 2015.
- Krause, J., Stark, M., Deng, J., and Fei-Fei, L. 3d object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.
- LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 2015.
- Li, X., Yves, G., and Franck, D. Explicit inductive bias for transfer learning with convolutional networks. In ICML, pp. 2830–2839, 2018.
- Li, X., Chen, S., Hu, X., and Yang, J. Understanding the Disharmony Between Dropout and Batch Normalization by Variance Shift. In CVPR, 2019a.
- Li, X., Xiong, H., Wang, H., Rao, Y., Liu, L., and Huan, J. Delta: Deep learning transfer using feature map with attention for convolutional networks. In International Conference on Learning Representations (ICLR), 2019b.
- Maji, S., Rahtu, E., Kannala, J., Blaschko, M., and Vedaldi, A. Fine-grained visual classification of aircraft. Technical report, 2013.
- Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral Normalization for Generative Adversarial Networks. In ICLR, 2018.
- Peng, C., Xiao, T., Li, Z., Jiang, Y., Zhang, X., Jia, K., Yu, G., and Sun, J. MegDet: A Large Mini-Batch Object Detector. In CVPR, 2018.
- Raghu, M., Zhang, C., Kleinberg, J., and Bengio, S. Transfusion: Understanding transfer learning for medical imaging. In Advances in Neural Information Processing Systems, pp. 3342–3352, 2019.
- Salimans, T. and Kingma, D. P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In NeurIPS, 2016.
- Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. How Does Batch Normalization Help Optimization? In NeurIPS, 2018.
- Shekhovtsov, A. and Flach, B. Stochastic normalizations as bayesian learning. In ACCV, 2018.
- Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
- Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014.
- Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In CVPR, 2015.
- Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In IEEE conference on computer vision and pattern recognition (CVPR), pp. 2818–2826, 2016.
- Tang, Y., Wang, Y., Xu, Y., Shi, B., Xu, C., Xu, C., and Xu, C. Beyond Dropout: Feature Map Distortion to Regularize Deep Neural Networks. In AAAI, 2020.
- Ulyanov, D., Vedaldi, A., and Lempitsky, V. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
- Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., and Summers, R. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In 2017 IEEE Conference on Computer Vision and Pattern Recognition(CVPR), pp. 3462–3471, 2017.
- Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., and Perona, P. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
- Wu, Y. and He, K. Group normalization. In ECCV, 2018.
- Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In ICLR, 2017.