Stochastic Normalization

NeurIPS 2020

Abstract

Fine-tuning pre-trained deep networks on a small dataset is an important component of the deep learning pipeline. A critical problem in fine-tuning is how to avoid over-fitting when data are limited. Existing efforts approach this from two aspects: (1) imposing regularization on parameters or features; (2) transferring prior knowledge to fine-tuning by …

Introduction
  • Training deep networks (Szegedy et al, 2015; He et al, 2016b; Huang et al, 2017) from scratch requires large amounts of data.
  • For each new task at hand, it is unrealistic to collect a new dataset at the scale of ImageNet. Thanks to the release of pre-trained deep networks in PyTorch (Benoit et al, 2019) and TensorFlow (Abadi et al, 2016), practitioners can benefit from deep learning (LeCun et al, 2015) even with a small amount of data.
  • The practice of transferring pre-trained parameters, a.k.a. fine-tuning, is prevalent in both computer vision (Jung et al, 2015) and natural language processing (Devlin et al, 2019)
Highlights
  • Training deep networks (Szegedy et al, 2015; He et al, 2016b; Huang et al, 2017) from scratch requires large amounts of data
  • With the rapid development of transfer learning, medical image analysis can benefit from deep learning even with a small amount of data
  • We design experiments with 5%, 10%, and 15% of the samples. The task in this dataset is multi-label binary classification, and the evaluation metric is the average AUC over the fourteen diseases
  • This paper proposes Stochastic Normalization (StochNorm) to improve the widely used batch normalization module in a dropout-like way to combat over-fitting (a minimal illustrative sketch follows this list)
  • This paper proposes a new network module called StochNorm as the basic building block of deep neural networks
  • Experimental results indicate that our method outperforms state-of-the-art fine-tuning methods on four datasets when data are limited
  • Its broader impact depends on the usage scenario of fine-tuning in deep learning applications
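
The page does not include any implementation, so the following is a minimal PyTorch sketch of what a dropout-like normalization layer could look like, based only on the description above: during training, each channel is normalized either by the current mini-batch statistics or by the moving statistics (which can be initialized from a pre-trained network), chosen at random. The class name StochNorm2d, the per-channel selection, and the probability p are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn


class StochNorm2d(nn.BatchNorm2d):
    """Illustrative dropout-like variant of BatchNorm2d (not the official StochNorm)."""

    def __init__(self, num_features, p=0.5, **kwargs):
        super().__init__(num_features, **kwargs)
        self.p = p  # assumed hyper-parameter: probability of using moving statistics

    def forward(self, x):
        if not self.training:
            # At evaluation time, behave exactly like standard BatchNorm2d.
            return super().forward(x)

        # Mini-batch statistics, as in standard BN training.
        batch_mean = x.mean(dim=(0, 2, 3))
        batch_var = x.var(dim=(0, 2, 3), unbiased=False)

        # Keep updating the moving statistics exactly as BN would.
        with torch.no_grad():
            momentum = self.momentum if self.momentum is not None else 0.1
            self.running_mean.mul_(1 - momentum).add_(momentum * batch_mean)
            self.running_var.mul_(1 - momentum).add_(momentum * batch_var)

        # Per-channel Bernoulli mask: 1 -> use moving statistics, 0 -> use batch statistics.
        use_moving = (torch.rand(self.num_features, device=x.device) < self.p).float()
        mean = use_moving * self.running_mean + (1 - use_moving) * batch_mean
        var = use_moving * self.running_var + (1 - use_moving) * batch_var

        x_hat = (x - mean[None, :, None, None]) / torch.sqrt(var[None, :, None, None] + self.eps)
        return self.weight[None, :, None, None] * x_hat + self.bias[None, :, None, None]
```

Because such a layer keeps the same interface as nn.BatchNorm2d, it could replace BN layers in a pre-trained backbone while reusing their weights, biases, and moving statistics.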
Methods
  • StochNorm is compared with several fine-tuning methods: vanilla fine-tuning; L2-SP (Li et al, 2018), which regularizes the weight parameters around their pre-trained values to alleviate catastrophic forgetting; DELTA (Li et al, 2019b), which selects features with a supervised attention mechanism; and BSS (Chen et al, 2019), which penalizes small eigenvalues of feature representations to protect training from negative transfer (an illustrative sketch of the L2-SP penalty follows this list).
  • Hyper-parameters for each method are selected on validation data.
  • The authors follow the train/validation/test partition of each dataset.
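
To make the comparison with L2-SP concrete, here is a minimal sketch of an L2-SP-style penalty in PyTorch: instead of decaying weights toward zero, it pulls them toward their pre-trained values. The function name, the summation over all matching parameters, and the coefficient alpha are assumptions for illustration, not details taken from the papers.

```python
import torch


def l2_sp_penalty(model, pretrained_state, alpha=0.01):
    """Sketch of an L2-SP-style penalty: squared distance to pre-trained weights."""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, param in model.named_parameters():
        if name in pretrained_state:  # parameters absent here (e.g. a new head) are unconstrained
            ref = pretrained_state[name].to(param.device)
            penalty = penalty + ((param - ref) ** 2).sum()
    return alpha * penalty


# Typical usage inside a training step (sketch):
#   loss = criterion(model(images), labels) + l2_sp_penalty(model, pretrained_state)
#   loss.backward()
```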
Results
  • Medical image analysis.
  • With the rapid development of transfer learning, medical image analysis can benefit from deep learning even with a small amount of data.
  • The authors design experiments with 5%, 10%, and 15% of the samples.
  • The task in this dataset is multi-label binary classification, and the evaluation metric is the average AUC over the fourteen diseases.
Conclusion
  • How to alleviate over-fitting in fine-tuning with small datasets is an important problem.
  • This paper proposes a new network module called StochNorm as the basic building block of deep neural networks
  • It can greatly improve the fine-tuning of pre-trained models in the small-data regime.
  • Its broader impact depends on the usage scenario of fine-tuning in deep learning applications.
  • It may inspire further investigation of regularization techniques
Tables
  • Table 1: Comparing StochNorm and existing methods on regularization type and knowledge transfer. The left part lists primary regularization techniques. Li et al (2018) regularize the parameters near their pre-trained values, Li et al (2019b) regularize the features near features computed by pre-trained networks, and Chen et al (2019) penalize small eigenvalues of feature representations. As an alternative to parameter regularization and feature regularization, the proposed StochNorm regularizes fine-tuning by module design, which we interpret as "architecture regularization". The right part lists whether each type of knowledge is transferred during fine-tuning, with a focus on commonly used ConvNets. Knowledge-free layers such as max-pooling and the ReLU function are omitted from the table. Usually ConvNets are constructed by stacking Conv-BN-ReLU blocks, followed by a task-specific fully-connected layer. It is a common belief that the knowledge in fully-connected layers is task-specific and cannot be transferred. Transferring learnable parameters (weight and bias) is as easy as reusing them. Nevertheless, the moving statistics in BN layers are simply discarded due to the characteristic behavior of BN (see Section 4.2). The proposed StochNorm also transfers the moving statistics of pre-trained networks to better exploit the prior knowledge in pre-trained networks
  • Table 2: Average performances (AUC) of diagnosing different pathologies on NIH Chest X-ray
  • Table 3: Top-1 Accuracy (%) of StochNorm and different methods (Backbone: ResNet-50)
  • Table 4: Accuracy of StochNorm integrated with different methods (Backbone: ResNet-50)
Related work
  • Our work is related to the regularization and normalization techniques used in deep learning, which are reviewed in turn below.

    2.1 Normalization Techniques

    Normalizing input features helps optimization because widely used first-order optimization algorithms such as SGD work better on a more isotropic loss landscape (Boyd & Vandenberghe, 2004). Later, Ioffe & Szegedy (2015) propose Batch Normalization (BN) to normalize intermediate feature maps with statistics computed over mini-batch samples and find that it greatly helps the training of deep networks.
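
For concreteness, the snippet below sketches BN's standard training-time computation for an (N, C, H, W) tensor: normalize each channel with mini-batch statistics, then update the moving averages used at inference time. This is textbook BN semantics rather than code from this paper; it also makes explicit which statistics a fine-tuning procedure could either discard or, as Table 1 notes for StochNorm, transfer from the pre-trained network.

```python
import torch


def batch_norm_train_step(x, gamma, beta, running_mean, running_var,
                          momentum=0.1, eps=1e-5):
    """Training-time BN for an (N, C, H, W) tensor (standard semantics, sketch only)."""
    # Per-channel statistics of the current mini-batch.
    batch_mean = x.mean(dim=(0, 2, 3))
    batch_var = x.var(dim=(0, 2, 3), unbiased=False)

    # Exponential moving averages, kept for use at inference time.
    running_mean = (1 - momentum) * running_mean + momentum * batch_mean
    running_var = (1 - momentum) * running_var + momentum * batch_var

    # Normalize with the mini-batch statistics, then scale and shift.
    x_hat = (x - batch_mean[None, :, None, None]) / torch.sqrt(
        batch_var[None, :, None, None] + eps)
    y = gamma[None, :, None, None] * x_hat + beta[None, :, None, None]
    return y, running_mean, running_var
```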

    Inspired by Ioffe & Szegedy (2015), many normalization techniques have been introduced to deal with different learning scenarios. Layer Normalization (Ba et al, 2016) and Recurrent Batch Normalization (Cooijmans et al, 2017) are effective in recurrent neural networks, Group Normalization (Wu & He, 2018) is designed for object detection, Instance Normalization (Ulyanov et al, 2016) speeds up neural stylization, Weight Normalization (Salimans & Kingma, 2016) accelerates the convergence of SGD by a simple re-parameterization, and Spectral Normalization (Miyato et al, 2018) addresses the mode collapse problem in generative adversarial networks. Shekhovtsov & Flach (2018) interpret BN as Bayesian learning and propose how to incorporate Bayesian learning into other normalization modules. These normalization modules are tailored to specific optimization problems but are not related to fine-tuning. Among them, BN is the most widely used normalization module in deep learning. Thus, this paper focuses on ConvNets normalized by BN layers.
Funding
  • This work was supported by the National Natural Science Foundation of China (61772299, 71690231), the Beijing Nova Program (Z201100006820041), and the University S&T Innovation Plan by the Ministry of Education of China.
Study subjects and analysis
datasets: 4
To evaluate StochNorm, we apply it to four visual recognition tasks. Experimental results indicate that our method outperforms state-of-the-art fine-tuning methods on four datasets when data are limited. We also conduct an insight analysis and an ablation study to better understand StochNorm

standard datasets: 4
Datasets. The evaluation is conducted on four standard datasets. CUB-200-2011 (Welinder et al, 2010) is a dataset for fine-grained bird recognition with 200 bird species and 11,788 images

bird species: 200
The evaluation is conducted on four standard datasets. CUB-200-2011 (Welinder et al, 2010) is a dataset for fine-grained bird recognition with 200 bird species and 11,788 images. It is an extended version of the CUB-200 dataset

patients with fourteen disease labels: 30,805
The dataset contains 10,000 aircraft images, with 100 images for each of the 100 categories. NIH Chest X-ray (Wang et al, 2017) consists of 112,120 frontal-view X-ray images of 30,805 patients with fourteen disease labels (each image can have multiple labels)

datasets: 3
In this case, StochNorm gets an average 4.1% increase compared with vanilla fine-tuning, demonstrating the regularization effect of StochNorm. Compared with state-of-the-art fine-tuning methods, StochNorm is superior across a wide spectrum of sampling rates on these three datasets. It is worth noting that we focus on avoiding over-fitting in fine-tuning

References
  • Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., and Isard, M. Tensorflow: a system for large-scale machine learning. In OSDI, 2016.
  • Ba, L. J., Kiros, J. R., and Hinton, G. E. Layer normalization. In NeurIPS, 2016.
  • Benoit, S., Zachary, D., Soumith, C., Sam, G., Adam, P., Francisco, M., Adam, L., Gregory, C., Zeming, L., Edward, Y., Alban, D., Alykhan, T., Andreas, K., James, B., Luca, A., Martin, R., Natalia, G., Sasank, C., Trevor, K., Lu, F., and Junjie, B. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In NeurIPS, 2019.
  • Bishop, C. M. Pattern recognition and machine learning. 2006.
  • Bjorck, N., Gomes, C. P., Selman, B., and Weinberger, K. Q. Understanding Batch Normalization. In NeurIPS, 2018.
  • Boyd, S. and Vandenberghe, L. Convex optimization. 2004.
  • Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. arXiv preprint arXiv:2002.05709, 2020.
  • Chen, X., Wang, S., Fu, B., Long, M., and Wang, J. Catastrophic forgetting meets negative transfer: Batch spectral shrinkage for safe transfer learning. In Advances in Neural Information Processing Systems, pp. 1906–1916, 2019.
  • Cooijmans, T., Ballas, N., Laurent, C., Gülçehre, Ç., and Courville, A. Recurrent batch normalization. 2017.
  • Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
  • Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL, 2019.
  • Guo, Y., Wu, Q., Deng, C., Chen, J., and Tan, M. Double forward propagation for memorized batch normalization. In AAAI, 2018.
  • He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In IEEE conference on computer vision and pattern recognition (CVPR), pp. 770–778, 2016a.
  • He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016b.
  • He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
  • Huang, G., Liu, Z., Weinberger, K. Q., and van der Maaten, L. Densely connected convolutional networks. In CVPR, 2017.
  • Ioffe, S. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In Advances in neural information processing systems, pp. 1945–1953, 2017.
  • Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • Jung, H., Lee, S., Yim, J., Park, S., and Kim, J. Joint Fine-Tuning in Deep Neural Networks for Facial Expression Recognition. 2015.
  • Krause, J., Stark, M., Deng, J., and Fei-Fei, L. 3d object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.
  • LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 2015.
  • Li, X., Yves, G., and Franck, D. Explicit inductive bias for transfer learning with convolutional networks. In ICML, pp. 2830–2839, 2018.
  • Li, X., Chen, S., Hu, X., and Yang, J. Understanding the Disharmony Between Dropout and Batch Normalization by Variance Shift. In CVPR, 2019a.
  • Li, X., Xiong, H., Wang, H., Rao, Y., Liu, L., and Huan, J. Delta: Deep learning transfer using feature map with attention for convolutional networks. In International Conference on Learning Representations (ICLR), 2019b.
  • Maji, S., Rahtu, E., Kannala, J., Blaschko, M., and Vedaldi, A. Fine-grained visual classification of aircraft. Technical report, 2013.
  • Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral Normalization for Generative Adversarial Networks. In ICLR, 2018.
  • Peng, C., Xiao, T., Li, Z., Jiang, Y., Zhang, X., Jia, K., Yu, G., and Sun, J. MegDet: A Large Mini-Batch Object Detector. In CVPR, 2018.
  • Raghu, M., Zhang, C., Kleinberg, J., and Bengio, S. Transfusion: Understanding transfer learning for medical imaging. In Advances in Neural Information Processing Systems, pp. 3342–3352, 2019.
  • Salimans, T. and Kingma, D. P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. 2016.
  • Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. How Does Batch Normalization Help Optimization? In NeurIPS, 2018.
  • Shekhovtsov, A. and Flach, B. Stochastic normalizations as bayesian learning. In ACCV, 2018.
  • Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
  • Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014.
  • Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. 2015.
  • Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In IEEE conference on computer vision and pattern recognition (CVPR), pp. 2818–2826, 2016.
  • Tang, Y., Wang, Y., Xu, Y., Shi, B., Xu, C., Xu, C., and Xu, C. Beyond Dropout: Feature Map Distortion to Regularize Deep Neural Networks. In AAAI, 2020.
  • Ulyanov, D., Vedaldi, A., and Lempitsky, V. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
  • Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., and Summers, R. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3462–3471, 2017.
  • Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., and Perona, P. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
  • Wu, Y. and He, K. Group normalization. In ECCV, 2018.
  • Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In ICLR, 2017.