Dropout: a simple way to prevent neural networks from overfitting

Journal of Machine Learning Research, 15:1929–1958, 2014.

Keywords:
deep learning, model combination, neural networks, regularization

Abstract:

Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem.

Introduction
  • Deep neural networks contain multiple non-linear hidden layers and this makes them very expressive models that can learn very complicated relationships between their inputs and outputs.
  • Many of these complicated relationships will be the result of sampling noise, so they will exist in the training set but not in real test data, even if the test data are drawn from the same distribution.
  • This leads to overfitting and many methods have been developed for reducing it.
  • The best way to “regularize” a fixed-sized model is to average the predictions of all possible settings of the parameters, weighting each setting by its posterior probability given the training data.
  • These methods for reducing overfitting include stopping the training as soon as performance on a validation set starts to get worse, introducing weight penalties of various kinds such as L1 and L2 regularization, and soft weight sharing (Nowlan and Hinton, 1992).
Highlights
  • Deep neural networks contain multiple non-linear hidden layers and this makes them very expressive models that can learn very complicated relationships between their inputs and outputs
  • While 5% noise typically works best for Denoising Autoencoders, we found that our weight scaling procedure applied at test time enables us to use much higher noise levels (see the train/test-time sketch after this list).
  • We found that dropout improved generalization performance on all data sets compared to neural networks that did not use dropout
  • Dropout is a technique for improving neural networks by reducing overfitting
  • Random dropout breaks up these co-adaptations by making the presence of any particular hidden unit unreliable. This technique was found to improve the performance of neural nets in a wide variety of application domains including object classification, digit recognition, speech recognition, document classification and analysis of computational biology data
  • Dropout considerably improved the performance of standard neural nets on other data sets as well
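A minimal NumPy sketch of the train-time masking and test-time weight scaling referred to above, using hypothetical layer sizes and a retention probability p (illustrative only, not the authors' original implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_layer(x, W, b, p, train=True):
    """One fully connected ReLU layer with dropout on its output units.

    p is the probability of *retaining* a unit, as in the paper.
    During training each output unit is kept with probability p;
    at test time no units are dropped and the layer output is scaled
    by p, so its expected value matches the training-time average.
    """
    h = np.maximum(0.0, x @ W + b)                # ReLU activations
    if train:
        mask = rng.binomial(1, p, size=h.shape)   # Bernoulli(p) mask per unit
        return h * mask
    return p * h                                  # weight/activation scaling at test time

# Hypothetical sizes: 784 inputs (e.g. MNIST pixels), 1024 hidden units.
x = rng.standard_normal((32, 784))
W = 0.01 * rng.standard_normal((784, 1024))
b = np.zeros(1024)

h_train = dropout_layer(x, W, b, p=0.5, train=True)
h_test = dropout_layer(x, W, b, p=0.5, train=False)
print(h_train.mean(), h_test.mean())  # close on average, since E[mask] = p
```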
Methods
  • The MNIST comparison (Table 2) covers a standard neural net with logistic units (Simard et al., 2003), an SVM with a Gaussian kernel, dropout NNs with logistic and ReLU units, dropout NNs with ReLU units plus a max-norm constraint at several sizes, and a maxout network trained with dropout and a max-norm constraint (Goodfellow et al., 2013).
  • It also includes a DBN with standard finetuning (Hinton and Salakhutdinov, 2006), a DBM with standard finetuning (Salakhutdinov and Hinton, 2009), and the same DBN and DBM finetuned with dropout.
  • The standard neural net baseline uses 2 layers of 800 logistic units; the kernel SVM has no unit type or architecture entry (NA).
Results
  • The authors trained dropout neural networks for classification problems on data sets in different domains.
  • A 4-layer net pretrained with a stack of RBMs gets a phone error rate of 22.7%.
  • With dropout, this reduces to 19.7%.
  • The authors recently discovered that multiplying by a random variable drawn from N(1, 1) works just as well as, or perhaps better than, using Bernoulli noise.
  • This new form of dropout amounts to adding a Gaussian-distributed random variable with zero mean and standard deviation equal to the activation of the unit.
  • Because the expected value of the activations remains unchanged, no weight scaling is required at test time (a minimal comparison with Bernoulli dropout follows this list).
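A minimal sketch comparing this multiplicative Gaussian noise with standard Bernoulli dropout on a vector of hidden activations (names and sizes are illustrative, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.maximum(0.0, rng.standard_normal(1_000_000))  # some ReLU activations

# Bernoulli dropout with retention probability p: E[mask] = p, so the kept
# activations are rescaled by 1/p here (the paper instead leaves them unscaled
# during training and multiplies the weights by p at test time).
p = 0.5
bernoulli = a * rng.binomial(1, p, size=a.shape) / p

# Multiplicative Gaussian dropout: multiply by r ~ N(1, 1), i.e.
# a * r = a + a * eps with eps ~ N(0, 1), so the injected noise has zero mean
# and standard deviation equal to the activation itself.
gaussian = a * rng.normal(loc=1.0, scale=1.0, size=a.shape)

# Both noisy versions match the clean activations in expectation, which is why
# the Gaussian variant needs no extra scaling at test time.
print(a.mean(), bernoulli.mean(), gaussian.mean())
```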
Conclusion
  • Dropout is a technique for improving neural networks by reducing overfitting. Standard backpropagation learning builds up brittle co-adaptations that work for the training data but do not generalize to unseen data.
  • Random dropout breaks up these co-adaptations by making the presence of any particular hidden unit unreliable
  • This technique was found to improve the performance of neural nets in a wide variety of application domains including object classification, digit recognition, speech recognition, document classification and analysis of computational biology data.
  • This suggests that dropout is a general technique and is not specific to any domain.
  • Dropout considerably improved the performance of standard neural nets on other data sets as well
Tables
  • Table1: Overview of the data sets used in this paper
  • Table2: Comparison of different models on MNIST. The MNIST data set consists of 28 × 28 pixel handwritten digit images. The task is to classify the images into 10 digit classes
  • Table3: Results on the Street View House Numbers data set
  • Table4: Error rates on CIFAR-10 and CIFAR-100
  • Table5: Results on the ILSVRC-2010 test set
  • Table6: Results on the ILSVRC-2012 validation/test set
  • Table7: Phone error rate on the TIMIT core test set
  • Table8: Results on the Alternative Splicing Data Set
  • Table9: Comparison of different regularization methods on MNIST
  • Table10: Comparison of classification error % with Bernoulli and Gaussian dropout. For MNIST, the Bernoulli model uses p = 0.5 for the hidden units and p = 0.8 for the input units
Related work
  • Dropout can be interpreted as a way of regularizing a neural network by adding noise to its hidden units. The idea of adding noise to the states of units has previously been used in the context of Denoising Autoencoders (DAEs) by Vincent et al. (2008, 2010), where noise is added to the input units of an autoencoder and the network is trained to reconstruct the noise-free input. Our work extends this idea by showing that dropout can be effectively applied in the hidden layers as well and that it can be interpreted as a form of model averaging. We also show that adding noise is not only useful for unsupervised feature learning but can also be extended to supervised learning problems. In fact, our method can be applied to other neuron-based architectures, for example, Boltzmann Machines. While 5% noise typically works best for DAEs, we found that our weight scaling procedure applied at test time enables us to use much higher noise levels. Dropping out 20% of the input units and 50% of the hidden units was often found to be optimal.
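As a rough illustration of the model-averaging interpretation mentioned above, the deterministic weight-scaled network used at test time can be compared against a Monte Carlo average over sampled thinned networks. The following toy sketch uses made-up weights and sizes and is not one of the authors' experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy 2-layer net: 20 inputs -> 50 hidden (ReLU) -> 5 classes.
W1 = 0.3 * rng.standard_normal((20, 50)); b1 = np.zeros(50)
W2 = 0.3 * rng.standard_normal((50, 5));  b2 = np.zeros(5)
x = rng.standard_normal((1, 20))
p = 0.5  # retention probability for hidden units

def thinned_prediction():
    """Sample one thinned network by applying a Bernoulli mask to the hidden units."""
    h = np.maximum(0.0, x @ W1 + b1)
    h = h * rng.binomial(1, p, size=h.shape)
    return softmax(h @ W2 + b2)

# Monte Carlo average over many sampled thinned networks...
mc_average = np.mean([thinned_prediction() for _ in range(10_000)], axis=0)

# ...versus the single deterministic network whose hidden activations are
# scaled by p, the approximation used at test time.
h = p * np.maximum(0.0, x @ W1 + b1)
weight_scaled = softmax(h @ W2 + b2)

print(np.round(mc_average, 3))
print(np.round(weight_scaled, 3))  # typically close to the Monte Carlo average
```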
Funding
  • This research was supported by OGS, NSERC and an Early Researcher Award
References
  • M. Chen, Z. Xu, K. Weinberger, and F. Sha. Marginalized denoising autoencoders for domain adaptation. In Proceedings of the 29th International Conference on Machine Learning, pages 767–774. ACM, 2012.
  • G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton. Phone recognition with the mean-covariance restricted Boltzmann machine. In Advances in Neural Information Processing Systems 23, pages 469–477, 2010.
  • O. Dekel, O. Shamir, and L. Xiao. Learning to classify with missing and corrupted features. Machine Learning, 81(2):149–178, 2010.
  • A. Globerson and S. Roweis. Nightmare at test time: robust learning by feature deletion. In Proceedings of the 23rd International Conference on Machine Learning, pages 353–360. ACM, 2006.
  • I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In Proceedings of the 30th International Conference on Machine Learning, pages 1319–1327. ACM, 2013.
  • G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
  • G. E. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.
  • K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In Proceedings of the International Conference on Computer Vision (ICCV’09). IEEE, 2009.
  • A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012.
  • Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
  • Y. Lin, F. Lv, S. Zhu, M. Yang, T. Cour, K. Yu, L. Cao, Z. Li, M.-H. Tsai, X. Zhou, T. Huang, and T. Zhang. Imagenet classification: fast descriptor coding and large-scale svm training. Large scale visual recognition challenge, 2010.
  • A. Livnat, C. Papadimitriou, N. Pippenger, and M. W. Feldman. Sex, mixability, and modularity. Proceedings of the National Academy of Sciences, 107(4):1452–1457, 2010.
  • V. Mnih. CUDAMat: a CUDA-based matrix class for Python. Technical Report UTML TR 2009-004, Department of Computer Science, University of Toronto, November 2009.
  • A. Mohamed, G. E. Dahl, and G. E. Hinton. Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, 2010.
  • R. M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag New York, Inc., 1996.
  • Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011.
  • S. J. Nowlan and G. E. Hinton. Simplifying neural networks by soft weight-sharing. Neural Computation, 4(4), 1992.
  • D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely. The Kaldi Speech Recognition Toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.
  • R. Salakhutdinov and G. Hinton. Deep Boltzmann machines. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 5, pages 448–455, 2009.
  • R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning. ACM, 2008.
  • J. Sanchez and F. Perronnin. High-dimensional signature compression for large-scale image classification. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, pages 1665–1672, 2011.
  • P. Sermanet, S. Chintala, and Y. LeCun. Convolutional neural networks applied to house numbers digit classification. In International Conference on Pattern Recognition (ICPR 2012), 2012.
  • P. Simard, D. Steinkraus, and J. Platt. Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of the Seventh International Conference on Document Analysis and Recognition, volume 2, pages 958–962, 2003.
  • J. Snoek, H. Larochelle, and R. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, pages 2960–2968, 2012.
  • N. Srebro and A. Shraibman. Rank, trace-norm and max-norm. In Proceedings of the 18th annual conference on Learning Theory, COLT’05, pages 545–560. Springer-Verlag, 2005.
  • N. Srivastava. Improving Neural Networks with Dropout. Master’s thesis, University of Toronto, January 2013.
  • R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B. Methodological, 58(1):267–288, 1996.
  • A. N. Tikhonov. On the stability of inverse problems. Doklady Akademii Nauk SSSR, 39(5): 195–198, 1943.
  • L. van der Maaten, M. Chen, S. Tyree, and K. Q. Weinberger. Learning with marginalized corrupted features. In Proceedings of the 30th International Conference on Machine Learning, pages 410–418. ACM, 2013.
  • P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103. ACM, 2008.
  • P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408, 2010.
  • S. Wager, S. Wang, and P. Liang. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems 26, pages 351–359, 2013.
  • S. Wang and C. D. Manning. Fast dropout training. In Proceedings of the 30th International Conference on Machine Learning, pages 118–126. ACM, 2013.
  • H. Y. Xiong, Y. Barash, and B. J. Frey. Bayesian prediction of tissue-regulated splicing using RNA sequence and cellular context. Bioinformatics, 27(18):2554–2562, 2011.
  • M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. CoRR, abs/1301.3557, 2013.