Survey of Dropout Methods for Deep Neural Networks

arXiv: Neural and Evolutionary Computing, 2019.

Abstract:

Dropout methods are a family of stochastic techniques used in neural network training or inference that have generated significant research interest and are widely used in practice. They have been successfully applied in neural network regularization, model compression, and in measuring the uncertainty of neural network outputs. While originally …

Introduction
  • Deep neural networks are a topic of widespread interest in contemporary artificial intelligence and signal processing.
  • The behaviour of standard dropout during training for a neural network layer is given by $y = f(Wx) \circ m$, with $m_i \sim \mathrm{Bernoulli}(1 - p)$, where $y$ is the layer output, $f(\cdot)$ is the activation function, $W$ is the layer weight matrix, $x$ is the layer input, and $m$ is the layer dropout mask, with each element $m_i$ equal to 0 with probability $p$ (see the sketch below).
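To make the training-time behaviour concrete, here is a minimal NumPy sketch of a dense layer with standard dropout. The function name dropout_layer_train and its arguments are illustrative, not from the paper; it simply implements $y = f(Wx) \circ m$ with $m_i \sim \mathrm{Bernoulli}(1 - p)$.

```python
import numpy as np

def dropout_layer_train(W, x, p, activation=np.tanh, rng=np.random.default_rng()):
    """Training-time forward pass of one dense layer with standard dropout.

    Implements y = f(Wx) * m with m_i ~ Bernoulli(1 - p): each output unit is
    zeroed independently with probability p.
    """
    pre_activation = W @ x                                           # Wx
    mask = rng.binomial(n=1, p=1.0 - p, size=pre_activation.shape)   # m_i ~ Bernoulli(1 - p)
    return activation(pre_activation) * mask                         # f(Wx) elementwise-multiplied by m
```

At test time the usual recipe drops the mask and instead rescales the weights (or activations) by $1 - p$ so that expected activations match those seen during training.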
Highlights
  • Deep neural networks are a topic of widespread interest in contemporary artificial intelligence and signal processing
  • A wide range of stochastic techniques inspired by the original dropout method have been proposed for use with deep learning models
  • We have described a wide range of advances in dropout methods above
  • It is generally accepted that standard dropout can regularize a wide range of neural network models, but there is room to achieve either faster training convergence or better final performance
  • The growth of convolutional and recurrent neural networks in practice has prompted the development of specialized methods that perform better than standard dropout on specific kinds of neural networks
  • The growth of Bayesian interpretations of dropout methods over the last few years points to new opportunities in theoretical justifications of dropout and similar stochastic methods, which corresponds to a broader trend of Bayesian and variational techniques advancing research into deep neural networks
Results
  • The authors show that training a neural network with standard dropout is equivalent to optimizing a variational objective that fits an approximate distribution to the posterior of a deep Gaussian process, a Bayesian machine learning model.
  • This section describes significant dropout methods that, like standard dropout, regularize dense feedforward neural network layers during training.
  • Several proposed dropout methods seek to improve regularization or speed up convergence by making dropout adaptive, that is, tuning dropout probabilities during training based on neuron weights or activations (see the adaptive-dropout sketch after this list).
  • Convolutional neural network layers require different regularization methods than standard dropout in order to generalize well [13, 38] (see the channel-wise dropout sketch after this list).
  • The authors show that if dropout is seen as a variational Monte Carlo approximation to a Bayesian posterior, the natural way to apply it to recurrent layers is to generate a dropout mask that zeroes out both feedforward and recurrent connections for each training sequence, but to keep the same mask at every time step in the sequence (see the recurrent-dropout sketch after this list).
  • This property means that dropout methods can be applied in compressing neural network models by reducing the number of parameters needed to perform effectively.
  • A deep Gaussian process is a Bayesian machine learning model that would normally produce a probability distribution as its output, and applying standard dropout at test time can be used to estimate characteristics of this underlying distribution (see the Monte Carlo dropout sketch after this list).
  • It is generally accepted that standard dropout can regularize a wide range of neural network models, but there is room to achieve either faster training convergence or better final performance.
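As a rough illustration of the adaptive idea mentioned above, the sketch below (in the spirit of standout by Ba and Frey [4], but not a faithful reimplementation) ties each unit's keep probability to its own pre-activation; alpha and beta are illustrative hyperparameters, not values from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adaptive_dropout_train(W, x, alpha=1.0, beta=0.0,
                           activation=np.tanh, rng=np.random.default_rng()):
    """Illustrative adaptive dropout: keep probabilities depend on activations.

    Unlike standard dropout's fixed rate, each unit's keep probability is a
    function of its own pre-activation, so strongly activated units are
    dropped less often.
    """
    pre_activation = W @ x
    keep_prob = sigmoid(alpha * pre_activation + beta)  # per-unit keep probability
    mask = rng.binomial(n=1, p=keep_prob)               # m_i ~ Bernoulli(keep_prob_i)
    return activation(pre_activation) * mask
```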
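For convolutional layers, one of the specialized alternatives surveyed is channel-wise ("spatial") dropout in the style of Tompson et al. [13], which drops entire feature maps rather than individual activations. A minimal sketch, assuming channel-first feature maps of shape (channels, height, width); the function name is illustrative.

```python
import numpy as np

def spatial_dropout_train(feature_maps, p, rng=np.random.default_rng()):
    """Channel-wise ("spatial") dropout for a convolutional layer.

    Rather than zeroing individual activations, whole feature maps are
    dropped: one Bernoulli(1 - p) draw per channel, broadcast over the
    spatial dimensions.
    """
    channel_mask = rng.binomial(n=1, p=1.0 - p, size=(feature_maps.shape[0], 1, 1))
    return feature_maps * channel_mask
```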
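The per-sequence masking idea for recurrent layers can be sketched as follows, assuming a plain tanh RNN cell; the function and variable names are illustrative. The key point is that the input mask and the recurrent mask are sampled once per sequence and reused at every time step.

```python
import numpy as np

def rnn_forward_variational_dropout(W_x, W_h, sequence, p, rng=np.random.default_rng()):
    """Per-sequence dropout for a plain tanh RNN (illustrative sketch).

    One mask for the inputs and one for the recurrent state are sampled once
    per training sequence and reused at every time step, so both feedforward
    and recurrent connections are zeroed consistently across time.
    """
    hidden = np.zeros(W_h.shape[0])
    input_mask = rng.binomial(n=1, p=1.0 - p, size=sequence[0].shape)  # fixed for the whole sequence
    hidden_mask = rng.binomial(n=1, p=1.0 - p, size=hidden.shape)      # fixed for the whole sequence
    outputs = []
    for x_t in sequence:  # the same masks are applied at every time step
        hidden = np.tanh(W_x @ (x_t * input_mask) + W_h @ (hidden * hidden_mask))
        outputs.append(hidden)
    return np.stack(outputs)
```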
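The uncertainty-estimation use mentioned above is often called Monte Carlo dropout: keep dropout active at test time, run several stochastic forward passes, and summarize the spread of the outputs. A minimal sketch, where stochastic_forward stands for any model forward pass that resamples its dropout masks on each call (a hypothetical helper, not an API from the paper).

```python
import numpy as np

def mc_dropout_predict(stochastic_forward, x, num_samples=100):
    """Monte Carlo dropout at test time.

    The mean of the stochastic outputs serves as the prediction and their
    standard deviation as a simple uncertainty estimate.
    """
    samples = np.stack([stochastic_forward(x) for _ in range(num_samples)])
    return samples.mean(axis=0), samples.std(axis=0)
```

For example, with the dense-layer sketch above one could call mc_dropout_predict(lambda v: dropout_layer_train(W, v, p=0.5), x) to get a predictive mean and a per-output uncertainty estimate.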
Conclusion
  • There are opportunities to develop improved methods that are specialized for particular kinds of networks or that use more advanced approaches for selecting neurons to drop.
  • The growth of Bayesian interpretations of dropout methods over the last few years points to new opportunities in theoretical justifications of dropout and similar stochastic methods, which corresponds to a broader trend of Bayesian and variational techniques advancing research into deep neural networks.
Reference
  • [1] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv preprint arXiv:1207.0580, 2012.
  • [2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25, 2012, pp. 1097–1105.
  • [3] L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus, "Regularization of neural networks using DropConnect," in Proceedings of the 30th International Conference on Machine Learning. PMLR, 2013.
  • [4] L. J. Ba and B. Frey, "Adaptive dropout for training deep neural networks," in Proceedings of the 26th International Conference on Neural Information Processing Systems. NIPS, 2013.
  • [5] S. Wang and C. Manning, "Fast dropout training," in Proceedings of the 30th International Conference on Machine Learning. PMLR, 2013.
  • [6] D. P. Kingma, T. Salimans, and M. Welling, "Variational dropout and the local reparameterization trick," in Advances in Neural Information Processing Systems 28, 2015, pp. 2575–2583.
  • [7] Y. Gal and Z. Ghahramani, "Dropout as a Bayesian approximation: Representing model uncertainty in deep learning," in Proceedings of the 33rd International Conference on Machine Learning. PMLR, 2016.
  • [8] D. Molchanov, A. Ashukha, and D. Vetrov, "Variational dropout sparsifies deep neural networks," in Proceedings of the 34th International Conference on Machine Learning. JMLR.org, 2017, pp. 2498–2507.
  • [9] K. Neklyudov, D. Molchanov, A. Ashukha, and D. P. Vetrov, "Structured Bayesian pruning via log-normal multiplicative noise," in Advances in Neural Information Processing Systems 30, 2017, pp. 6775–6784.
  • [10] H. Salehinejad and S. Valaee, "Ising-dropout: A regularization method for training and compression of deep neural networks," arXiv preprint arXiv:1902.08673, 2019.
  • [11] A. N. Gomez, I. Zhang, K. Swersky, Y. Gal, and G. E. Hinton, "Targeted dropout," in 2018 CDNNRIA Workshop at the 32nd Conference on Neural Information Processing Systems. NeurIPS, 2018.
  • [12] H. Wu and X. Gu, "Towards dropout training for convolutional neural networks," Neural Networks, vol. 71, pp. 1–10, 2015.
  • [13] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler, "Efficient object localization using convolutional networks," in IEEE CVPR. IEEE, 2015, pp. 648–656.
  • [14] S. Park and N. Kwak, "Analysis on the dropout effect in convolutional neural networks," in Asian Conference on Computer Vision. Springer, 2016, pp. 189–204.
  • [15] T. DeVries and G. W. Taylor, "Improved regularization of convolutional neural networks with cutout," arXiv preprint arXiv:1708.04552, 2017.
  • [16] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Weinberger, "Deep networks with stochastic depth," arXiv preprint arXiv:1603.09382, 2016.
  • [17] S. Cai, J. Gao, M. Zhang, W. Wang, G. Chen, and B. C. Ooi, "Effective and efficient dropout for deep convolutional neural networks," arXiv preprint arXiv:1904.03392, 2019.
  • [18] S. H. Khan, M. Hayat, and F. Porikli, "Regularization of deep neural networks with spectral dropout," Neural Networks, vol. 110, pp. 82–90, 2019.
  • [19] S. Hou and Z. Wang, "Weighted channel dropout for regularization of deep convolutional neural network," in Proceedings of the AAAI Conference on Artificial Intelligence, 2019.
  • [20] T. Moon, H. Choi, H. Lee, and I. Song, "Rnndrop: A novel dropout for RNNs in ASR," in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2015.
  • [21] Y. Gal and Z. Ghahramani, "A theoretically grounded application of dropout in recurrent neural networks," in Proceedings of the 30th International Conference on Neural Information Processing Systems. NIPS, 2016.
  • [22] S. Semeniuta, A. Severyn, and E. Barth, "Recurrent dropout without memory loss," in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics, 2016.
  • [23] D. Krueger, T. Maharaj, J. Kramar, M. Pezeshki, N. Ballas, N. R. Ke, A. Goyal, Y. Bengio, H. Larochelle, A. C. Courville, and C. Pal, "Zoneout: Regularizing RNNs by randomly preserving hidden activations," arXiv preprint arXiv:1606.01305, 2016.
  • [24] S. Merity, N. S. Keskar, and R. Socher, "Regularizing and optimizing LSTM language models," arXiv preprint arXiv:1708.02182, 2017.
  • [25] G. Melis, C. Dyer, and P. Blunsom, "On the state of the art of evaluation in neural language models," arXiv preprint arXiv:1707.05589, 2017.
  • [26] K. Zołna, D. Arpit, D. Suhubdy, and Y. Bengio, "Fraternal dropout," arXiv preprint arXiv:1711.00066, 2018.
  • [27] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
  • [28] D. Warde-Farley, I. J. Goodfellow, A. Courville, and Y. Bengio, "An empirical analysis of dropout in piecewise linear networks," in Proceedings of the International Conference on Learning Representations (ICLR), 2014.
  • [29] P. Baldi and P. J. Sadowski, "Understanding dropout," in Advances in Neural Information Processing Systems 26, 2013, pp. 2814–2822.
  • [30] S. Wager, S. Wang, and P. S. Liang, "Dropout training as adaptive regularization," in Advances in Neural Information Processing Systems, 2013, pp. 351–359.
  • [31] D. P. Helmbold and P. M. Long, "On the inductive bias of dropout," Journal of Machine Learning Research, vol. 16, no. 1, pp. 3403–3454, 2015.
  • [32] X. Bouthillier, K. Konda, P. Vincent, and R. Memisevic, "Dropout as data augmentation," arXiv preprint arXiv:1506.08700, 2015.
  • [33] A. Achille and S. Soatto, "Information dropout," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 12, pp. 2897–2905, 2018.
  • [34] Z. Li, B. Gong, and T. Yang, "Improved dropout for shallow and deep learning," in Proceedings of the 30th International Conference on Neural Information Processing Systems. NIPS, 2016.
  • [35] Y. Gal, J. Hron, and A. Kendall, "Concrete dropout," in Advances in Neural Information Processing Systems 30, 2017, pp. 3581–3590.
  • [36] S. J. Rennie, V. Goel, and S. Thomas, "Annealed dropout training of deep networks," in IEEE Spoken Language Technology Workshop (SLT), 2014.
  • [37] P. Morerio, J. Cavazza, R. Volpi, R. Vidal, and V. Murino, "Curriculum dropout," arXiv preprint arXiv:1703.06229, 2017.
  • [38] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE CVPR, 2016, pp. 770–778.
  • [39] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proceedings of the 32nd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, F. Bach and D. Blei, Eds., vol. 37. Lille, France: PMLR, 2015, pp. 448–456.
  • [40] S. Singh, D. Hoiem, and D. Forsyth, "Swapout: Learning an ensemble of deep architectures," in Proceedings of the 30th International Conference on Neural Information Processing Systems. NIPS, 2016.
  • [41] W. Zaremba, I. Sutskever, and O. Vinyals, "Recurrent neural network regularization," arXiv preprint arXiv:1409.2329, 2015.
  • [42] H. Salehinejad, J. Baarbe, S. Sankar, J. Barfett, E. Colak, and S. Valaee, "Recent advances in recurrent neural networks," arXiv preprint arXiv:1801.01078, 2017.
  • [43] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," in Proceedings of the International Conference on Learning Representations (ICLR), 2016.
  • [44] Y. Gal, "Uncertainty in deep learning," Ph.D. dissertation, University of Cambridge, 2016.
  • [45] L. Zhu and N. Laptev, "Deep and confident prediction for time series at Uber," arXiv preprint arXiv:1709.01907, 2017.
  • [46] A. Jungo, R. McKinley, R. Meier, U. Knecht, L. Vera, J. Pérez-Beteta, D. Molina-García, V. M. Pérez-García, R. Wiest, and M. Reyes, "Towards uncertainty-assisted brain tumor segmentation and survival prediction," in Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, A. Crimi, S. Bakas, H. Kuijf, B. Menze, and M. Reyes, Eds. Cham: Springer International Publishing, 2018, pp. 474–485.
  • [47] B. Lakshminarayanan, A. Pritzel, and C. Blundell, "Simple and scalable predictive uncertainty estimation using deep ensembles," in Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS, 2017.
  • [48] S. Park, J. Park, S.-J. Shin, and I.-C. Moon, "Adversarial dropout for supervised and semi-supervised learning," in Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
  • [49] K. Saito, Y. Ushiku, T. Harada, and K. Saenko, "Adversarial dropout regularization," arXiv preprint arXiv:1711.01575, 2018.
  • [50] S. Park, K. Song, M. Ji, W. Lee, and I.-C. Moon, "Adversarial dropout for recurrent neural networks," arXiv preprint arXiv:1904.09816, 2019.
  • [51] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, "Maxout networks," arXiv preprint arXiv:1302.4389, 2013.
  • [52] Y. Li and Y. Gal, "Dropout inference in Bayesian neural networks with alpha-divergences," in Proceedings of the 34th International Conference on Machine Learning (ICML'17), 2017, pp. 2052–2061.