Survey of Dropout Methods for Deep Neural Networks
arXiv: Neural and Evolutionary Computing, 2019.
Abstract:
Dropout methods are a family of stochastic techniques used in neural network training or inference that have generated significant research interest and are widely used in practice. They have been successfully applied in neural network regularization, model compression, and in measuring the uncertainty of neural network outputs. While ori...
Introduction
- Deep neural networks are a topic of widespread interest in contemporary artificial intelligence and signal processing.
- The behaviour of standard dropout during training for a neural network layer is given by y = f(Wx) ◦ m, with m_i ∼ Bernoulli(1 − p), where y is the layer output, f(·) is the activation function, W is the layer weight matrix, x is the layer input, and m is the layer dropout mask, with each element m_i being 0 with probability p. A minimal code sketch of this forward pass is given below.
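To make the notation concrete, here is a minimal NumPy sketch of the training-time forward pass above. The layer sizes, the ReLU activation, and the inverted-dropout rescaling by 1/(1 − p) (so that no weight scaling is needed at test time) are illustrative assumptions rather than part of the definition.

```python
import numpy as np

def dropout_layer_forward(x, W, p=0.5, train=True, rng=None):
    """Training-time forward pass y = f(Wx) ◦ m with m_i ~ Bernoulli(1 - p).

    Uses "inverted" dropout: surviving activations are scaled by 1 / (1 - p)
    during training so the layer can be used unchanged at test time
    (an assumption; the original formulation rescales weights after training).
    """
    if rng is None:
        rng = np.random.default_rng()
    a = np.maximum(0.0, W @ x)        # f(Wx), with f = ReLU as an illustrative choice
    if not train or p == 0.0:
        return a                      # no masking at inference time
    m = rng.random(a.shape) >= p      # m_i = 0 with probability p
    return a * m / (1.0 - p)          # elementwise mask, rescaled

# Example: a layer with 4 inputs and 3 output units
rng = np.random.default_rng(42)
W = rng.standard_normal((3, 4))
x = rng.standard_normal(4)
print(dropout_layer_forward(x, W, p=0.5, rng=rng))
```

At test time the mask is simply omitted; in the original formulation the weights are instead scaled by the retention probability 1 − p after training, which is equivalent in expectation.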
Highlights
- Deep neural networks are a topic of widespread interest in contemporary artificial intelligence and signal processing
- A wide range of stochastic techniques inspired by the original dropout method have been proposed for use with deep learning models
- We have described a wide range of advances in dropout methods above
- It is generally accepted that standard dropout can regularize a wide range of neural network models, but there is room to achieve either faster training convergence or better final performance
- The growth of convolutional and recurrent neural networks in practice has prompted the development of specialized methods that perform better than standard dropout on specific kinds of neural networks
- The growth of Bayesian interpretations of dropout methods over the last few years points to new opportunities in theoretical justifications of dropout and similar stochastic methods, which corresponds to a broader trend of Bayesian and variational techniques advancing research into deep neural networks
Results
- The authors show that training a neural network with standard dropout is equivalent to optimizing a variational objective between an approximate distribution and the posterior of a deep Gaussian process, which is a Bayesian machine learning model.
- This section describes significant dropout methods that, like standard dropout, regularize dense feedforward neural network layers during training.
- Several proposed dropout methods seek to improve regularization or speed up convergence by making dropout adaptive, that is, tuning dropout probabilities during training based on neuron weights or activations (a sketch of this idea follows after this list).
- Convolutional neural network layers require different regularization methods than standard dropout in order to generalize well [13, 38] (see the spatial-dropout sketch after this list).
- The authors show that if dropout is seen as a variational Monte Carlo approximation to a Bayesian posterior, the natural way to apply it to recurrent layers is to generate a dropout mask that zeroes out both feedforward and recurrent connections for each training sequence, but to keep the same mask for each time step in the sequence (see the recurrent sketch after this list).
- This property means that dropout methods can be applied in compressing neural network models by reducing the number of parameters needed to perform effectively.
- A deep Gaussian process is a Bayesian machine learning model that would normally produce a probability distribution as its output, and applying standard dropout at test time can be used to estimate characteristics of this underlying distribution (see the Monte Carlo dropout sketch after this list).
- It is generally accepted that standard dropout can regularize a wide range of neural network models, but there is room to achieve either faster training convergence or better final performance.
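To illustrate the adaptive methods mentioned in the list above, the sketch below follows the spirit of Standout [4]: each unit's keep probability is computed from the data through a sigmoid of a scaled pre-activation. Tying the overlay parameters to the layer weights via the scalars alpha and beta, and the ReLU activation, are illustrative assumptions rather than the exact formulation from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def standout_like_forward(x, W, alpha=1.0, beta=0.0, rng=None):
    """Adaptive dropout in the spirit of Standout: the keep probability of each
    unit depends on the data, here via a sigmoid of a scaled pre-activation.
    alpha and beta (tied overlay parameters) are illustrative assumptions."""
    if rng is None:
        rng = np.random.default_rng()
    pre = W @ x
    keep_prob = sigmoid(alpha * pre + beta)   # per-unit keep probability
    m = rng.random(pre.shape) < keep_prob     # adaptive Bernoulli mask
    return np.maximum(0.0, pre) * m, keep_prob

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 4))
x = rng.standard_normal(4)
y, keep_prob = standout_like_forward(x, W, rng=rng)
print(y, keep_prob)
```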
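For the convolutional case noted above, one widely used specialization is spatial dropout [13], which drops entire feature maps rather than individual activations, since neighbouring values within a map are strongly correlated. Below is a minimal sketch assuming a single (channels, height, width) activation tensor and the inverted-dropout rescaling convention.

```python
import numpy as np

def spatial_dropout(feature_maps, p=0.2, rng=None):
    """Drop whole channels of a (C, H, W) activation tensor with probability p,
    rescaling survivors by 1 / (1 - p) (inverted-dropout convention, an assumption)."""
    if rng is None:
        rng = np.random.default_rng()
    num_channels = feature_maps.shape[0]
    keep = rng.random(num_channels) >= p                 # one Bernoulli draw per channel
    mask = keep[:, None, None].astype(feature_maps.dtype)
    return feature_maps * mask / (1.0 - p)

activations = np.random.default_rng(2).standard_normal((8, 5, 5))
print(spatial_dropout(activations, p=0.25).shape)        # (8, 5, 5), some channels all zero
```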
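The recurrent-network result summarized above corresponds in practice to sampling the dropout masks once per sequence and reusing them at every time step, on both the input and the recurrent connections [21]. The sketch below does this for a plain tanh RNN cell; the cell equations, dimensions, and inverted-dropout rescaling are illustrative assumptions.

```python
import numpy as np

def variational_rnn_forward(xs, W_x, W_h, p=0.3, rng=None):
    """Run a simple tanh RNN over a sequence, applying dropout masks that are
    sampled once and kept fixed for every time step (per-sequence masks)."""
    if rng is None:
        rng = np.random.default_rng()
    hidden = np.zeros(W_h.shape[0])
    scale = 1.0 / (1.0 - p)
    # One mask for the inputs and one for the recurrent state, reused at all steps.
    m_x = (rng.random(W_x.shape[1]) >= p) * scale
    m_h = (rng.random(W_h.shape[0]) >= p) * scale
    outputs = []
    for x_t in xs:                                        # xs has shape (T, input_dim)
        hidden = np.tanh(W_x @ (x_t * m_x) + W_h @ (hidden * m_h))
        outputs.append(hidden)
    return np.stack(outputs)

rng = np.random.default_rng(3)
W_x = rng.standard_normal((6, 4))     # hidden_dim = 6, input_dim = 4
W_h = rng.standard_normal((6, 6))
xs = rng.standard_normal((10, 4))     # sequence of length 10
print(variational_rnn_forward(xs, W_x, W_h).shape)        # (10, 6)
```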
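Finally, the uncertainty-estimation use described above is usually realized as Monte Carlo dropout [7]: dropout is left active at test time, several stochastic forward passes are made, and their mean and spread serve as the prediction and a rough uncertainty estimate. The two-layer network and the number of samples in the sketch below are assumptions for illustration.

```python
import numpy as np

def stochastic_forward(x, W1, W2, p=0.5, rng=None):
    """One forward pass through a two-layer network with dropout left ON."""
    if rng is None:
        rng = np.random.default_rng()
    h = np.maximum(0.0, W1 @ x)
    h = h * (rng.random(h.shape) >= p) / (1.0 - p)   # dropout stays active at test time
    return W2 @ h

def mc_dropout_predict(x, W1, W2, n_samples=100, p=0.5, seed=0):
    """Monte Carlo dropout: the mean of the stochastic passes is the prediction,
    their standard deviation a (rough) uncertainty estimate."""
    rng = np.random.default_rng(seed)
    samples = np.stack([stochastic_forward(x, W1, W2, p, rng) for _ in range(n_samples)])
    return samples.mean(axis=0), samples.std(axis=0)

rng = np.random.default_rng(4)
W1, W2 = rng.standard_normal((16, 8)), rng.standard_normal((1, 16))
x = rng.standard_normal(8)
mean, std = mc_dropout_predict(x, W1, W2)
print(f"prediction {mean[0]:+.3f} +/- {std[0]:.3f}")
```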
Conclusion
- There are opportunities to develop improved methods that are specialized for particular kinds of networks or that use more advanced approaches for selecting neurons to drop.
- The growth of Bayesian interpretations of dropout methods over the last few years points to new opportunities in theoretical justifications of dropout and similar stochastic methods, which corresponds to a broader trend of Bayesian and variational techniques advancing research into deep neural networks.
References
- [1] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580, 2012.
- [2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, 2012, pp. 1097–1105.
- [3] L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus, “Regularization of neural networks using DropConnect,” in Proceedings of the 30th International Conference on Machine Learning. PMLR, 2013.
- [4] L. J. Ba and B. Frey, “Adaptive dropout for training deep neural networks,” in Proceedings of the 26th International Conference on Neural Information Processing Systems. NIPS, 2013.
- [5] S. Wang and C. Manning, “Fast dropout training,” in Proceedings of the 30th International Conference on Machine Learning. PMLR, 2013.
- [6] D. P. Kingma, T. Salimans, and M. Welling, “Variational dropout and the local reparameterization trick,” in Advances in Neural Information Processing Systems 28, 2015, pp. 2575–2583.
- [7] Y. Gal and Z. Ghahramani, “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning,” in Proceedings of the 33rd International Conference on Machine Learning. PMLR, 2016.
- [8] D. Molchanov, A. Ashukha, and D. Vetrov, “Variational dropout sparsifies deep neural networks,” in Proceedings of the 34th International Conference on Machine Learning. JMLR.org, 2017, pp. 2498–2507.
- [9] K. Neklyudov, D. Molchanov, A. Ashukha, and D. P. Vetrov, “Structured Bayesian pruning via log-normal multiplicative noise,” in Advances in Neural Information Processing Systems 30, 2017, pp. 6775–6784.
- [10] H. Salehinejad and S. Valaee, “Ising-dropout: A regularization method for training and compression of deep neural networks,” arXiv preprint arXiv:1902.08673, 2019.
- [11] A. N. Gomez, I. Zhang, K. Swersky, Y. Gal, and G. E. Hinton, “Targeted dropout,” in 2018 CDNNRIA Workshop at the 32nd Conference on Neural Information Processing Systems. NeurIPS, 2018.
- [12] H. Wu and X. Gu, “Towards dropout training for convolutional neural networks,” Neural Networks, vol. 71, pp. 1–10, 2015.
- [13] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler, “Efficient object localization using convolutional networks,” in IEEE CVPR. IEEE, 2015, pp. 648–656.
- [14] S. Park and N. Kwak, “Analysis on the dropout effect in convolutional neural networks,” in Asian Conference on Computer Vision. Springer, 2016, pp. 189–204.
- [15] T. DeVries and G. W. Taylor, “Improved regularization of convolutional neural networks with cutout,” arXiv preprint arXiv:1708.04552, 2017.
- [16] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Weinberger, “Deep networks with stochastic depth,” arXiv preprint arXiv:1603.09382, 2016.
- [17] S. Cai, J. Gao, M. Zhang, W. Wang, G. Chen, and B. C. Ooi, “Effective and efficient dropout for deep convolutional neural networks,” arXiv preprint arXiv:1904.03392, 2019.
- [18] S. H. Khan, M. Hayat, and F. Porikli, “Regularization of deep neural networks with spectral dropout,” Neural Networks, vol. 110, pp. 82–90, 2019.
- [19] S. Hou and Z. Wang, “Weighted channel dropout for regularization of deep convolutional neural network,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2019.
- [20] T. Moon, H. Choi, H. Lee, and I. Song, “RNNDrop: A novel dropout for RNNs in ASR,” in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2015.
- [21] Y. Gal and Z. Ghahramani, “A theoretically grounded application of dropout in recurrent neural networks,” in Proceedings of the 30th International Conference on Neural Information Processing Systems. NIPS, 2016.
- [22] S. Semeniuta, A. Severyn, and E. Barth, “Recurrent dropout without memory loss,” in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics, 2016.
- [23] D. Krueger, T. Maharaj, J. Kramar, M. Pezeshki, N. Ballas, N. R. Ke, A. Goyal, Y. Bengio, H. Larochelle, A. C. Courville, and C. Pal, “Zoneout: Regularizing RNNs by randomly preserving hidden activations,” arXiv preprint arXiv:1606.01305, 2016.
- [24] S. Merity, N. S. Keskar, and R. Socher, “Regularizing and optimizing LSTM language models,” arXiv preprint arXiv:1708.02182, 2017.
- [25] G. Melis, C. Dyer, and P. Blunsom, “On the state of the art of evaluation in neural language models,” arXiv preprint arXiv:1707.05589, 2017.
- [26] K. Zołna, D. Arpit, D. Suhubdy, and Y. Bengio, “Fraternal dropout,” arXiv preprint arXiv:1711.00066, 2018.
- [27] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
- [28] D. Warde-Farley, I. J. Goodfellow, A. Courville, and Y. Bengio, “An empirical analysis of dropout in piecewise linear networks,” in Proceedings of the International Conference on Learning Representations (ICLR), 2014.
- [29] P. Baldi and P. J. Sadowski, “Understanding dropout,” in Advances in Neural Information Processing Systems 26, 2013, pp. 2814–2822.
- [30] S. Wager, S. Wang, and P. S. Liang, “Dropout training as adaptive regularization,” in Advances in Neural Information Processing Systems, 2013, pp. 351–359.
- [31] D. P. Helmbold and P. M. Long, “On the inductive bias of dropout,” Journal of Machine Learning Research, vol. 16, no. 1, pp. 3403–3454, 2015.
- [32] X. Bouthillier, K. Konda, P. Vincent, and R. Memisevic, “Dropout as data augmentation,” arXiv preprint arXiv:1506.08700, 2015.
- [33] A. Achille and S. Soatto, “Information dropout,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 12, pp. 2897–2905, 2018.
- [34] Z. Li, B. Gong, and T. Yang, “Improved dropout for shallow and deep learning,” in Proceedings of the 30th International Conference on Neural Information Processing Systems. NIPS, 2016.
- [35] Y. Gal, J. Hron, and A. Kendall, “Concrete dropout,” in Advances in Neural Information Processing Systems 30, 2017, pp. 3581–3590.
- [36] S. J. Rennie, V. Goel, and S. Thomas, “Annealed dropout training of deep networks,” in IEEE Spoken Language Technology Workshop (SLT), 2014.
- [37] P. Morerio, J. Cavazza, R. Volpi, R. Vidal, and V. Murino, “Curriculum dropout,” arXiv preprint arXiv:1703.06229, 2017.
- [38] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE CVPR, 2016, pp. 770–778.
- [39] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, F. Bach and D. Blei, Eds., vol. 37. Lille, France: PMLR, 2015, pp. 448–456.
- [40] S. Singh, D. Hoiem, and D. Forsyth, “Swapout: Learning an ensemble of deep architectures,” in Proceedings of the 30th International Conference on Neural Information Processing Systems. NIPS, 2016.
- [41] W. Zaremba, I. Sutskever, and O. Vinyals, “Recurrent neural network regularization,” arXiv preprint arXiv:1409.2329, 2015.
- [42] H. Salehinejad, J. Baarbe, S. Sankar, J. Barfett, E. Colak, and S. Valaee, “Recent advances in recurrent neural networks,” arXiv preprint arXiv:1801.01078, 2017.
- [43] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” in Proceedings of the International Conference on Learning Representations (ICLR), 2016.
- [44] Y. Gal, “Uncertainty in deep learning,” Ph.D. dissertation, University of Cambridge, 2016.
- [45] L. Zhu and N. Laptev, “Deep and confident prediction for time series at Uber,” arXiv preprint arXiv:1709.01907, 2017.
- [46] A. Jungo, R. McKinley, R. Meier, U. Knecht, L. Vera, J. Perez-Beteta, D. Molina-Garcıa, V. M. Perez-Garcıa, R. Wiest, and M. Reyes, “Towards uncertainty-assisted brain tumor segmentation and survival prediction,” in Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, A. Crimi, S. Bakas, H. Kuijf, B. Menze, and M. Reyes, Eds. Cham: Springer International Publishing, 2018, pp. 474–485.
- [47] B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and scalable predictive uncertainty estimation using deep ensembles,” in Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS, 2017.
- [48] S. Park, J. Park, S.-J. Shin, and I.-C. Moon, “Adversarial dropout for supervised and semisupervised learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
- [49] K. Saito, Y. Ushiku, T. Harada, and K. Saenko, “Adversarial dropout regularization,” arXiv preprint arXiv:1711.01575, 2018.
- [50] S. Park, K. Song, M. Ji, W. Lee, and I.-C. Moon, “Adversarial dropout for recurrent neural networks,” arXiv preprint arXiv:1904.09816, 2019.
- [51] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, “Maxout networks,” arXiv preprint arXiv:1302.4389, 2013.
- [52] Y. Li and Y. Gal, “Dropout inference in Bayesian neural networks with alpha-divergences,” in Proceedings of the 34th International Conference on Machine Learning (ICML’17), 2017, pp. 2052–2061.