# Contextual Dropout: An Efficient Sample-Dependent Dropout Module

International Conference on Learning Representations (ICLR), 2021.

Abstract:

Dropout has been demonstrated as a simple and effective module that not only regularizes the training of deep neural networks but also provides uncertainty estimates for prediction. However, the quality of uncertainty estimation is highly dependent on the dropout probabilities. Most current models use the same dropout distribution…

Introduction

- Deep neural networks (NNs) have become ubiquitous and achieved state-of-the-art results in a wide variety of research problems (LeCun et al., 2015).
- Applying dropout to a NN often means element-wise reweighting of each layer with a per-sample Bernoulli/Gaussian random mask z_i, drawn i.i.d. from a prior p_η(z) parameterized by η (Hinton et al., 2012; Srivastava et al., 2014).
- This implies dropout training can be viewed as approximate Bayesian inference (Gal & Ghahramani, 2016).
- Whether the KL term is explicitly imposed is a key distinction between regular dropout (Hinton et al., 2012; Srivastava et al., 2014) and its Bayesian generalizations (Gal & Ghahramani, 2016; Gal et al., 2017; Kingma et al., 2015; Molchanov et al., 2017; Boluki et al., 2020).
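
The mask-based view of dropout above can be sketched in a few lines of NumPy; the function name and the inverted-dropout rescaling convention below are illustrative choices, not the authors' code.

```python
import numpy as np

def bernoulli_dropout(h, drop_prob, rng, train=True):
    """Element-wise reweighting with an i.i.d. Bernoulli mask (inverted dropout).

    At train time each unit is kept with probability 1 - drop_prob and the
    kept units are rescaled by 1 / (1 - drop_prob); at test time h passes
    through unchanged, so activations match in expectation.
    """
    if not train:
        return h
    keep = 1.0 - drop_prob
    z = rng.binomial(1, keep, size=h.shape)  # mask z_i ~ Bernoulli(1 - drop_prob), iid per unit
    return h * z / keep

rng = np.random.default_rng(0)
h = np.ones((4, 8))
out = bernoulli_dropout(h, drop_prob=0.5, rng=rng)
# each entry of out is either 0 (dropped) or 2 (kept and rescaled by 1/0.5)
```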

Highlights

- Deep neural networks (NNs) have become ubiquitous and achieved state-of-the-art results in a wide variety of research problems (LeCun et al., 2015)
- We have proposed contextual dropout as a simple and scalable data-dependent dropout module that achieves strong performance in both accuracy and uncertainty estimation on a variety of tasks, including large-scale applications
- With an efficient parameterization of the covariate-dependent variational distribution, contextual dropout boosts the flexibility of Bayesian neural networks while only slightly increasing memory and computational cost
- We demonstrate the general applicability of contextual dropout to fully connected, convolutional, and attention layers, and show that contextual dropout masks are compatible with both Bernoulli and Gaussian distributions
- On ImageNet, it is even possible to improve the performance of a pretrained model by adding the contextual dropout module during a fine-tuning stage
- We believe contextual dropout can serve as an efficient alternative to data-independent dropouts in the versatile toolbox of dropout modules
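
For the Gaussian case mentioned above, a common multiplicative-noise formulation (Srivastava et al., 2014) matches the mean and variance of rescaled Bernoulli dropout; the sketch below is a generic illustration with names of our own choosing, not the authors' module.

```python
import numpy as np

def gaussian_dropout(h, drop_prob, rng, train=True):
    """Multiplicative Gaussian-noise dropout (Srivastava et al., 2014).

    Each unit is multiplied by noise ~ N(1, sigma^2) with
    sigma^2 = drop_prob / (1 - drop_prob), matching the mean and variance
    of rescaled Bernoulli dropout; since the noise has mean 1, no extra
    rescaling is needed at test time.
    """
    if not train:
        return h
    sigma = np.sqrt(drop_prob / (1.0 - drop_prob))
    return h * rng.normal(1.0, sigma, size=h.shape)

rng = np.random.default_rng(0)
h = np.full((100_000,), 2.0)
out = gaussian_dropout(h, drop_prob=0.5, rng=rng)
# out.mean() stays close to 2.0 because the multiplicative noise is centered at 1
```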

Methods

- Compared methods: MC-Bernoulli, MC-Gaussian, Concrete, Bayes by Backprop, Contextual Gating, and Contextual Gating + Dropout

Results

**Results and analysis**

- In Table 1, the authors show accuracy, PAvPU (p-value threshold equal to 0.05), and test predictive log-likelihood with error bars (5 random runs) for models with different dropouts under challenging noisy data.
- The authors consistently observe that contextual dropout outperforms the other models in accuracy, uncertainty estimation, and log-likelihood.
- The authors observe that (1) contextual dropout predicts the correct answer when it is certain, (2) contextual dropout is certain and predicts the correct answer on many images for which MC or concrete dropout is uncertain, (3) MC or concrete dropout is uncertain about some easy examples or certain about some wrong predictions, and (4) on an image where all three methods have high uncertainty, contextual dropout places a higher probability on the correct answer than the other two.
- Visualization: In Figures 12-15 in Appendix F.3, the authors visualize some image-question pairs along with the human annotations, and compare the predictions and uncertainty estimates of the different dropouts.
- As shown in the plots, contextual dropout is overall more conservative on its wrong predictions and more certain on its correct predictions than the other methods (see more detailed explanations in Appendix F.3).
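
PAvPU, the uncertainty metric reported in these tables, combines per-example accuracy and certainty counts; the sketch below follows the definition of Mukhoti & Gal (2018), with the function name and thresholding convention being our own.

```python
import numpy as np

def pavpu(correct, uncertainty, threshold):
    """Patch Accuracy vs Patch Uncertainty, following Mukhoti & Gal (2018).

    correct:     per-example boolean, whether the prediction is accurate
    uncertainty: per-example float, higher means less certain
    threshold:   examples with uncertainty > threshold count as uncertain

    PAvPU = (n_accurate_and_certain + n_inaccurate_and_uncertain) / n_total,
    rewarding models that are certain when right and uncertain when wrong.
    """
    correct = np.asarray(correct, dtype=bool)
    uncertain = np.asarray(uncertainty) > threshold
    n_ac = np.sum(correct & ~uncertain)   # accurate and certain
    n_iu = np.sum(~correct & uncertain)   # inaccurate and uncertain
    return (n_ac + n_iu) / correct.size

# a model certain on its correct answers and uncertain on its wrong one scores 1.0
score = pavpu([True, True, False], [0.1, 0.2, 0.9], threshold=0.5)
```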

Conclusion

- The authors have proposed contextual dropout as a simple and scalable data-dependent dropout module that achieves strong performance in both accuracy and uncertainty estimation on a variety of tasks, including large-scale applications.
- On ImageNet, it is even possible to improve the performance of a pretrained model by adding the contextual dropout module during a fine-tuning stage.
- Based on these results, the authors believe contextual dropout can serve as an efficient alternative to data-independent dropouts in the versatile toolbox of dropout modules.

Summary

## Objectives:

Parameter sharing between encoder and decoder: The authors aim to build an encoder by modeling q_φ(z_l | x_{l−1}), where x may come from complex and highly structured data such as images and natural language.

- Table 1: Results on noisy MNIST with MLP
- Table 2: Results on CIFAR-100 with WRN
- Table 3: Results on ImageNet with ResNet-18
- Table 4: Accuracy and PAvPU on visual question answering
- Table 5: Model size comparison among different methods
- Table 6: Complete results on MNIST with MLP
- Table 7: Log-likelihood on original MNIST with MLP
- Table 8: Complete results on CIFAR-10 with WRN
- Table 9: Complete log-likelihood results on CIFAR-10 with WRN
- Table 10: Complete results on CIFAR-100 with WRN

Related work

- Data-dependent variational distribution: Deng et al. (2018) model attention as latent alignment variables and optimize a tighter lower bound (compared to hard attention) using a learned inference network. To balance exploration and exploitation in contextual bandit problems, Wang & Zhou (2019) introduce local-variable uncertainty under the Thompson sampling framework. However, their inference networks are both independent of the decoder, which may considerably increase memory and computational cost for the considered applications. In addition, while the scope of Deng et al. (2018) is limited to attention units and that of Wang & Zhou (2019) to contextual bandits, the authors demonstrate the general applicability of contextual dropout to fully connected, convolutional, and attention layers in supervised learning models.
- Conditional computation: Conditional computation (Bengio et al., 2015; 2013; Shazeer et al., 2017; Teja Mullapudi et al., 2018) tries to increase model capacity without a proportional increase in computation: an independent gating network decides which parts of a network to activate for each example. In contextual dropout, the encoder works much like a gating network, choosing the distribution of sub-networks for each sample, but the potential gain in model capacity is even larger; e.g., there are potentially ∼ O((2^d)^L) combinations of nodes for L fully connected layers, where d is the order of the number of nodes in one layer.
- Generalization of dropout: DropConnect (Wan et al., 2013) generalizes dropout by randomly dropping the weights rather than the activations. The dropout distributions for the weights, however, are still the same across different samples. Contextual dropout instead uses sample-dependent dropout probabilities, allowing different samples to have different dropout distributions.
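
As a rough illustration of how an encoder can act like a gating network that produces sample-dependent dropout probabilities, the following toy module is a sketch under our own assumptions (a low-rank linear encoder and simple random initialization); it is not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ContextualDropoutSketch:
    """Toy sample-dependent dropout for one fully connected layer.

    A lightweight encoder (here just two linear maps with a bottleneck of
    width d // gamma, loosely in the spirit of squeeze-and-excitation
    gating) turns each sample's activations into per-unit keep
    probabilities, so different samples draw masks from different
    distributions -- unlike regular dropout, whose probabilities are
    shared across all data.
    """

    def __init__(self, d, gamma=8, t=0.01, seed=0):
        rng = np.random.default_rng(seed)
        hidden = max(1, d // gamma)                 # channel factor gamma
        self.w1 = rng.normal(0.0, 0.1, (d, hidden))
        self.w2 = rng.normal(0.0, 0.1, (hidden, d))
        self.t = t                                  # sigmoid scaling factor
        self.rng = rng

    def keep_prob(self, h):
        # encoder: per-sample logits -> keep probabilities in (0, 1)
        logits = (h @ self.w1) @ self.w2            # low-rank linear encoder for brevity
        return sigmoid(self.t * logits)

    def __call__(self, h):
        p = self.keep_prob(h)                       # shape (batch, d), one row per sample
        z = self.rng.binomial(1, p)                 # contextual Bernoulli mask
        return h * z / np.clip(p, 1e-6, None)       # inverted-dropout rescaling

layer = ContextualDropoutSketch(d=16)
h = np.abs(np.random.default_rng(1).normal(size=(4, 16)))
out = layer(h)  # each row is masked with its own, input-dependent probabilities
```

Because the encoder reuses the layer's own activations as input, the extra cost is only the two small bottleneck matrices, which is what keeps the parameter overhead modest.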

Funding

- As shown in Table 5 in the Appendix, contextual dropout introduces only 16% additional parameters.

Reference

- Martín Abadi, Ashish Agarwal, Paul Barham, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
- Jimmy Ba and Brendan Frey. Adaptive dropout for training deep neural networks. In Advances in Neural Information Processing Systems, pp. 3084–3092, 2013.
- Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. Conditional computation in neural networks for faster models. arXiv preprint arXiv:1511.06297, 2015.
- Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
- David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
- Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
- Shahin Boluki, Randy Ardywibowo, Siamak Zamani Dadaneh, Mingyuan Zhou, and Xiaoning Qian. Learnable Bernoulli dropout for Bayesian deep learning. In Artificial Intelligence and Statistics, 2020.
- Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255.
- Yuntian Deng, Yoon Kim, Justin Chiu, Demi Guo, and Alexander Rush. Latent alignment and variational attention. In Advances in Neural Information Processing Systems, pp. 9712–9724, 2018.
- Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059, 2016.
- Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. In Advances in Neural Information Processing Systems, pp. 3581–3590, 2017.
- Asghar Ghasemi and Saleh Zahediasl. Normality tests for statistical analysis: a guide for non-statisticians. International journal of endocrinology and metabolism, 10(2):486, 2012.
- Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6904–6913, 2017.
- Alex Graves. Practical variational inference for neural networks. In Advances in neural information processing systems, pp. 2348–2356, 2011.
- Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1321–1330. JMLR. org, 2017.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034, 2015.
- Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
- Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
- Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141, 2018.
- Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
- Durk P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pp. 2575–2583, 2015.
- Alex Krizhevsky et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
- Volodymyr Kuleshov, Nathan Fenner, and Stefano Ermon. Accurate uncertainties for deep learning using calibrated regression. arXiv preprint arXiv:1807.00263, 2018.
- Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402–6413, 2017.
- Hugo Larochelle, Dumitru Erhan, Aaron Courville, James Bergstra, and Yoshua Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th international conference on Machine learning, pp. 473–480, 2007.
- Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2:18, 2010.
- Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–444, 2015.
- Chunyuan Li, Changyou Chen, David Carlson, and Lawrence Carin. Preconditioned stochastic gradient Langevin dynamics for deep neural networks. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
- Yang Li and Shihao Ji. L0-ARM: Network sparsification via stochastic binary optimization. In The European Conference on Machine Learning (ECML), 2019.
- Christos Louizos and Max Welling. Multiplicative normalizing flows for variational Bayesian neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2218–2227. JMLR. org, 2017.
- David JC MacKay. A practical bayesian framework for backpropagation networks. Neural computation, 4(3): 448–472, 1992.
- Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2498–2507. JMLR. org, 2017.
- Jishnu Mukhoti and Yarin Gal. Evaluating bayesian deep learning methods for semantic segmentation. arXiv preprint arXiv:1811.12709, 2018.
- Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
- Radford M Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
- Yurii E Nesterov. A method for solving the convex programming problem with convergence rate o (1/k 2). In Dokl. akad. nauk Sssr, volume 269, pp. 543–547, 1983.
- Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543, 2014.
- Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99, 2015.
- Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pp. 1278–1286, 2014.
- Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
- Jiaxin Shi, Shengyang Sun, and Jun Zhu. Kernel implicit variational inference. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=r1l4eQW0Z.
- Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1): 1929–1958, 2014.
- Ravi Teja Mullapudi, William R Mark, Noam Shazeer, and Kayvon Fatahalian. Hydranets: Specialized dynamic architectures for efficient inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8080–8089, 2018.
- Damien Teney, Peter Anderson, Xiaodong He, and Anton Van Den Hengel. Tips and tricks for visual question answering: Learnings from the 2017 challenge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4223–4232, 2018.
- Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 648–656, 2015.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017.
- Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In International conference on machine learning, pp. 1058–1066, 2013.
- Zhendong Wang and Mingyuan Zhou. Thompson sampling via local uncertainty. arXiv preprint arXiv:1910.13673, 2019.
- Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pp. 681–688, 2011.
- Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
- Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015a.
- Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057, 2015b.
- Mingzhang Yin and Mingyuan Zhou. ARM: Augment-REINFORCE-merge gradient for discrete latent variable models. Preprint, May 2018.
- Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian. Deep modular co-attention networks for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6281–6290, 2019.
- Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
- In this section, we explain the implementation details of ARM for Bernoulli contextual dropout. To compute the gradients with respect to the parameters of the variational distribution, a commonly used gradient estimator is the REINFORCE estimator (Williams, 1992), ∇_φ E_{q_φ(z)}[f(z)] = E_{q_φ(z)}[f(z) ∇_φ log q_φ(z)].
- This gradient estimator is, however, known to have high variance (Yin & Zhou, 2018). To mitigate this issue, we use ARM to compute the gradients for the Bernoulli random variables.
- A similar sigmoid scaling is used by Li & Ji (2019) to facilitate the transition of the probability between 0 and 1 for the purpose of pruning.
- Choice of hyperparameters in contextual dropout: Contextual dropout introduces two additional hyperparameters compared to regular dropout. One is the channel factor γ for the encoder network. In our experiments, the results are not sensitive to the value of γ: any number from 8 to 16 gives similar results, as also observed by Hu et al. (2018). The other is the sigmoid scaling factor t, which controls the learning rate of the encoder. We find that the performance is not very sensitive to its value and that it is often beneficial to make it smaller than the learning rate of the decoder. In all experiments considered in the paper, which cover various noise levels and model sizes, we have simply fixed it at t = 0.01.
- WRN: We consider WRN (Zagoruyko & Komodakis, 2016), which includes 25 convolutional layers. In Figure 6, we show the architecture of WRN, where dropout is applied to the first convolutional layer in each network block; in total, dropout is applied to 12 convolutional layers. We use CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009) as benchmarks. All experiments are trained for 200 epochs with the Nesterov momentum optimizer (Nesterov, 1983), with a base learning rate of 0.1 decayed by a factor of 1/5 at epochs 60 and 120. All other hyperparameters are the same as for the MLP, except for Gaussian dropout, where we use a standard deviation of 0.8 for CIFAR-100 with no noise and 1 for all other cases.
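
The ARM estimator mentioned in the first bullet above can be sketched for a single Bernoulli variable; this follows the estimator of Yin & Zhou (2018), with the function names and the toy reward f being our own choices rather than the paper's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def arm_gradient(f, phi, n_samples, seed=0):
    """ARM estimate (Yin & Zhou, 2018) of d/dphi E_{z ~ Ber(sigmoid(phi))}[f(z)].

    Each uniform draw u yields two antithetic binary configurations,
    z1 = 1[u > sigmoid(-phi)] and z2 = 1[u < sigmoid(phi)]; the reward
    difference times (u - 1/2) is an unbiased gradient estimate with much
    lower variance than REINFORCE.  f must accept NumPy arrays.
    """
    rng = np.random.default_rng(seed)
    u = rng.uniform(size=n_samples)
    z1 = (u > sigmoid(-phi)).astype(float)  # antithetic configuration 1
    z2 = (u < sigmoid(phi)).astype(float)   # antithetic configuration 2
    return np.mean((f(z1) - f(z2)) * (u - 0.5))

# sanity check against the exact gradient:
# E[f(z)] = sigmoid(phi) f(1) + (1 - sigmoid(phi)) f(0), so
# d/dphi E[f(z)] = sigmoid(phi) (1 - sigmoid(phi)) (f(1) - f(0))
f = lambda z: (z - 0.3) ** 2
phi = 0.4
exact = sigmoid(phi) * (1 - sigmoid(phi)) * (f(1.0) - f(0.0))
est = arm_gradient(f, phi, n_samples=200_000)  # agrees with `exact` up to Monte Carlo noise
```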
