We introduced the accuracy versus uncertainty calibration (AvUC) loss and proposed novel optimization methods, AvUC and AvU temperature scaling (AvUTS), for improving uncertainty calibration in deep neural networks
Improving model calibration with accuracy versus uncertainty optimization
NeurIPS 2020
Obtaining reliable and accurate quantification of uncertainty estimates from deep neural networks is important in safety-critical applications. A well-calibrated model should be accurate when it is certain about its prediction and indicate high uncertainty when it is likely to be inaccurate. Uncertainty calibration is a challenging problem...
- Probabilistic deep neural networks (DNNs) enable quantification of principled uncertainty estimates, which are essential to understand the model predictions for reliable decision making in safety-critical applications.
- Approximate Bayesian inference methods are promising, but they may fail to provide calibrated uncertainty in between separated regions of observations as they tend to fit an approximation to a local mode and do not capture the complete true posterior [9, 15, 16, 32]
- This may cause the model to be overconfident under distributional shift.
- Existing calibration methods do not explicitly account for the quality of predictive uncertainty estimates, either while training the model or during post-hoc calibration
- We compare the proposed methods with various high-performing non-Bayesian and Bayesian methods including vanilla DNN (Vanilla), Temperature scaling (Temp scaling), Deep-ensembles (Ensemble), Monte Carlo dropout (Dropout), Mean-field stochastic variational inference (SVI) [2, 3], Temperature scaling on SVI (SVI-TS) and Radial Bayesian neural network (Radial BNN)
- In addition to SVI-accuracy versus uncertainty calibration (SVI-AvUC) and SVI-AvU temperature scaling (SVI-AvUTS), we evaluate the AvUC and AvUTS methods applied to the vanilla baseline, with the entropy of the softmax used as the predictive uncertainty in computing the AvUC loss, which is combined with the cross-entropy loss
- We introduced the accuracy versus uncertainty calibration (AvUC) loss and proposed novel optimization methods AvUC and AvUTS for improving uncertainty calibration in deep neural networks
- Uncertainty calibration is important for reliable and informed decision making in safety-critical applications; we envision AvUC as a step towards advancing probabilistic deep neural networks in providing well-calibrated uncertainties along with improved accuracy
- We demonstrated that our method SVI-AvUC provides better model calibration than existing state-of-the-art methods under distributional shift
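The AvUC objective outlined in the bullets above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's reference implementation: it assumes predictive entropy as the uncertainty measure, a fixed uncertainty threshold `u_th`, and tanh-based soft counts of the four accuracy/uncertainty outcomes; the function and parameter names are illustrative.

```python
import numpy as np

def avuc_loss(probs, labels, u_th=0.5, eps=1e-10):
    """Sketch of an AvU-style calibration loss: penalize predictions that are
    accurate-but-uncertain (AU) or inaccurate-but-certain (IC), relative to the
    well-behaved accurate-and-certain (AC) and inaccurate-and-uncertain (IU) cases.

    probs: (N, C) softmax outputs; labels: (N,) true class indices.
    """
    confidence = probs.max(axis=1)                       # p_i = max class probability
    entropy = -(probs * np.log(probs + eps)).sum(axis=1) # predictive uncertainty u_i
    accurate = probs.argmax(axis=1) == labels
    certain = entropy <= u_th
    tanh_u = np.tanh(entropy)
    # Soft (differentiable-style) proxies for the four outcome counts
    n_ac = np.sum(confidence * (1 - tanh_u) * (accurate & certain))
    n_au = np.sum(confidence * tanh_u * (accurate & ~certain))
    n_ic = np.sum((1 - confidence) * (1 - tanh_u) * (~accurate & certain))
    n_iu = np.sum((1 - confidence) * tanh_u * (~accurate & ~certain))
    # Loss is zero when all mass sits in the AC/IU cells and grows with AU/IC mass
    return np.log(1 + (n_au + n_ic) / (n_ac + n_iu + eps))
```

In training, a loss of this form would be added to the cross-entropy term (for the vanilla baseline) or to the ELBO loss (for SVI), as described above.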
- Figure: ECE (%)↓ and UCE (%)↓ at various data shift intensities for the Vanilla, Vanilla-AvUTS and Vanilla-AvUC methods
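The ECE and UCE numbers reported here share the same binned-gap structure: ECE bins by confidence and compares it to accuracy, while UCE bins by (normalized) uncertainty and compares it to the error rate. A minimal sketch, assuming equal-width bins; the function name and bin count are illustrative choices:

```python
import numpy as np

def binned_calibration_error(scores, outcomes, n_bins=15):
    """Generic binned calibration gap.
    ECE: scores = confidence in [0, 1], outcomes = 1 if prediction correct else 0.
    UCE: scores = normalized uncertainty in [0, 1], outcomes = 1 if incorrect else 0.
    Returns the bin-weighted absolute gap between mean score and mean outcome.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    gap = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (scores > lo) & (scores <= hi)
        if in_bin.any():
            # in_bin.mean() is the fraction of samples landing in this bin
            gap += in_bin.mean() * abs(outcomes[in_bin].mean() - scores[in_bin].mean())
    return gap
```

A perfectly calibrated model (e.g. 75% accuracy at 0.75 confidence) yields a gap of zero; overconfidence inflates it.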
- Table 1 shows that AvUTS and AvUC improve the model calibration errors (ECE and UCE) on the vanilla baseline as well.
- Figures 2(d), (e) and (f) show that SVI-AvUC is more uncertain when making inaccurate predictions under distributional shift, compared to other methods.
- Figures 2(g) and (h) show that SVI-AvUC has fewer examples with high confidence when model accuracy is low under distributional shift.
- SVI-AvUC outperforms other methods in providing calibrated confidence and uncertainty measures under distributional shift
- The authors perform a thorough empirical evaluation of the proposed methods SVI-AvUC and SVI-AvUTS on a large-scale image classification task under distributional shift.
- The results for the methods Vanilla, Temp scaling, Ensemble, Dropout, LL Dropout and LL SVI are obtained from the model predictions provided in the UQ benchmark, and the authors follow the same methodology for model evaluation under distributional shift, utilizing 16 different types of image corruptions at 5 different intensity levels for each data shift type, resulting in 80 variations of test data for data shift evaluation.
- The authors provide details of the model implementations and hyperparameters for SVI, SVI-TS, SVI-AvUC, SVI-AvUTS and Radial BNN in Appendix B
- The authors have made the code available to facilitate the probabilistic deep learning community in evaluating and improving model calibration for various other baselines
- Table 1: Additional results evaluating the AvUC and AvUTS methods applied to the Vanilla baseline on CIFAR10. Vanilla-AvUTS and Vanilla-AvUC provide lower ECE and UCE (mean across 16 different data shift types) compared to the baseline
- Table 2: Distributional shift detection using predictive uncertainty. For dataset shift detection on ImageNet and CIFAR10, we use test data shifted with Gaussian blur of intensity 5. SVHN is used as out-of-distribution (OOD) data for OOD detection on the model trained with CIFAR10. All values are in percentages and the best results are indicated in bold. SVI-AvUC outperforms across all the metrics
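Post-hoc temperature scaling, referenced throughout (Temp scaling, SVI-TS, AvUTS), rescales logits by a single scalar T fitted on held-out data; AvUTS differs from standard temperature scaling only in the objective used to select T (the AvU-based loss instead of negative log-likelihood). A minimal sketch, assuming a grid search over T as a stand-in for the optimizer; the grid range and function names are illustrative:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax: divide logits by scalar T before normalizing."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(probs, labels, eps=1e-10):
    """Negative log-likelihood of the true labels (standard temp scaling objective)."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + eps))

def fit_temperature(logits, labels, objective, grid=np.linspace(0.5, 5.0, 46)):
    """Pick the scalar T minimizing `objective` on held-out (logits, labels).
    Passing an AvU-style loss as `objective` instead of NLL gives the AvUTS idea."""
    return min(grid, key=lambda T: objective(softmax(logits, T), labels))
```

Note that T only rescales confidence; the argmax prediction, and hence accuracy, is unchanged.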
- Zoubin Ghahramani. Probabilistic machine learning and artificial intelligence. Nature, 521(7553):452–459, 2015.
- Alex Graves. Practical variational inference for neural networks. In Advances in neural information processing systems, pages 2348–2356, 2011.
- Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In International Conference on Machine Learning, pages 1613–1622, 2015.
- Durk P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583, 2015.
- Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059, 2016.
- Wesley J Maddox, Pavel Izmailov, Timur Garipov, Dmitry P Vetrov, and Andrew Gordon Wilson. A simple baseline for bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, pages 13132–13143, 2019.
- Raanan Yehezkel Rohekar, Yaniv Gurwicz, Shami Nisimov, and Gal Novik. Modeling uncertainty by learning a hierarchy of deep neural connections. In Advances in Neural Information Processing Systems, pages 4246–4256, 2019.
- Sebastian Farquhar, Michael Osborne, and Yarin Gal. Radial bayesian neural networks: Beyond discrete support in large-scale bayesian deep learning. Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, 2020.
- Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in neural information processing systems, pages 6402–6413, 2017.
- Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, pages 7167–7177, 2018.
- Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1321–1330. JMLR. org, 2017.
- Aviral Kumar, Sunita Sarawagi, and Ujjwal Jain. Trainable calibration measures for neural networks from kernel mean embeddings. In International Conference on Machine Learning, pages 2805–2814, 2018.
- Jishnu Mukhoti, Viveka Kulharia, Amartya Sanyal, Stuart Golodetz, Philip HS Torr, and Puneet K Dokania. Calibrating deep neural networks using focal loss. arXiv preprint arXiv:2002.09437, 2020.
- Volodymyr Kuleshov, Nathan Fenner, and Stefano Ermon. Accurate uncertainties for deep learning using calibrated regression. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2796–2804. PMLR, 2018.
- Andrew YK Foong, Yingzhen Li, José Miguel Hernández-Lobato, and Richard E Turner. 'In-between' uncertainty in bayesian neural networks. arXiv preprint arXiv:1906.11537, 2019.
- Jonathan Heek. Well-calibrated bayesian neural networks. University of Cambridge, 2018.
- Ananya Kumar, Percy S Liang, and Tengyu Ma. Verified uncertainty calibration. In Advances in Neural Information Processing Systems, pages 3787–3798, 2019.
- Meelis Kull, Miquel Perello Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, and Peter Flach. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with dirichlet calibration. In Advances in Neural Information Processing Systems, pages 12295–12305, 2019.
- Sunil Thulasidasan, Gopinath Chennupati, Jeff A Bilmes, Tanmoy Bhattacharya, and Sarah Michalak. On mixup training: Improved calibration and predictive uncertainty for deep neural networks. In Advances in Neural Information Processing Systems, pages 13888–13899, 2019.
- Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. Proceedings of the International Conference on Learning Representations, 2019.
- Dan Hendrycks, Norman Mu, Ekin Dogus Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. Augmix: A simple method to improve robustness and uncertainty under data shift. In International Conference on Learning Representations, 2020.
- Jose G Moreno-Torres, Troy Raeder, RocíO Alaiz-RodríGuez, Nitesh V Chawla, and Francisco Herrera. A unifying view on dataset shift in classification. Pattern recognition, 45(1):521–530, 2012.
- Michael A Alcorn, Qi Li, Zhitao Gong, Chengfei Wang, Long Mai, Wei-Shinn Ku, and Anh Nguyen. Strike (with) a pose: Neural networks are easily fooled by strange poses of familiar objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4845–4854, 2019.
- Hermann Blum, Paul-Edouard Sarlin, Juan Nieto, Roland Siegwart, and Cesar Cadena. The fishyscapes benchmark: measuring blind spots in semantic segmentation. arXiv preprint arXiv:1904.03215, 2019.
- Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety. arXiv preprint arXiv:1606.06565, 2016.
- Jasper Snoek, Yaniv Ovadia, Emily Fertig, Balaji Lakshminarayanan, Sebastian Nowozin, D Sculley, Joshua Dillon, Jie Ren, and Zachary Nado. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, pages 13969–13980, 2019.
- Simon Lacoste-Julien, Ferenc Huszár, and Zoubin Ghahramani. Approximate inference for the loss-calibrated bayesian. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 416–424, 2011.
- Adam D Cobb, Stephen J Roberts, and Yarin Gal. Loss-calibrated approximate inference in bayesian neural networks. arXiv preprint arXiv:1805.03901, 2018.
- James O Berger. Statistical Decision Theory and Bayesian Analysis. Springer, 1985.
- Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 681–688, 2011.
- Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient hamiltonian monte carlo. In International conference on machine learning, pages 1683–1691, 2014.
- Lewis Smith and Yarin Gal. Understanding measures of uncertainty for adversarial example detection. arXiv preprint arXiv:1803.08533, 2018.
- Armen Der Kiureghian and Ove Ditlevsen. Aleatory or epistemic? does it matter? Structural safety, 31(2): 105–112, 2009.
- Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in neural information processing systems, pages 5574–5584, 2017.
- Yarin Gal. Uncertainty in deep learning. PhD thesis, University of Cambridge, 2016.
- Claude E Shannon. A mathematical theory of communication. Bell system technical journal, 27(3): 379–423, 1948.
- Linton C Freeman. Elementary applied statistics. John Wiley and Sons, 1965.
- Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745, 2011.
- Jishnu Mukhoti and Yarin Gal. Evaluating bayesian deep learning methods for semantic segmentation. arXiv preprint arXiv:1811.12709, 2018.
- Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
- Max-Heinrich Laves, Sontje Ihler, Karl-Philipp Kortmann, and Tobias Ortmaier. Well-calibrated model uncertainty with temperature scaling for dropout variational inference. arXiv preprint arXiv:1909.13550, 2019.
- Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American statistical Association, 102(477):359–378, 2007.
- Glenn W Brier. Verification of forecasts expressed in terms of probability. Monthly weather review, 78(1): 1–3, 1950.
- Jesse Davis and Mark Goadrich. The relationship between precision-recall and roc curves. In Proceedings of the 23rd international conference on Machine learning, pages 233–240, 2006.
- Takaya Saito and Marc Rehmsmeier. The precision-recall plot is more informative than the roc plot when evaluating binary classifiers on imbalanced datasets. PloS one, 10(3), 2015.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255.
- Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
- Carlos Riquelme, George Tucker, and Jasper Snoek. Deep bayesian bandits showdown: An empirical comparison of bayesian deep networks for thompson sampling. In International Conference on Learning Representations, 2018.
- Mahesh Subedar, Ranganath Krishnan, Paulo Lopez Meyer, Omesh Tickoo, and Jonathan Huang. Uncertainty-aware audiovisual activity recognition using deep bayesian variational inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
- Ranganath Krishnan, Mahesh Subedar, and Omesh Tickoo. Specifying weight priors in bayesian deep neural networks with empirical bayes. Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.
- Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
- Aaditya Ramdas, Nicolás García Trillos, and Marco Cuturi. On wasserstein two-sample testing and related families of nonparametric tests. Entropy, 19(2):47, 2017.
- Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in neural information processing systems, pages 8026–8037, 2019.
- Stephen Kokoska and Daniel Zwillinger. CRC standard probability and statistics tables and formulae. Crc Press, 2000.