Calibration of Pre-trained Transformers

EMNLP 2020, pp. 295–302.

DOI: https://doi.org/10.18653/V1/2020.EMNLP-MAIN.21
TL;DR: We examine the calibration of pre-trained Transformers in both in-domain and out-of-domain settings.

Abstract:

Pre-trained Transformers are now ubiquitous in natural language processing, but despite their high end-task performance, little is known empirically about whether they are calibrated. Specifically, do these models’ posterior probabilities provide an accurate empirical measure of how likely the model is to be correct on a given example?

Introduction
  • Neural networks have seen wide adoption but are frequently criticized for being black boxes, offering little insight as to why predictions are made (Benitez et al, 1997; Dayhoff and DeLeo, 2001; Castelvecchi, 2016) and making it difficult to diagnose errors at test-time.
  • The authors evaluate the calibration of two pre-trained models, BERT (Devlin et al, 2019) and RoBERTa (Liu et al, 2019), on three tasks: natural language inference (Bowman et al, 2015), paraphrase detection (Iyer et al, 2017), and commonsense reasoning (Zellers et al, 2018)
  • These tasks represent standard evaluation settings for pre-trained models, and critically, challenging out-of-domain test datasets are available for each.
  • Such test data allows the authors to measure calibration in more realistic settings where samples come from a dissimilar input distribution, which is exactly the scenario in which a well-calibrated model should avoid making confident yet incorrect predictions.
Highlights
  • Neural networks have seen wide adoption but are frequently criticized for being black boxes, offering little insight as to why predictions are made (Benitez et al, 1997; Dayhoff and DeLeo, 2001; Castelvecchi, 2016) and making it difficult to diagnose errors at test-time
  • We evaluate the calibration of two pre-trained models, BERT (Devlin et al, 2019) and RoBERTa (Liu et al, 2019), on three tasks: natural language inference (Bowman et al, 2015), paraphrase detection (Iyer et al, 2017), and commonsense reasoning (Zellers et al, 2018)
  • In out-of-domain settings, where non-pre-trained models like ESIM (Chen et al, 2017) are over-confident, we find that pre-trained models are significantly better calibrated
  • Posterior calibration is one lens to understand the trustworthiness of model confidence scores
  • We examine the calibration of pre-trained Transformers in both in-domain and out-of-domain settings
Methods
  • Label smoothing: the model (e.g., RoBERTa) is trained to place a 1 − α fraction of probability mass on the gold label and α / (|Y| − 1) on each other label, where α ∈ (0, 1) is a hyperparameter.
  • This re-formulated learning objective does not require changing the model architecture.
  • RoBERTa with temperature-scaled MLE achieves ECE values of 0.7–0.8, implying that MLE training yields scores that are fundamentally sound and only need minor rescaling (see the sketch below)
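To make these two ingredients concrete, here is a minimal PyTorch sketch of the label-smoothed objective described above (1 − α on the gold label, α/(|Y| − 1) on every other label) and of post-hoc temperature scaling. Function names and the toy 3-way setup are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, gold, alpha=0.1):
    """Cross-entropy against a smoothed target: 1 - alpha on the gold label,
    alpha / (|Y| - 1) on each of the remaining labels."""
    num_classes = logits.size(-1)
    target = torch.full_like(logits, alpha / (num_classes - 1))
    target.scatter_(-1, gold.unsqueeze(-1), 1.0 - alpha)
    log_probs = F.log_softmax(logits, dim=-1)
    return -(target * log_probs).sum(dim=-1).mean()

def temperature_scale(logits, temperature):
    """Post-hoc rescaling: softmax(logits / T) keeps the argmax (accuracy)
    unchanged but flattens (T > 1) or sharpens (T < 1) the posterior."""
    return logits / temperature

# Toy usage: batch of 8 examples, 3 labels (e.g., an NLI setup).
logits = torch.randn(8, 3)
gold = torch.randint(0, 3, (8,))
loss = label_smoothing_loss(logits, gold, alpha=0.1)
calibrated = F.softmax(temperature_scale(logits, temperature=1.4), dim=-1)
```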
Results
  • In out-of-domain settings, where non-pre-trained models like ESIM (Chen et al, 2017) are over-confident, the authors find that pre-trained models are significantly better calibrated.
Conclusion
  • Posterior calibration is one lens to understand the trustworthiness of model confidence scores.
  • The authors examine the calibration of pre-trained Transformers in both in-domain and out-of-domain settings.
  • Results show that BERT and RoBERTa coupled with temperature scaling achieve low ECEs in-domain and, when trained with label smoothing, are competitive out-of-domain
Tables
  • Table1: Decomposable Attention (DA) (Parikh et al, 2016) and Enhanced Sequential Inference Model (ESIM) (Chen et al, 2017) use LSTMs and attention on top of GloVe embeddings (Pennington et al, 2014) to model pairwise semantic similarities. In contrast, BERT (Devlin et al, 2019) and RoBERTa (Liu et al, 2019) are large-scale, pre-trained language models with stacked, general-purpose Transformer (Vaswani et al, 2017) layers
  • Table2: Out-of-the-box calibration results for in-domain (SNLI, QQP, SWAG) and out-of-domain (MNLI, TwitterPPDB, HellaSWAG) datasets using the models described in Table 1. We report accuracy and expected calibration error (ECE), both averaged across 5 runs with random restarts
  • Table3: Post-hoc calibration results for BERT and RoBERTa on in-domain (SNLI, QQP, SWAG) and out-of-domain (MNLI, TwitterPPDB, HellaSWAG) datasets. Models are trained with maximum likelihood estimation (MLE) or label smoothing (LS), then their logits are post-processed using temperature scaling (§4.4). We report expected calibration error (ECE) averaged across 5 runs with random restarts. Darker colors imply lower ECE
  • Table4: Learned temperature scaling values for BERT and RoBERTa on in-domain (SNLI, QQP, SWAG) and out-of-domain (MNLI, TwitterPPDB, HellaSWAG) datasets. Values are obtained by line search with a granularity of 0.01. Evaluations are very fast as they only require rescaling cached logits (a line-search sketch follows this list)
  • Table5: Training, development, and test dataset sizes for SNLI (Bowman et al, 2015), MNLI (Williams et al, 2018), QQP (Iyer et al, 2017), TwitterPPDB (Lan et al, 2017), SWAG (Zellers et al, 2018), and HellaSWAG (Zellers et al, 2019)
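Table 4's temperatures come from a line search over cached development-set logits with a granularity of 0.01. The sketch below illustrates one way such a search could look; the search range and the negative log-likelihood criterion are assumptions for illustration, not details confirmed by the table.

```python
import numpy as np

def nll(logits, labels, T):
    """Average negative log-likelihood of gold labels after dividing cached
    logits by temperature T (numerically stable log-softmax)."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def line_search_temperature(logits, labels, lo=0.01, hi=5.0, step=0.01):
    """Grid search over T at a granularity of 0.01; only cached logits are
    rescaled, so each candidate evaluation is cheap."""
    candidates = np.arange(lo, hi + step, step)
    scores = [nll(logits, labels, T) for T in candidates]
    return float(candidates[int(np.argmin(scores))])

# Toy usage with random "cached" development logits for a 3-way task.
rng = np.random.default_rng(0)
dev_logits = rng.normal(size=(200, 3))
dev_labels = rng.integers(0, 3, size=200)
best_T = line_search_temperature(dev_logits, dev_labels)
```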
Funding
  • This work was partially supported by NSF Grant IIS-1814522 and a gift from Arm
  • The authors acknowledge a DURIP equipment grant to UT Austin that provided computational resources to conduct this research
Study subjects and analysis
samples: 100
A model is calibrated if the confidence estimates of its predictions are aligned with empirical likelihoods. For example, if we take 100 samples where a model’s prediction receives posterior probability 0.7, the model should get 70 of the samples correct.
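Expected calibration error (ECE), the metric reported in Tables 2 and 3, formalizes this bucketed comparison: predictions are grouped into confidence bins, and the gap between each bin's average confidence and accuracy is averaged, weighted by bin size. A minimal NumPy sketch, assuming 10 equal-width bins (the exact bin count is not stated here):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin predictions by confidence, then average |accuracy - confidence|
    over bins, weighted by the fraction of examples falling in each bin."""
    confidences = probs.max(axis=-1)
    correct = (probs.argmax(axis=-1) == labels).astype(float)

    ece = 0.0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# Toy usage: 100 samples, 3-way classification.
rng = np.random.default_rng(0)
raw = rng.random((100, 3))
probs = raw / raw.sum(axis=-1, keepdims=True)
labels = rng.integers(0, 3, size=100)
print(expected_calibration_error(probs, labels))
```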

Reference
  • Jose M. Benitez, Juan Luis Castro, and I. Requena. 1997. Are Artificial Neural Networks Black Boxes? IEEE Transactions on Neural Networks and Learning Systems, 8(5):1156–1164.
  • Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A Large Annotated Corpus for Learning Natural Language Inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.
  • Glenn W. Brier. 1950. Verification of Forecasts Expressed in Terms of Probability. Monthly Weather Review, 78(1):1–3.
  • Davide Castelvecchi. 2016. Can We Open the Black Box of AI? Nature News, 538(7623):20.
  • Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for Natural Language Inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657–1668, Vancouver, Canada. Association for Computational Linguistics.
  • Xilun Chen, Yu Sun, Ben Athiwaratkun, Claire Cardie, and Kilian Weinberger. 2018. Adversarial Deep Averaging Networks for Cross-Lingual Sentiment Classification. Transactions of the Association for Computational Linguistics, 6:557–570.
  • Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What Does BERT Look at? An Analysis of BERT’s Attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.
  • Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pretraining Text Encoders as Discriminators Rather Than Generators. In International Conference on Learning Representations.
  • Judith E. Dayhoff and James M. DeLeo. 2001. Artificial Neural Networks: Opening the Black Box. Cancer: Interdisciplinary International Journal of the American Cancer Society, 91(S8):1615–1635.
  • Shrey Desai, Hongyuan Zhan, and Ahmed Aly. 2019. Evaluating Lottery Tickets Under Distributional Shifts. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), pages 153–162, Hong Kong, China. Association for Computational Linguistics.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A Deep Semantic Natural Language Processing Platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1–6, Melbourne, Australia. Association for Computational Linguistics.
  • Tilmann Gneiting, Fadoua Balabdaoui, and Adrian E. Raftery. 2007. Probabilistic Forecasts, Calibration and Sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2):243–268.
  • Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1321–1330, International Convention Centre, Sydney, Australia. PMLR.
  • Dan Hendrycks and Kevin Gimpel. 2016. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. In International Conference on Learning Representations.
  • Shankar Iyer, Nikhil Dandekar, and Kornél Csernai. 2017. Quora Question Pairs.
  • Xiaoqian Jiang, Melanie Osl, Jihoon Kim, and Lucila Ohno-Machado. 2012. Calibrating Predictive Model Estimates to Support Personalized Medicine. In JAMIA.
  • Alex Kendall and Yarin Gal. 2017. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5574–5584. Curran Associates, Inc.
  • Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the Dark Secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4365–4374, Hong Kong, China. Association for Computational Linguistics.
  • Aviral Kumar and Sunita Sarawagi. 2019. Calibration of Encoder Decoder Models for Neural Machine Translation. In Proceedings of the ICLR 2019 Debugging Machine Learning Models Workshop.
  • Wuwei Lan, Siyu Qiu, Hua He, and Wei Xu. 2017. A Continuously Growing Dataset of Sentential Paraphrases. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1224–1234, Copenhagen, Denmark. Association for Computational Linguistics.
  • Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In International Conference on Learning Representations.
  • Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. 2018. Training Confidence-calibrated Classifiers for Detecting Out-of-Distribution Samples. In International Conference on Learning Representations.
  • Shiyu Liang, Yixuan Li, and R. Srikant. 2018. Enhancing the Reliability of Out-of-distribution Image Detection in Neural Networks. In International Conference on Learning Representations.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.
  • Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In International Conference on Learning Representations.
  • David J. Miller, Ajit V. Rao, Kenneth Rose, and Allen Gersho. 1996. A Global Optimization Technique for Statistical Classifier Design. IEEE Transactions on Signal Processing, 44:3108–3122.
  • Timothy Miller. 2019. Simplified Neural Unsupervised Domain Adaptation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 414–419, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Khanh Nguyen and Brendan O’Connor. 2015. Posterior Calibration and Exploratory Analysis for Natural Language Processing Models. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1587–1598, Lisbon, Portugal. Association for Computational Linguistics.
  • Tim Palmer, Francisco Doblas-Reyes, Antje Weisheimer, and Mark Rodwell. 2008. Toward Seamless Prediction: Calibration of Climate Change Projections using Seasonal Forecasts. Bulletin of the American Meteorological Society, 89(4):459–470.
  • Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A Decomposable Attention Model for Natural Language Inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2249–2255, Austin, Texas. Association for Computational Linguistics.
  • Minlong Peng, Qi Zhang, Yu-gang Jiang, and Xuanjing Huang. 2018. Cross-Domain Sentiment Classification with Target Domain Specific Information. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2505–2513, Melbourne, Australia. Association for Computational Linguistics.
  • Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
  • Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. 2017. Regularizing Neural Networks by Penalizing Confident Output Distributions. arXiv preprint arXiv:1701.06548.
  • Adrian E. Raftery, Tilmann Gneiting, Fadoua Balabdaoui, and Michael Polakowski. 2005. Using Bayesian Model Averaging to Calibrate Forecast Ensembles. Monthly Weather Review, 133(5):1155–1174.
  • Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to Fine-Tune BERT for Text Classification? arXiv preprint arXiv:1905.05583.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
  • Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 3266–3280. Curran Associates, Inc.
  • Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.
  • Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace’s Transformers: State-of-the-art Natural Language Processing. arXiv preprint arXiv:1910.03771.
  • Huiqin Yang and Carl Thompson. 2010. Nurses’ Risk Assessment Judgements: A Confidence Calibration Study. Journal of Advanced Nursing, 66(12):2751–2760.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 5753–5763. Curran Associates, Inc.
  • Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 93–104, Brussels, Belgium. Association for Computational Linguistics.
  • Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a Machine Really Finish Your Sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy. Association for Computational Linguistics.
  • For non-pre-trained model baselines, we chiefly use the open-source implementations of DA (Parikh et al., 2016) and ESIM (Chen et al., 2017) in AllenNLP (Gardner et al., 2018). For SWAG/HellaSWAG specifically, we run the baselines available in the authors’ code.4 For BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), we use bert-base-uncased and roberta-base, respectively, from HuggingFace Transformers (Wolf et al., 2019). BERT is fine-tuned with a maximum of 3 epochs, batch size of 16, learning rate of 2e-5, gradient clip of 1.0, and no weight decay. Similarly, RoBERTa is fine-tuned with a maximum of 3 epochs, batch size of 32, learning rate of 1e-5, gradient clip of 1.0, and weight decay of 0.1. Both models are optimized with AdamW (Loshchilov and Hutter, 2019). Other than early stopping on the development set, we do not perform additional hyperparameter searches. Finally, all experiments are conducted on NVIDIA V100 32GB GPUs.
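As a rough illustration of the BERT fine-tuning configuration above (3 epochs, batch size 16, learning rate 2e-5, gradient clip 1.0, no weight decay, AdamW), here is a schematic HuggingFace/PyTorch sketch; the data handling and training loop are simplified assumptions rather than the authors' actual script.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)  # e.g., 3 labels for SNLI/MNLI
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Reported BERT hyperparameters: learning rate 2e-5, no weight decay, AdamW.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.0)

def train_one_epoch(batches):
    """Schematic epoch; `batches` is assumed to yield dicts of tokenized
    tensors (input_ids, attention_mask, ...) plus a `labels` key."""
    model.train()
    for batch in batches:
        optimizer.zero_grad()
        loss = model(**batch).loss
        loss.backward()
        # Gradient clipping at 1.0, as stated above.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
```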
  • Reliability diagrams (Nguyen and O’Connor, 2015; Guo et al., 2017) visualize the alignment between posterior probabilities (confidence) and empirical outcomes (accuracy), where a perfectly calibrated model has conf(k) = acc(k) for each bucket of real-valued predictions k (§3). We show several reliability diagrams, each under different configurations, in Figures 1, 2, and 3.
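A reliability diagram is simply a plot of per-bin accuracy against confidence, with the diagonal conf(k) = acc(k) marking perfect calibration. A minimal matplotlib sketch, again assuming 10 equal-width bins:

```python
import numpy as np
import matplotlib.pyplot as plt

def reliability_diagram(probs, labels, n_bins=10):
    """Bar chart of accuracy per confidence bin, with the y = x diagonal
    indicating perfect calibration."""
    confidences = probs.max(axis=-1)
    correct = (probs.argmax(axis=-1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    centers = (edges[:-1] + edges[1:]) / 2
    accs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        accs.append(correct[in_bin].mean() if in_bin.any() else 0.0)
    plt.bar(centers, accs, width=1.0 / n_bins, edgecolor="black", label="accuracy")
    plt.plot([0, 1], [0, 1], linestyle="--", color="gray", label="perfect calibration")
    plt.xlabel("confidence")
    plt.ylabel("accuracy")
    plt.legend()
    plt.show()
```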