Ensemble Distillation for Structured Prediction: Calibrated, Accurate, Fast—Choose Three

EMNLP 2020, pp. 5583–5595.

Abstract

Modern neural networks do not always produce well-calibrated predictions, even when trained with a proper scoring function such as cross-entropy. In classification settings, simple methods such as isotonic regression or temperature scaling may be used in conjunction with a held-out dataset to calibrate model outputs. However, extending th…

Introduction
  • For a well-calibrated model, an event forecast with confidence p occurs in held-out data with probability p (see the ECE sketch after this list).
  • An alternative approach is model ensembling, which is closely related to approximating the intractable posterior distribution over model parameters (Lakshminarayanan et al., 2017; Pearce et al., 2018; Dusenberry et al., 2020).
  • Although ensembling is computationally expensive at both training and inference time, it does not require a separate calibration set.
  • Ensembles have been found to be competitive with, or even to outperform, other calibration methods in more challenging settings such as dataset shift (Snoek et al., 2019).
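To make the calibration criterion above concrete, here is a minimal sketch (not the authors' code) of how Expected Calibration Error can be estimated on held-out predictions by binning forecast confidences; the array names are hypothetical.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence and average |accuracy - confidence|,
    weighted by the fraction of predictions falling in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# A forecaster whose events occur with probability equal to its stated confidence
# (i.e., a calibrated one) should yield an ECE near zero.
rng = np.random.default_rng(0)
conf = rng.random(10_000)
outcomes = (rng.random(10_000) < conf).astype(float)
print(expected_calibration_error(conf, outcomes))
```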
Highlights
  • For a calibrated model, an event with a forecast confidence p occurs in held-out data with probability p.
  • We investigate methods to produce effective ensembles in structured prediction settings, finding that small numbers of independent models initialized from different random seeds outperform an alternative based on single optimization trajectories (§6.1).
  • We report the results of single models, ensembles, and distilled ensembles on F1, BS−, BS+, Balanced Brier Score (B-BS), and Balanced Expected Calibration Error (B-ECE) in Table 1; a sketch of the standard Brier score follows this list.
  • We find that individual models trained with label smoothing have slightly better BLEU scores and calibration than those trained without, consistent with the findings of Muller et al. (2019), who attribute this improvement to reduced overconfidence.
  • Our key finding is that ensemble distillation may be used to produce a single model that preserves much of the improved calibration and performance of the ensemble while being as efficient as single models at inference time.
  • We show that calibration of the single student models can be further improved by other, orthogonal, re-calibration methods.
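For reference, the standard multi-class Brier score underlying the BS−, BS+, and B-BS numbers above can be sketched as follows; the balanced and class-restricted variants used in the paper re-weight this quantity per class and are not reproduced here.

```python
import numpy as np

def brier_score(probs, labels):
    """Mean squared error between predicted class distributions and one-hot targets.
    probs: (N, K) predicted probabilities; labels: (N,) integer class ids."""
    probs = np.asarray(probs, dtype=float)
    onehot = np.eye(probs.shape[1])[np.asarray(labels)]
    return float(((probs - onehot) ** 2).sum(axis=1).mean())

# A confident correct prediction contributes ~0; a confident error contributes close to 2.
print(brier_score([[0.9, 0.1], [0.2, 0.8]], [0, 0]))
```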
Methods
  • Because the authors found that label smoothing significantly hurts ensemble calibration (Table 2), the distillation experiments consider only the CE-SWA ensembles as teachers.
  • The negative log-likelihood loss, weighted by 1 − β, is identical to that used for the other models and uses label smoothing with λ = 0.1.
  • All results use a weight of β = 0.5 on the distillation objective and a random initialization of the model parameters, which preliminary experiments suggested was optimal (a sketch of this objective follows this list).
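A minimal PyTorch sketch of the interpolated objective described above, under the stated settings (β = 0.5, label-smoothed NLL with λ = 0.1); the function and variable names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs, targets, beta=0.5, smoothing=0.1):
    """student_logits: (N, V) unnormalized student scores.
    teacher_probs:  (N, V) uniformly averaged ensemble distributions.
    targets:        (N,)   gold token ids."""
    log_p = F.log_softmax(student_logits, dim=-1)
    # Distillation term: cross-entropy of the student against the teacher distribution.
    distill = -(teacher_probs * log_p).sum(dim=-1).mean()
    # Supervised term: label-smoothed negative log-likelihood, weighted by 1 - beta.
    nll = F.cross_entropy(student_logits, targets, label_smoothing=smoothing)
    return beta * distill + (1.0 - beta) * nll

# Example with random tensors standing in for a batch of token positions.
logits = torch.randn(8, 1000)
teacher = torch.softmax(torch.randn(8, 1000), dim=-1)
gold = torch.randint(0, 1000, (8,))
print(distillation_loss(logits, teacher, gold))
```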
Results
  • The authors report the results of single models, ensembles, and distilled ensembles on F1, BS−, BS+, B-BS, and B-ECE in Table 1.
  • The authors find that individual models trained with label smoothing have slightly better BLEU scores and calibration than those trained without, consistent with the findings of Muller et al. (2019), who attribute this improvement to reduced overconfidence.
  • The authors hypothesize that penalizing overconfidence is effective for improving the calibration of a single model, but that it results in overcorrection when the penalized models are ensembled together.
  • This is supported by the reliability plots in Figure 2: the individual LS models are underconfident in their top predictions, and this underconfidence is compounded by ensembling, whereas non-LS individual models are slightly overconfident in their top predictions, which is corrected by ensembling.
Conclusion
  • The authors present a study of the ensembling and distillation of structured prediction models, which consistently improve calibration and performance relative to single models.
  • The authors' key finding is that ensemble distillation may be used to produce a single model that preserves much of the improved calibration and performance of the ensemble while being as efficient as single models at inference time.
  • The authors show that calibration of the single student models can be further improved by other, orthogonal, re-calibration methods such as temperature scaling (a minimal sketch follows this list).
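Temperature scaling (used in Table 5) is one such re-calibration method; the sketch below, with illustrative names, fits a single temperature on held-out logits by minimizing NLL and applies it at inference.

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, steps=200, lr=0.05):
    """Learn a scalar temperature T > 0 on a held-out set; only T is optimized."""
    log_t = torch.zeros(1, requires_grad=True)   # T = exp(log_t) stays positive
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()

# Fit on (hypothetical) held-out logits, then divide test-time logits by T before softmax.
val_logits, val_labels = torch.randn(512, 9), torch.randint(0, 9, (512,))
T = fit_temperature(val_logits, val_labels)
calibrated = torch.softmax(torch.randn(4, 9) / T, dim=-1)
print(T, calibrated.sum(dim=-1))
```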
Summary
  • Introduction:

    For a well-calibrated model, an event forecast with confidence p occurs in held-out data with probability p.
  • An alternative approach is model ensembling, which is closely related to approximating the intractable posterior distribution over model parameters (Lakshminarayanan et al., 2017; Pearce et al., 2018; Dusenberry et al., 2020).
  • Although ensembling is computationally expensive at both training and inference time, it does not require a separate calibration set.
  • Ensembles have been found to be competitive with, or even to outperform, other calibration methods in more challenging settings such as dataset shift (Snoek et al., 2019).
  • Objectives:

    The authors' objective is to compute the predictive uncertainty of p_θ over a finite sample of held-out data, {(X^(i), Y^(i))}_{i=1}^N, of size N.
  • As the goal is to improve model calibration, which captures both types of uncertainty, the authors follow previous methods of ensemble distillation that collapse the ensemble distribution into a point estimate by uniformly averaging the distributions of each teacher (see the sketch after this summary).
  • Methods:

    Because the authors found that label smoothing significantly hurts ensemble calibration (Table 2), the distillation experiments consider only the CE-SWA ensembles as teachers.
  • The negative log-likelihood loss, weighted by 1 − β, is identical to that used for the other models and uses label smoothing with λ = 0.1.
  • All results use a weight of β = 0.5 on the distillation objective and a random initialization of the model parameters, which preliminary experiments suggested was optimal.
  • Results:

    The authors report the results of single models, ensembles, and distilled ensembles on F1, BS−, BS+, B-BS, and B-ECE in Table 1.
  • The authors find that individual models trained with label smoothing have slightly better BLEU scores and calibration than those trained without, consistent with the findings of Muller et al. (2019), who attribute this improvement to reduced overconfidence.
  • The authors hypothesize that penalizing overconfidence is effective for improving the calibration of a single model, but that it results in overcorrection when the penalized models are ensembled together.
  • This is supported by the reliability plots in Figure 2: the individual LS models are underconfident in their top predictions, and this underconfidence is compounded by ensembling, whereas non-LS individual models are slightly overconfident in their top predictions, which is corrected by ensembling.
  • Conclusion:

    The authors present a study of the ensembling and distillation of structured prediction models, which consistently improve calibration and performance relative to single models.
  • The authors' key finding is that ensemble distillation may be used to produce a single model that preserves much of the improved calibration and performance of the ensemble while being as efficient as single models at inference time.
  • The authors show that calibration of the single student models can be further improved by other, orthogonal, re-calibration methods.
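As noted in the Objectives bullet above, the ensemble predictive distribution is collapsed into a point estimate by uniformly averaging the teachers' distributions; a minimal sketch (with illustrative shapes and names) follows. Its output is what the distillation objective sketched earlier would consume as the teacher distribution.

```python
import torch

def average_ensemble_probs(member_logits):
    """member_logits: list of (N, V) logit tensors, one per ensemble member.
    Returns the uniformly averaged predictive distribution of shape (N, V)."""
    probs = [torch.softmax(logits, dim=-1) for logits in member_logits]
    return torch.stack(probs, dim=0).mean(dim=0)

# Three ensemble members over a vocabulary of 100 symbols; rows of the result sum to 1.
members = [torch.randn(8, 100) for _ in range(3)]
teacher_probs = average_ensemble_probs(members)
print(teacher_probs.shape, teacher_probs.sum(dim=-1))
```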
Tables
  • Table1: Ensemble and Ensemble-Distillation results on CoNLL NER. All values are percentages. Bold results represent the best result of each model type (IID or CRF) for each metric. Note that ensembles have higher F1 and are better calibrated than individual models. Furthermore, the distilled ensemble also significantly outperforms single models in all metrics. Surprisingly, distilling token-level CRF distributions can boost student models past the ensemble's abilities. Dev results for these experiments are in Appendix D
  • Table2: Performance of Transformer-Base ensembles and individual models on the WMT16 English → German (a) and German → English (b) tasks. ECE values are given as percentages. LS+SWA and CE+SWA indicate models trained with and without label smoothing, respectively. Additionally, we report the performance of distilling CE-SWA ensembles into a single student model (see §5.2 for details). Similar to our NER results, we find that distillation is able to retain much of the benefits of an ensemble, both in terms of performance and calibration, over individual models. The best single-model performance is in bold
  • Table3: Distillation performance for De→En as the truncation V is varied (a top-V truncation sketch follows this list). An ensemble of 7 models is used as the teacher
  • Table4: Single-run ensemble performance for NMT. We include the performance of the 3-model ensemble and the average individual model performance. We find that single-run ensembles have better calibration than single models, but do not see the same performance gains that true ensembles do
  • Table5: CoNLL-2003 German IID results for individual models, 9-ensembles, and distilled 9-ensembles with and without temperature scaling (TS). We find that we can utilize temperature scaling in all cases to boost calibration, but temperature scaling only helps overall performance when used in combination with distillation
  • Table6: Additional results for the WMT14 English → German task
  • Table7: Results for experiments on the CoNLL-2003 German dataset in which label smoothing was used. All models have the IID architecture. Where applicable, the label smoothing factor α = 0.1
  • Table8: Dev set results for the models reported in Table 1
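Table 3 varies a truncation V for the teacher distribution. The sketch below shows one common way to realize this, keeping only each position's top-V teacher probabilities and renormalizing before distillation; this is an assumption for illustration, and the paper's exact procedure may differ.

```python
import torch

def truncate_teacher(teacher_probs, V=8):
    """Zero out all but the V most probable tokens per position, then renormalize."""
    top_p, top_idx = teacher_probs.topk(V, dim=-1)
    truncated = torch.zeros_like(teacher_probs).scatter_(-1, top_idx, top_p)
    return truncated / truncated.sum(dim=-1, keepdim=True)

# Truncating a (batch, vocab) teacher distribution keeps it a valid distribution.
teacher = torch.softmax(torch.randn(4, 32000), dim=-1)
print(truncate_teacher(teacher, V=8).sum(dim=-1))
```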
Funding
  • This work was supported, in part, by the Human Language Technology Center of Excellence at Johns Hopkins University
Reference
  • Riccardo Benedetti. 2010. Scoring rules for forecast verification. Monthly Weather Review, 138(1):203–211.
  • Glenn W. Brier. 1950. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3.
  • Shrey Desai and Greg Durrett. 2020. Calibration of pre-trained transformers. arXiv preprint arXiv:2003.07892.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Michael W. Dusenberry, Ghassen Jerfel, Yeming Wen, Yian Ma, Jasper Snoek, Katherine Heller, Balaji Lakshminarayanan, and Dustin Tran. 2020. Efficient and scalable Bayesian neural nets with rank-1 factors. In International Conference on Machine Learning (ICML).
  • Erik Englesson and Hossein Azizpour. 2019. Efficient evaluation-time uncertainty estimation by improved distillation. In ICML 2019 Workshop on Uncertainty and Robustness in Deep Learning.
  • Peter Flach and Meelis Kull. 2015. Precision-recall-gain curves: PR analysis done right. In Advances in Neural Information Processing Systems, pages 838–846.
  • Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. 2019. Mask-predict: Parallel decoding of conditional masked language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6114–6123.
  • Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, and Richard Socher. 2017. Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281.
  • Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1321–1330. JMLR.org.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), pages 1026–1034, USA. IEEE Computer Society.
  • Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  • Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. 2018. Averaging weights leads to wider optima and better generalization. In Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence (UAI 2018), pages 876–885. AUAI.
  • Xiaoqian Jiang, Melanie Osl, Jihoon Kim, and Lucila Ohno-Machado. 2012. Calibrating predictive model estimates to support personalized medicine. Journal of the American Medical Informatics Association (JAMIA), 19:263–274.
  • Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling.
  • Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327, Austin, Texas. Association for Computational Linguistics.
  • Anoop Korattikara, Vivek Rathod, Kevin Murphy, and Max Welling. 2015. Bayesian dark knowledge. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Volume 2 (NIPS'15), pages 3438–3446, Cambridge, MA, USA. MIT Press.
  • Aviral Kumar and Sunita Sarawagi. 2019. Calibration of encoder decoder models for neural machine translation. arXiv preprint arXiv:1903.00802.
  • Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. 2017. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413.
  • Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270, San Diego, California. Association for Computational Linguistics.
  • Zhizhong Li and Derek Hoiem. 2019. Reducing overconfident errors outside the known distribution. In ICLR.
  • Ilya Loshchilov and Frank Hutter. 2016. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.
  • Andrey Malinin and Mark Gales. 2018. Predictive uncertainty estimation via prior networks. In Advances in Neural Information Processing Systems 31, pages 7047–7058. Curran Associates, Inc.
  • Andrey Malinin, Bruno Mlodozeniec, and Mark Gales. 2020. Ensemble distribution distillation. In International Conference on Learning Representations.
  • Rafael Müller, Simon Kornblith, and Geoffrey E. Hinton. 2019. When does label smoothing help? In Advances in Neural Information Processing Systems 32, pages 4694–4703. Curran Associates, Inc.
  • Mahdi Pakdaman Naeini, Gregory F. Cooper, and Milos Hauskrecht. 2015. Obtaining well calibrated probabilities using Bayesian binning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI'15), pages 2901–2907. AAAI Press.
  • Khanh Nguyen and Brendan O'Connor. 2015. Posterior calibration and exploratory analysis for natural language processing models. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1587–1598, Lisbon, Portugal. Association for Computational Linguistics.
  • Myle Ott, Michael Auli, David Grangier, and Marc'Aurelio Ranzato. 2018a. Analyzing uncertainty in neural machine translation. In ICML.
  • Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018b. Scaling neural machine translation. arXiv preprint arXiv:1806.00187.
  • Tim Pearce, Mohamed Zaki, Alexandra Brintrup, Nicolas Anastassacos, and Andy Neely. 2018. Uncertainty in neural networks: Bayesian ensembling. arXiv preprint arXiv:1810.05546.
  • John C. Platt. 1999. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74. MIT Press.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
  • Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition.
  • Jasper Snoek, Yaniv Ovadia, Emily Fertig, Balaji Lakshminarayanan, Sebastian Nowozin, D. Sculley, Joshua Dillon, Jie Ren, and Zachary Nado. 2019. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, pages 13969–13980.
  • Mitchell Stern, William Chan, Jamie Kiros, and Jakob Uszkoreit. 2019. Insertion transformer: Flexible sequence generation via insertion operations. arXiv preprint arXiv:1902.03249.
  • C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. 2016. Rethinking the Inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826.
  • Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
  • Byron C. Wallace and Issa J. Dahabreh. 2014. Improving class probability estimates for imbalanced data. Knowledge and Information Systems, 41(1):33–52.
  • R. J. Williams and D. Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280.
  • Bianca Zadrozny and Charles Elkan. 2001. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01), pages 609–616, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
Author
Steven Reich
David Mueller
Nicholas Andrews