Intra Order-preserving Functions for Calibration of Multi-Class Neural Networks

NeurIPS 2020.


Abstract:

Predicting calibrated confidence scores for multi-class deep networks is important for avoiding rare but costly mistakes. A common approach is to learn a post-hoc calibration function that transforms the output of the original network into calibrated confidence scores while maintaining the network's accuracy. However, previous post-hoc calibration…

Introduction
  • Deep neural networks have demonstrated impressive accuracy in classification tasks, such as image recognition (He et al, 2016; Ren et al, 2015) and medical research (Jiang et al, 2012; Caruana et al, 2015)
  • These exciting results have recently motivated engineers to adopt deep networks as default components in building decision systems; for example, a multi-class neural network can be treated as a probabilistic predictor and its softmax output can provide the confidence scores of different actions for the downstream decision making pipeline (Girshick, 2015; Cao et al, 2017; Mozafari et al, 2019).
Highlights
  • Deep neural networks have demonstrated impressive accuracy in classification tasks, such as image recognition (He et al, 2016; Ren et al, 2015) and medical research (Jiang et al, 2012; Caruana et al, 2015)
  • These results have motivated engineers to adopt deep networks as default components in building decision systems; for example, a multi-class neural network can be treated as a probabilistic predictor whose softmax output provides the confidence scores of different actions for the downstream decision-making pipeline (Girshick, 2015; Cao et al, 2017; Mozafari et al, 2019)
  • We identify necessary and sufficient conditions for describing intra order-preserving functions, and propose a novel neural network architecture that can represent complex intra order-preserving functions through common neural network components
  • We introduce the family of intra order-preserving functions, which retain the top-k predictions of any deep network when used as the post-hoc calibration function (a numerical check of this property is sketched after this list)
  • We propose a new neural network architecture to represent these functions, and new regularization techniques based on order-invariant and diagonal structures
  • Our method outperforms state-of-the-art post-hoc calibration methods, namely temperature scaling and Dirichlet calibration, in multiple settings
  • The experimental results show the importance of learning within the intra order-preserving family and support the effectiveness of the proposed regularization in calibrating multiple classifiers on various datasets
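The property named in the highlights can be stated operationally: a post-hoc map f is intra order-preserving when, for each logit vector, the ranking of the outputs matches the ranking of the inputs, so the argmax and, more generally, the top-k predictions are untouched. The following is a minimal NumPy sketch of that check (not the authors' code; `f` stands for any candidate calibration map):

```python
import numpy as np

def is_intra_order_preserving(f, logits):
    """Empirically check that f keeps the within-vector ordering of every row
    of `logits`, and hence the top-k predictions of the original network."""
    out = f(logits)
    return np.all(np.argsort(-logits, axis=1) == np.argsort(-out, axis=1))

rng = np.random.default_rng(0)
z = rng.normal(size=(1000, 10))  # stand-in logits for a 10-class problem

# Temperature scaling (z -> z / T with T > 0) never changes the ordering ...
print(is_intra_order_preserving(lambda x: x / 1.5, z))   # True
# ... whereas a generic linear map (as in matrix/Dirichlet scaling) may.
W = rng.normal(size=(10, 10))
print(is_intra_order_preserving(lambda x: x @ W, z))     # typically False
```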
Results
  • Note that learning intra order-preserving functions without the order-invariant and diagonal assumptions (i.e., OP) does not perform well
  • This result highlights that the intra order-preserving family alone can still be too general, and that proper additional regularization needs to be imposed on it.
  • Although Dir-ODIR and MS-ODIR were able to maintain the accuracy of the original models on these datasets (Kull et al, 2019), there is no guarantee that a linear transformation maintains the accuracy in general
  • This becomes especially difficult as the number of classes grows, as the authors explore in the ImageNet experiment
Conclusion
  • The authors introduce the family of intra order-preserving functions, which retain the top-k predictions of any deep network when used as the post-hoc calibration function.
  • The authors propose a new neural network architecture to represent these functions, along with new regularization techniques based on order-invariant and diagonal structures.
  • Calibrating neural networks with this new family of functions generalizes many existing calibration techniques, while adding flexibility to express the post-hoc calibration function.
  • The experimental results show the importance of learning within the intra order-preserving family and support the effectiveness of the proposed regularization in calibrating multiple classifiers on various datasets
Summary
  • Introduction:

    Deep neural networks have demonstrated impressive accuracy in classification tasks, such as image recognition (He et al, 2016; Ren et al, 2015) and medical research (Jiang et al, 2012; Caruana et al, 2015)
  • These exciting results have recently motivated engineers to adopt deep networks as default components in building decision systems; for example, a multi-class neural network can be treated as a probabilistic predictor and its softmax output can provide the confidence scores of different actions for the downstream decision making pipeline (Girshick, 2015; Cao et al, 2017; Mozafari et al, 2019).
  • Objectives:

    The authors aim to learn general post-hoc calibration functions that can preserve the top-k predictions of any deep network.
  • Given the original probabilistic predictor φo = sm ◦ g, where g maps an input to the network's logits and sm denotes the softmax, and a calibration set Dc, the goal is to learn a post-hoc calibration function f : R^n → R^n such that the new predictor φ := sm ◦ f ◦ g is better calibrated while keeping the accuracy of the original network φo (a minimal illustration, using temperature scaling as the simplest choice of f, follows this summary)
  • Results:

    Note that learning intra order-preserving functions without the order-invariant and diagonal assumptions (i.e., OP) does not perform well
  • This result highlights that the intra order-preserving family alone can still be too general, and that proper additional regularization needs to be imposed on it.
  • Although Dir-ODIR and MS-ODIR were able to maintain the accuracy of the original models on these datasets (Kull et al, 2019), there is no guarantee that a linear transformation maintains the accuracy in general
  • This becomes especially difficult as the number of classes grows, as the authors explore in the ImageNet experiment
  • Conclusion:

    The authors introduce the family of intra order-preserving functions, which retain the top-k predictions of any deep network when used as the post-hoc calibration function.
  • The authors propose a new neural network architecture to represent these functions, along with new regularization techniques based on order-invariant and diagonal structures.
  • Calibrating neural networks with this new family of functions generalizes many existing calibration techniques, while adding flexibility to express the post-hoc calibration function.
  • The experimental results show the importance of learning within the intra order-preserving family and support the effectiveness of the proposed regularization in calibrating multiple classifiers on various datasets
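As a concrete (and deliberately simple) instance of the objective stated above, the sketch below applies a post-hoc calibration function between the frozen network's logits g(x) and the softmax, with temperature scaling playing the role of f. This is a PyTorch illustration that assumes validation logits and labels from the calibration set Dc are already available; it is not the authors' architecture:

```python
import torch
import torch.nn.functional as F

def calibrated_probs(logits, f):
    """phi = sm(f(g(x))): apply the post-hoc map f to the logits of the
    frozen original network, then renormalize with the softmax sm."""
    return F.softmax(f(logits), dim=1)

# Temperature scaling, f(z) = z / T with T > 0, is the simplest member of the
# intra order-preserving family; T is fit on the calibration set by minimizing
# the negative log-likelihood while the original network stays untouched.
def fit_temperature(val_logits, val_labels, steps=100):
    log_T = torch.zeros(1, requires_grad=True)          # T = exp(log_T) > 0
    opt = torch.optim.LBFGS([log_T], lr=0.1, max_iter=steps)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(val_logits / log_T.exp(), val_labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_T.exp().item()
```

Because dividing the logits by a positive scalar never reorders them, accuracy is preserved by construction; the paper's contribution is to allow far more expressive choices of f with the same guarantee.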
Tables
  • Table1: Statistics of the Evaluation Datasets
  • Table2: ECE (with M = 15 bins) on various image classification datasets and models with different calibration methods. The subscript numbers give the rank of the corresponding method on the given model/dataset. The accuracy of the uncalibrated model is shown in parentheses. For the Dir-ODIR and MS-ODIR methods, the numbers in parentheses show the relative change in accuracy for each method (a sketch of the binned ECE computation follows this list)
  • Table3: Calibration results for pretrained ResNet 152 on ImageNet
  • Table4: Calibration results for pretrained DenseNet 161 on ImageNet
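For reference, the ECE values reported in Tables 2-4 follow the standard binned estimator: predictions are bucketed into M equal-width confidence bins, and the gaps between average confidence and accuracy are averaged with weights proportional to bin mass. A minimal NumPy sketch with M = 15, as used in Table 2:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Binned ECE: sum over bins of |accuracy - confidence|, weighted by the
    fraction of samples falling in each bin."""
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    correct = (predictions == labels).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece
```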
Related work
  • Many different post-hoc calibration methods have been studied in the literature (Platt et al, 1999; Guo et al, 2017; Kull et al, 2019; 2017a;b). Their main difference lies in the parametric family of the calibration function. In Platt scaling (Platt et al, 1999), a scale and a shift parameter a, b ∈ R are learned to map the classifier's output score to a calibrated probability. (A figure contrasting calibration families by whether they are order-invariant appears here in the original.)
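For the binary case, Platt scaling amounts to fitting just the two scalars a and b so that sigmoid(a·z + b) matches the empirical label frequencies. A minimal SciPy sketch (illustrative only; fitted here by plain NLL rather than Platt's original regularized targets):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid

def fit_platt(scores, labels):
    """Fit scale a and shift b so that sigmoid(a * score + b) is calibrated,
    by minimizing the negative log-likelihood on held-out scores/labels."""
    def nll(params):
        a, b = params
        p = np.clip(expit(a * scores + b), 1e-12, 1 - 1e-12)
        return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

    res = minimize(nll, x0=np.array([1.0, 0.0]), method="L-BFGS-B")
    return res.x  # (a, b)
```

Note that nothing constrains a to stay positive, so even this two-parameter family can in principle flip decisions, which is the same accuracy concern the results above raise for general linear (matrix/Dirichlet scaling) maps.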
Reference
  • Brier, G. W. Verification of forecasts expressed in terms of probability. Monthly weather review, 78(1):1–3, 1950.
  • Cao, Z., Simon, T., Wei, S.-E., and Sheikh, Y. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.
  • Caruana, R., Lou, Y., Gehrke, J., Koch, P., Sturm, M., and Elhadad, N. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1721–1730, 2015.
  • Clenshaw, C. W. and Curtis, A. R. A method for numerical integration on an automatic computer. Numerische Mathematik, 2(1):197–205, 1960.
  • Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255.
  • Facchinei, F. and Pang, J.-S. Finite-dimensional variational inequalities and complementarity problems. Springer Science & Business Media, 2007.
  • Gal, Y. and Ghahramani, Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059, 2016.
  • Girshick, R. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448, 2015.
  • Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 1321–1330. JMLR. org, 2017.
  • He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
  • Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708, 2017.
  • Jiang, X., Osl, M., Kim, J., and Ohno-Machado, L. Calibrating predictive model estimates to support personalized medicine. Journal of the American Medical Informatics Association, 19(2):263–274, 2012.
  • Kendall, A. and Gal, Y. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in neural information processing systems, pp. 5574–5584, 2017.
  • Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2014.
  • Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.
  • Kull, M., Silva Filho, T., and Flach, P. Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers. In Artificial Intelligence and Statistics, pp. 623–631, 2017a.
  • Kull, M., Silva Filho, T. M., Flach, P., et al. Beyond sigmoids: How to obtain well-calibrated probabilities from binary classifiers with beta calibration. Electronic Journal of Statistics, 11(2):5052–5080, 2017b.
  • Kull, M., Nieto, M. P., Kangsepp, M., Silva Filho, T., Song, H., and Flach, P. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with dirichlet calibration. In Advances in Neural Information Processing Systems, pp. 12295–12305, 2019.
  • Kumar, A., Sarawagi, S., and Jain, U. Trainable calibration measures for neural networks from kernel mean embeddings. In International Conference on Machine Learning, pp. 2805–2814, 2018.
  • Liu, D. C. and Nocedal, J. On the limited memory bfgs method for large scale optimization. Mathematical programming, 45(1-3):503–528, 1989.
  • Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., and Wilson, A. G. A simple baseline for bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, pp. 13132–13143, 2019.
  • Mozafari, A. S., Gomes, H. S., Leao, W., and Gagne, C. Unsupervised temperature scaling: Post-processing unsupervised calibration of deep models decisions. 2019.
  • Muller, R., Kornblith, S., and Hinton, G. E. When does label smoothing help? In Advances in Neural Information Processing Systems, pp. 4696–4705, 2019.
  • Naeini, M. P., Cooper, G., and Hauskrecht, M. Obtaining well calibrated probabilities using bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
  • Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. 2011.
  • Nixon, J., Dusenberry, M., Zhang, L., Jerfel, G., and Tran, D. Measuring calibration in deep learning. arXiv preprint arXiv:1904.01685, 2019.
  • Pereyra, G., Tucker, G., Chorowski, J., Kaiser, Ł., and Hinton, G. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.
  • Platt, J. et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3):61–74, 1999.
  • Ren, S., He, K., Girshick, R., and Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99, 2015.
  • Seo, S., Seo, P. H., and Han, B. Learning for single-shot confidence calibration in deep neural networks through stochastic inferences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9030–9038, 2019.
  • Thulasidasan, S., Chennupati, G., Bilmes, J. A., Bhattacharya, T., and Michalak, S. On mixup training: Improved calibration and predictive uncertainty for deep neural networks. In Advances in Neural Information Processing Systems, pp. 13888–13899, 2019.
  • Wehenkel, A. and Louppe, G. Unconstrained monotonic neural networks. In Advances in Neural Information Processing Systems, pp. 1543–1553, 2019.
  • Xing, C., Arik, S., Zhang, Z., and Pfister, T. Distancebased learning from errors for confidence calibration. In International Conference on Learning Representations, 2020.
  • Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., and Yoo, Y. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6023– 6032, 2019.
  • Zadrozny, B. and Elkan, C. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 694–699, 2002.
  • Zagoruyko, S. and Komodakis, N. Wide residual networks. In BMVC, 2016.
  • Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. In 6th International Conference on Learning Representations, ICLR 2018, 2018.
  • The above example shows that f may not be differentiable at tied inputs. On the other hand, it is straightforward to see that f is differentiable at points with no ties. More precisely, at points whose input vector contains ties, the function f is B-differentiable, which is a weaker condition than the usual (Fréchet) differentiability.
    Definition 7 (Facchinei & Pang, 2007). A function f : R^n → R^m is B(ouligand)-differentiable at a point x ∈ R^n if f is Lipschitz continuous in a neighborhood of x and directionally differentiable at x.
    Proposition 3. For f : R^n → R^n in Theorem 1, let w(x) be as defined in Corollary 4. If σ and m are continuously differentiable, then f is B-differentiable on R^n.
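To make the differentiability issue at tied inputs concrete, consider the sorting map, a prototypical piecewise-linear, order-dependent operation: it is Lipschitz and directionally differentiable everywhere, yet at a point with tied entries its directional derivative is not linear in the direction, so it is B-differentiable but not (Fréchet) differentiable there. A small numerical check (NumPy; an illustration, not the authors' construction):

```python
import numpy as np

def sort_desc(x):
    """Sort the components of x in decreasing order."""
    return np.sort(x)[::-1]

def directional_derivative(g, x, d, eps=1e-6):
    """One-sided difference quotient approximating D g(x; d)."""
    return (g(x + eps * d) - g(x)) / eps

x = np.array([1.0, 1.0, 0.0])        # tie between the first two coordinates
d = np.array([1.0, -1.0, 0.0])

fwd = directional_derivative(sort_desc, x, d)    # ≈ [ 1, -1, 0]
bwd = directional_derivative(sort_desc, x, -d)   # ≈ [ 1, -1, 0]
print(fwd, -bwd)  # D g(x; d) != -D g(x; -d): no linear (Frechet) derivative at x,
                  # although both one-sided directional derivatives exist.
```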