# Intra Order-preserving Functions for Calibration of Multi-Class Neural Networks

NeurIPS 2020.

Abstract:

Predicting calibrated confidence scores for multi-class deep networks is important for avoiding rare but costly mistakes. A common approach is to learn a post-hoc calibration function that transforms the output of the original network into calibrated confidence scores while maintaining the network's accuracy. However, previous post-hoc calibration …


Introduction

- Deep neural networks have demonstrated impressive accuracy in classification tasks, such as image recognition (He et al, 2016; Ren et al, 2015) and medical research (Jiang et al, 2012; Caruana et al, 2015)
- These exciting results have recently motivated engineers to adopt deep networks as default components in building decision systems; for example, a multi-class neural network can be treated as a probabilistic predictor and its softmax output can provide the confidence scores of different actions for the downstream decision making pipeline (Girshick, 2015; Cao et al, 2017; Mozafari et al, 2019).

Highlights

- Deep neural networks have demonstrated impressive accuracy in classification tasks, such as image recognition (He et al, 2016; Ren et al, 2015) and medical research (Jiang et al, 2012; Caruana et al, 2015)
- These results have motivated engineers to adopt deep networks as default components in building decision systems; for example, a multi-class neural network can be treated as a probabilistic predictor, and its softmax output can provide the confidence scores of different actions for the downstream decision-making pipeline (Girshick, 2015; Cao et al., 2017; Mozafari et al., 2019)
- We identify necessary and sufficient conditions for describing intra order-preserving functions, and propose a novel neural network architecture that can represent complex intra order-preserving functions using common neural network components
- We introduce the family of intra order-preserving functions, which retain the top-k predictions of any deep network when used as the post-hoc calibration function
- We propose a new neural network architecture to represent these functions, and new regularization techniques based on order-invariant and diagonal structures
- Our method outperforms state-of-the-art post-hoc calibration methods, namely temperature scaling and Dirichlet calibration, in multiple settings
- The experimental results show the importance of learning within the intra order-preserving family, and support the effectiveness of the proposed regularization in calibrating multiple classifiers on various datasets
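
The defining property of the functions highlighted above is that they never change the within-example ranking of classes. A minimal numerical sketch of this property, using temperature scaling (one simple member of the family, per the results above) on random stand-in logits:

```python
import numpy as np

def temperature_scale(logits, T=1.5):
    """Temperature scaling: divide logits by a constant T > 0.
    One simple member of the intra order-preserving family."""
    return logits / T

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))       # 4 examples, 10 classes (toy data)
calibrated = temperature_scale(logits)

# An intra order-preserving map leaves the within-example ranking of
# classes untouched, so every top-k prediction is retained.
assert np.array_equal(np.argsort(logits, axis=1),
                      np.argsort(calibrated, axis=1))
```

The paper's architecture represents far richer members of this family than a single global rescaling, but the ranking invariant checked here is the same.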

Results

- Note that learning intra order-preserving functions without the order-invariant and diagonal assumptions (i.e., OP) does not perform well
- This result highlights that the intra order-preserving family alone can still be too general, and that proper additional regularization needs to be imposed on it
- Although Dir-ODIR and MS-ODIR were able to maintain the accuracy of the original models on these datasets (Kull et al., 2019), there is no guarantee that a linear transformation maintains the accuracy in general
- This becomes especially difficult as the number of classes grows, as the authors explore in the ImageNet experiment

Conclusion

- The authors introduce the family of intra order-preserving functions, which retain the top-k predictions of any deep network when used as the post-hoc calibration function.
- The authors propose a new neural network architecture to represent these functions, and new regularization techniques based on order-invariant and diagonal structures.
- Calibrating neural networks with this new family of functions generalizes many existing calibration techniques, with additional flexibility to express the post-hoc calibration function.
- The experimental results show the importance of learning within the intra order-preserving family, and support the effectiveness of the proposed regularization in calibrating multiple classifiers on various datasets.

Summary

## Introduction:

Deep neural networks have demonstrated impressive accuracy in classification tasks, such as image recognition (He et al., 2016; Ren et al., 2015) and medical research (Jiang et al., 2012; Caruana et al., 2015). These exciting results have recently motivated engineers to adopt deep networks as default components in building decision systems; for example, a multi-class neural network can be treated as a probabilistic predictor, and its softmax output can provide the confidence scores of different actions for the downstream decision-making pipeline (Girshick, 2015; Cao et al., 2017; Mozafari et al., 2019).
## Objectives:

The authors aim to learn general post-hoc calibration functions that can preserve the top-k predictions of any deep network. Given the original predictor φ_o and a calibration set D_c, the goal is to learn a post-hoc calibration function f : R^n → R^n such that the new probabilistic predictor φ := sm ∘ f ∘ g is better calibrated and keeps the accuracy of the original network φ_o, where g produces the logits and sm denotes the softmax.
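
The composition φ := sm ∘ f ∘ g can be sketched numerically. The logits here are random stand-ins, and the particular f (the same strictly increasing scalar map applied to every coordinate) is only a simple special case of the intra order-preserving family, chosen for illustration:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Stand-in logits g(x) for 5 examples and 3 classes (toy data).
rng = np.random.default_rng(1)
logits = rng.normal(size=(5, 3))

# A simple intra order-preserving f: an elementwise strictly
# increasing map (the paper's family is far richer than this).
def f(x):
    return np.sign(x) * np.log1p(np.abs(x))

phi_orig = softmax(logits)     # original predictor   φ_o = sm ∘ g
phi_cal = softmax(f(logits))   # calibrated predictor φ   = sm ∘ f ∘ g

# The per-example argmax is unchanged, so accuracy is preserved.
assert np.array_equal(phi_orig.argmax(axis=1), phi_cal.argmax(axis=1))
```

Any f in the intra order-preserving family passes the same argmax check, which is exactly the accuracy-preservation goal stated above.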
## Results:

Note that learning intra order-preserving functions without the order-invariant and diagonal assumptions (i.e., OP) does not perform well. This result highlights that the intra order-preserving family alone can still be too general, and that proper additional regularization needs to be imposed on it.
- Although Dir-ODIR and MS-ODIR were able to maintain the accuracy of the original models on these datasets (Kull et al., 2019), there is no guarantee that a linear transformation maintains the accuracy in general
- This becomes especially difficult as the number of classes grows, as the authors explore in the ImageNet experiment
## Conclusion:

The authors introduce the family of intra order-preserving functions, which retain the top-k predictions of any deep network when used as the post-hoc calibration function. The authors propose a new neural network architecture to represent these functions, and new regularization techniques based on order-invariant and diagonal structures.
- Calibrating neural networks with this new family of functions generalizes many existing calibration techniques, with additional flexibility to express the post-hoc calibration function.
- The experimental results show the importance of learning within the intra order-preserving family, and support the effectiveness of the proposed regularization in calibrating multiple classifiers on various datasets.

- Table1: Statistics of the Evaluation Datasets
- Table2: ECE (with M = 15 bins) on various image classification datasets and models with different calibration methods. The subscript numbers represent the rank of the corresponding method on the given model/dataset. The accuracy of the uncalibrated model is shown in parentheses. The numbers in parentheses for the Dir-ODIR and MS-ODIR methods show the relative change in accuracy for each method
- Table3: Calibration results for pretrained ResNet 152 on ImageNet
- Table4: Calibration results for pretrained DenseNet 161 on ImageNet
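
Tables 2-4 report ECE with M = 15 bins. A common formulation of this metric can be sketched as follows; binning conventions vary across papers, so this is an assumption rather than the exact evaluation code behind the tables:

```python
import numpy as np

def ece(confidences, correct, M=15):
    """Expected Calibration Error with M equal-width confidence bins:
    a weighted average of |accuracy - mean confidence| per bin."""
    edges = np.linspace(0.0, 1.0, M + 1)
    n = len(confidences)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()        # accuracy in the bin
            conf = confidences[mask].mean()   # mean confidence in the bin
            total += mask.sum() / n * abs(acc - conf)
    return total

# Toy check: 80% confidence with 4/5 correct is perfectly calibrated.
conf = np.array([0.8, 0.8, 0.8, 0.8, 0.8])
corr = np.array([1, 1, 1, 1, 0])
print(ece(conf, corr))  # prints 0.0
```

With conf = 1.0 on four examples of which only two are correct, the same function returns 0.5, matching the intuition that overconfidence inflates ECE.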

Related work

- Many different post-hoc calibration methods have been studied in the literature (Platt et al., 1999; Guo et al., 2017; Kull et al., 2017a;b; 2019). Their main difference is in the parametric family of the calibration function. In Platt scaling (Platt et al., 1999), scale and shift parameters a, b ∈ R are learned. (An accompanying figure comparing whether each calibration family is order-invariant did not survive text extraction and is omitted.)


Reference

- Brier, G. W. Verification of forecasts expressed in terms of probability. Monthly weather review, 78(1):1–3, 1950.
- Cao, Z., Simon, T., Wei, S.-E., and Sheikh, Y. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.
- Caruana, R., Lou, Y., Gehrke, J., Koch, P., Sturm, M., and Elhadad, N. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1721–1730, 2015.
- Clenshaw, C. W. and Curtis, A. R. A method for numerical integration on an automatic computer. Numerische Mathematik, 2(1):197–205, 1960.
- Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255.
- Facchinei, F. and Pang, J.-S. Finite-dimensional variational inequalities and complementarity problems. Springer Science & Business Media, 2007.
- Gal, Y. and Ghahramani, Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059, 2016.
- Girshick, R. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448, 2015.
- Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 1321–1330. JMLR. org, 2017.
- He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708, 2017.
- Jiang, X., Osl, M., Kim, J., and Ohno-Machado, L. Calibrating predictive model estimates to support personalized medicine. Journal of the American Medical Informatics Association, 19(2):263–274, 2012.
- Kendall, A. and Gal, Y. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in neural information processing systems, pp. 5574–5584, 2017.
- Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2014.
- Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.
- Kull, M., Silva Filho, T., and Flach, P. Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers. In Artificial Intelligence and Statistics, pp. 623–631, 2017a.
- Kull, M., Silva Filho, T. M., Flach, P., et al. Beyond sigmoids: How to obtain well-calibrated probabilities from binary classifiers with beta calibration. Electronic Journal of Statistics, 11(2):5052–5080, 2017b.
- Kull, M., Nieto, M. P., Kangsepp, M., Silva Filho, T., Song, H., and Flach, P. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with dirichlet calibration. In Advances in Neural Information Processing Systems, pp. 12295–12305, 2019.
- Kumar, A., Sarawagi, S., and Jain, U. Trainable calibration measures for neural networks from kernel mean embeddings. In International Conference on Machine Learning, pp. 2805–2814, 2018.
- Liu, D. C. and Nocedal, J. On the limited memory bfgs method for large scale optimization. Mathematical programming, 45(1-3):503–528, 1989.
- Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., and Wilson, A. G. A simple baseline for bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, pp. 13132–13143, 2019.
- Mozafari, A. S., Gomes, H. S., Leao, W., and Gagne, C. Unsupervised temperature scaling: Post-processing unsupervised calibration of deep models decisions. 2019.
- Muller, R., Kornblith, S., and Hinton, G. E. When does label smoothing help? In Advances in Neural Information Processing Systems, pp. 4696–4705, 2019.
- Naeini, M. P., Cooper, G., and Hauskrecht, M. Obtaining well calibrated probabilities using bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
- Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. 2011.
- Nixon, J., Dusenberry, M., Zhang, L., Jerfel, G., and Tran, D. Measuring calibration in deep learning. arXiv preprint arXiv:1904.01685, 2019.
- Pereyra, G., Tucker, G., Chorowski, J., Kaiser, Ł., and Hinton, G. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.
- Platt, J. et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3):61–74, 1999.
- Ren, S., He, K., Girshick, R., and Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99, 2015.
- Seo, S., Seo, P. H., and Han, B. Learning for single-shot confidence calibration in deep neural networks through stochastic inferences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9030–9038, 2019.
- Thulasidasan, S., Chennupati, G., Bilmes, J. A., Bhattacharya, T., and Michalak, S. On mixup training: Improved calibration and predictive uncertainty for deep neural networks. In Advances in Neural Information Processing Systems, pp. 13888–13899, 2019.
- Wehenkel, A. and Louppe, G. Unconstrained monotonic neural networks. In Advances in Neural Information Processing Systems, pp. 1543–1553, 2019.
- Xing, C., Arik, S., Zhang, Z., and Pfister, T. Distancebased learning from errors for confidence calibration. In International Conference on Learning Representations, 2020.
- Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., and Yoo, Y. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6023– 6032, 2019.
- Zadrozny, B. and Elkan, C. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 694–699, 2002.
- Zagoruyko, S. and Komodakis, N. Wide residual networks. In BMVC, 2016.
- Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. In 6th International Conference on Learning Representations, ICLR 2018, 2018.
- The above example shows that f may not be differentiable for tied inputs. On the other hand, it is straightforward to see that f is differentiable at points where there is no tie. More precisely, at points whose input vector contains ties, the function f is B-differentiable, which is a weaker condition than the usual (Frechet) differentiability.
- Definition 7 (Facchinei & Pang, 2007). A function f : R^n → R^m is said to be B(ouligand)-differentiable at a point x ∈ R^n if f is Lipschitz continuous in a neighborhood of x and directionally differentiable at x.
- Proposition 3. For f : R^n → R^n in Theorem 1, let w(x) be as defined in Corollary 4. If σ and m are continuously differentiable, then f is B-differentiable on R^n.
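
The failure of differentiability at ties can be illustrated with a toy analogue of the paper's order-based construction: g(x) = max(x1, x2) depends on the ordering of its inputs, is Lipschitz, and is directionally differentiable everywhere, yet is not (Frechet) differentiable on the tie set x1 = x2. This g is our own illustration, not the f of Theorem 1:

```python
import numpy as np

def g(x):
    # Order-dependent function: returns the larger coordinate.
    return np.max(x)

x = np.array([1.0, 1.0])      # a tied input
h = 1e-6
d = np.array([1.0, 0.0])      # perturb only the first coordinate

# One-sided finite differences along +d and -d:
fwd = (g(x + h * d) - g(x)) / h      # raising x1 moves the max with it
bwd = (g(x - h * d) - g(x)) / (-h)   # lowering x1 leaves the max at x2

print(fwd, bwd)  # fwd ≈ 1, bwd ≈ 0: the slope depends on the direction
```

No single linear map reproduces both one-sided slopes, so no Frechet derivative exists at the tie, while every directional derivative is finite; this is exactly the B-differentiable regime described in Definition 7.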
