# All your loss are belong to Bayes

NeurIPS 2020.

Abstract:

Loss functions are a cornerstone of machine learning and the starting point of most algorithms. Statistics and Bayesian decision theory have contributed, via properness, to elicit over the past decades a wide set of admissible losses in supervised learning, to which most popular choices belong (logistic, square, Matsushita, etc.). Rathe…

Introduction

- The loss function is a cornerstone of supervised learning. A rich literature on admissible losses has been developed since the early seventies in statistical decision theory [Sav71], and even earlier in foundational philosophical work [dF49].
- A significant body of work has focused on eliciting the set of admissible losses, yet in comparison with the vivid breakthroughs on models that have flourished during the past decade in machine learning, the decades-long picture of the loss resembles a still life — more often than not, it is fixed from the start, e.g. by assuming the popular logistic or square loss, or by assuming a restricted parametric form [Cza97, CM00, NDF00, CR02].
- More recent work has aimed to provide machine learning with greater flexibility in the loss [HT92, KS09, KKSK11, NM20] — yet these works face significant technical challenges arising from (i) the joint estimation of a non-parametric loss function along with the remainder of the model, and (ii) the specific part of a proper loss which is learned, called a link function, which relates class probability estimation to real-valued prediction [RW10].
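The composite structure mentioned above can be made concrete with the familiar logistic case: a proper loss on the probability scale composed with an inverse link that maps real-valued scores to probabilities. The sketch below is illustrative only; the function names are not the paper's code.

```python
import math

def sigmoid(v):
    """Canonical inverse link for the logistic loss: real score -> probability."""
    return 1.0 / (1.0 + math.exp(-v))

def log_loss(y, p):
    """Proper loss on the probability scale: y in {0, 1}, p in (0, 1)."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

def composite_loss(y, v, inv_link=sigmoid):
    """Proper composite loss: a proper loss composed with an inverse link [RW10]."""
    return log_loss(y, inv_link(v))

# With the canonical (sigmoid) link this recovers the familiar logistic loss:
v = 1.3
assert abs(composite_loss(1, v) - math.log(1.0 + math.exp(-v))) < 1e-12
```

Learning the loss then amounts to learning the (monotone) link rather than fixing the sigmoid a priori, which is exactly where the technical challenges cited above arise.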

Highlights

- The loss function is a cornerstone of supervised learning
- A significant body of work has focused on eliciting the set of admissible losses, yet in comparison with the vivid breakthroughs on models that have flourished during the past decade in machine learning, the decades-long picture of the loss resembles a still life — more often than not, it is fixed from the start, e.g. by assuming the popular logistic or square loss, or by assuming a restricted parametric form [Cza97, CM00, NDF00, CR02].
- (Proof in Appendix C.) In the context of Bayesian inference this means that the prior uncertainty in ν induces a prior on losses whose expected loss upper-bounds that of the canonical link. We exploit this property to initialise our Bayesian inference scheme with a fixed canonical link. The property follows from Theorem 4 if the Integrated Squared Gaussian Process (ISGP) is unbiased, which we show is trivial to guarantee.
- We have introduced a Bayesian approach to inferring a posterior distribution over loss functions for supervised learning that complies with the Bayesian notion of properness
- Our contribution thereby advances the seminal work of [KKSK11] and the more recent [NM20] in terms of modelling flexibility and — as a direct consequence — practical effectiveness, as evidenced by our state-of-the-art performance. Our model is both highly general and yet capable of outperforming even the most classic baseline, the logistic loss for binary classification. This represents an interesting step toward more flexible modelling in a wider machine learning context, which typically works with a loss function that is prescribed a priori.
- Since the tricks we use essentially rely on the loss being expressible as a Bregman divergence and since Bregman divergences are a principled distortion measure for unsupervised learning — such as in the context of the popular k-means and Expectation Maximisation (EM) algorithms — an interesting avenue for future work is to investigate the potential of our approach for unsupervised learning

Methods

- The authors provide illustrative examples and quantitative comparisons of the ISGP prior in univariate regression/classification, and in inferring proper losses.

Results

- Experiments show that the approach tends to significantly beat the state of the art [KS09, KKSK11, NM20], and records better results than baselines informed with specific losses or links, which to the authors' knowledge is a first among approaches learning a loss or link.

Conclusion

- The authors have introduced a Bayesian approach to inferring a posterior distribution over loss functions for supervised learning that complies with the Bayesian notion of properness.
- The authors' contribution thereby advances the seminal work of [KKSK11] and the more recent [NM20] in terms of modelling flexibility and — as a direct consequence — practical effectiveness, as evidenced by the state-of-the-art performance.
- The authors' model is both highly general and yet capable of outperforming even the most classic baseline, the logistic loss for binary classification.
- Since the tricks the authors use essentially rely on the loss being expressible as a Bregman divergence, and since Bregman divergences are a principled distortion measure for unsupervised learning — such as in the context of the popular k-means and EM algorithms — an interesting avenue for future work is to investigate the potential of the approach for unsupervised learning.
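The Bregman-divergence connection to k-means rests on a classical fact: the mean is the population minimiser of the expected Bregman divergence for any generator [BGW05]. A small sketch of the definition and this property (variable names are illustrative):

```python
import numpy as np

def bregman(phi, grad_phi, p, q):
    """Bregman divergence D_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>."""
    return phi(p) - phi(q) - np.dot(grad_phi(q), p - q)

# Generator phi(x) = ||x||^2 recovers the squared Euclidean distance.
phi = lambda x: np.dot(x, x)
grad = lambda x: 2.0 * x

p = np.array([1.0, 2.0])
q = np.array([0.0, 0.5])
assert np.isclose(bregman(phi, grad, p, q), np.sum((p - q) ** 2))

# The mean minimises the total Bregman divergence to a point set [BGW05],
# which is why k-means centroid updates are Bregman-optimal.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]])
mean = pts.mean(axis=0)
total = lambda c: sum(bregman(phi, grad, x, c) for x in pts)
assert total(mean) <= total(mean + np.array([0.1, -0.2]))
```

Swapping the generator (e.g. negative entropy for the KL divergence) changes the distortion measure while keeping the same centroid property, which is what makes the family attractive for unsupervised learning.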

Summary


- Table1: Mean test set negative log likelihoods for various isotonic regression methods. As expected, the Bayesian GP and ISGP models, which have the advantage of making stronger regularity assumptions, perform relatively better with less data (the Small case); the ISGP is in turn superior to the GP in that case. See text for details
- Table2: Test AUC for generalised linear models with various link methods (ordered by decreasing average). See text for details
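Table1 compares isotonic regression methods. For reference, the classic pool-adjacent-violators algorithm (PAVA) that underlies much of this literature can be sketched as follows; this is a generic textbook version, not the paper's implementation.

```python
def pava(y):
    """Pool-adjacent-violators: least-squares non-decreasing fit to y."""
    # Each block stores (sum, count); adjacent blocks that violate
    # monotonicity of their means are merged.
    blocks = []
    for v in y:
        blocks.append((v, 1))
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s2, n2 = blocks.pop()
            s1, n1 = blocks.pop()
            blocks.append((s1 + s2, n1 + n2))
    out = []
    for s, n in blocks:
        out.extend([s / n] * n)  # each point gets its block's mean
    return out

# The violating pair (3, 2) is pooled to its mean 2.5:
assert pava([1.0, 3.0, 2.0, 4.0]) == [1.0, 2.5, 2.5, 4.0]
```

The Bayesian GP/ISGP alternatives in Table1 replace this hard projection with a prior over monotone functions, which is what buys the regularity advantage in the small-data regime.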

Study subjects and analysis

MNIST-like datasets: 3

Moreover, the monotonic ISGP-Linkgistic slightly outperforms GP-Linkgistic and, as far as we know, records the first result beating logistic regression on this problem, by a reasonable margin on fmnist [NM20]. We further benchmarked ISGP-Linkgistic against GP-Linkgistic and logistic regression (the latter was the strongest practical algorithm in the experiments of [NM20]) on a broader set of tasks, namely the three MNIST-like datasets of [LC10, XRV17, CBIK+18]. We found that ISGP-Linkgistic dominates on all three datasets as the training set size increases; see Figure 2 and the caption therein for more details.

Figure 9 depicts an example of the learned (inverse) link functions.

Reference

- Miriam Ayer, H. D. Brunk, G. M. Ewing, W. T. Reid, and Edward Silverman. An empirical distribution function for sampling with incomplete information. Annals of Mathematical Statistics, (4), 1955.
- S.-I. Amari. New developments of information geometry (17): Tsallis q-entropy, escort geometry, conformal geometry. In Mathematical Sciences (suurikagaku), number 592, pages 73–8. Science Company, October 2012. In Japanese.
- S.-I. Amari. New developments of information geometry (26): Information geometry of convex programming and game theory. In Mathematical Sciences (suurikagaku), number 605, pages 65–74. Science Company, November 201. In Japanese.
- S.-I. Amari and H. Nagaoka. Methods of Information Geometry. Oxford University Press, 2000.
- [ANARL19] Clement Abi Nader, Nicholas Ayache, Philippe Robert, and Marco Lorenzi. Monotonic Gaussian Process for Spatio-Temporal Disease Progression Modeling in Brain Imaging Data. NeuroImage, 2019.
- Pankaj K. Agarwal, Jeff M. Phillips, and Bardia Sadri. Lipschitz unimodal and isotonic regression on paths and trees. In Proc. of the 9th Latin American Symposium on Theoretical Informatics, pages 384–396, 2010.
- Francis Bach. Efficient algorithms for non-convex isotonic regression through submodular optimization. In Advances in Neural Information Processing Systems 31. 2018.
- [BGW05] A. Banerjee, X. Guo, and H. Wang. On the optimality of conditional expectation as a bregman predictor. IEEE Trans. IT, 51:2664–2669, 2005.
- [BJM06] P. Bartlett, M. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. J. of the Am. Stat. Assoc., 101:138–156, 2006.
- Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal of Scientific Computation, 1995.
- L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comp. Math. and Math. Phys., 7:200–217, 1967.
- A. Buja, W. Stuetzle, and Y. Shen. Loss functions for binary class probability estimation and classification: structure and applications, 2005. Technical Report, University of Pennsylvania.
- Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.
- [CBG+17] M. Cissé, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier. Parseval networks: improving robustness to adversarial examples. In 34th ICML, 2017.
- [CBIK+18] Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical japanese literature, 2018.
- Claudia Czado and Axel Munk. Noncanonical links in generalized linear models - when is the effort justified? Journal of Statistical Planning and Inference, 87, 2000.
- Claudia Czado and Adrian Raftery. Choosing the link function and accounting for link uncertainty in generalized linear models using bayes factors. Statistical Papers, 47, 2002.
- Claudia Czado. On selecting parametric link transformation families in generalized linear models. Journal of Statistical Planning and Inference, 61:125–139, 05 1997.
- [dF49] B. de Finetti. Philosophy of Sciences, pages 67–82, 1949.
- [DG17] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
- S. Flaxman, Y.W. Teh, and D. Sejdinovic. Poisson Intensity Estimation with Reproducing Kernels. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
- [GBCC15] Shirin Golchi, D. Bingham, H. Chipman, and David Campbell. Monotone emulation of computer experiments. SIAM/ASA Journal on Uncertainty Quantification, 3:370–392, 01 2015.
- [HHLK19] Pashupati Hegde, Markus Heinonen, Harri Lähdesmäki, and Samuel Kaski. Deep learning with differential gaussian process flows. In Proceedings of Machine Learning Research, volume 89, pages 1812–1821, 16–18 Apr 2019.
- W.K. Hardle and Berwin Turlach. Nonparametric approaches to generalized linear models. 02 1992.
- Ieva Kazlauskaite, Carl Henrik Ek, and Neill D. F. Campbell. Gaussian process latent variable alignment learning. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, 2019.
- [KKSK11] Sham M Kakade, Varun Kanade, Ohad Shamir, and Adam Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In Advances in Neural Information Processing Systems 24. 2011.
- M.J. Kearns and Y. Mansour. On the boosting ability of top-down decision tree learning algorithms. J. Comp. Syst. Sc., 58:109–128, 1999.
- J.B. Kruskal. Nonmetric multidimensional scaling: A numerical method. Psychometrika, 1964.
- Adam Tauman Kalai and Ravi Sastry. The isotron algorithm: High-dimensional isotonic regression. In COLT, 2009.
- Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.
- Chris Lloyd, Tom Gunter, Michael Osborne, and Stephen Roberts. Variational inference for gaussian process modulated poisson processes. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 1814–1822, Lille, France, 2015. PMLR.
- Siqi Liu and Milos Hauskrecht. Nonparametric regressive point processes based on conditional gaussian processes. In Advances in Neural Information Processing Systems 2019.
- Cong Han Lim. An efficient pruning algorithm for robust isotonic regression. In Advances in Neural Information Processing Systems 31. 2018.
- Ronny Luss and Saharon Rosset. Generalized isotonic regression. Journal of Computational and Graphical Statistics, 23, 2014.
- [LRG+17] Cheng Li, Santu Rana, Satyandra K. Gupta, Vu Nguyen, and Svetha Venkatesh. Bayesian optimization with monotonicity information. In NIPS Workshop on Bayesian Optimization, 2017.
- Peter McCullagh and Jesper Møller. The permanental process. Advances in Applied Probability, 38(4), 2006.
- Arak Mathai and Serge Provost. Quadratic Forms in Random Variables: Theory and Applications. Marcel Dekker, Inc., 1992.
- [MXZ06] Charles A. Micchelli, Yuesheng Xu, and Haizhang Zhang. Universal kernels. J. Mach. Learn. Res., 7:2651–2667, 2006.
- Ioannis Ntzoufras, Petros Dellaportas, and Jonathan Forster. Bayesian variable and link determination for generalised linear models. Journal of Statistical Planning and Inference, 111, 03 2000.
- Richard Nock and Aditya Krishna Menon. Supervised learning: No loss no cry. In ICML’20, 2020.
- Richard Nock and Frank Nielsen. On the efficient minimization of classification-calibrated surrogates. In NIPS*21, pages 1201–1208, 2008.
- [NNA16] Richard Nock, Frank Nielsen, and Shun-ichi Amari. On conformal divergences and their population minimizers. IEEE Trans. IT, 62:1–12, 2016.
- Evert Johannes Nyström. Über die praktische auflösung von linearen integralgleichungen mit anwendungen auf randwertaufgaben der potentialtheorie. Commentationes physicomathematicae, 4:1–52, 1928.
- Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
- Jaakko Riihimäki and Aki Vehtari. Gaussian processes with monotonicity information. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010.
- Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2005.
- Mark D. Reid and Robert C. Williamson. Composite binary losses. Journal of Machine Learning Research, 11:2387–2422, 2010.
- L.-J. Savage. Elicitation of personal probabilities and expectations. J. of the Am. Stat. Assoc., 66:783–801, 1971.
- M.-J. Schervish. A general method for comparing probability assessors. Ann. of Stat., 17(4):1856–1879, 1989.
- Hans J. Skaug and David A. Fournier. Automatic approximation of the marginal likelihood in non-gaussian hierarchical models. Computational Statistical Data Analysis, 51, 2006.
- Eero Siivola, Juho Piironen, and Aki Vehtari. Automatic monotonicity detection for gaussian processes. In arXiv 1610.05440, 2016.
- Tomoyuki Shirai and Yoichiro Takahashi. Random point fields associated with certain fredholm determinants ii: Fermion shifts and their ergodic and gibbs properties. The Annals of Probability, (3), 07 2003.
- M. Telgarsky. Boosting with the logistic loss is consistent. In 26th COLT, pages 911–965, 2013.
- [UKEC19] Ivan Ustyuzhaninov, Ieva Kazlauskaite, Carl Henrik Ek, and Neill D. F. Campbell. Monotonic gaussian process flow. In arXiv 1905.12930, 2019.
- Christian J. Walder and Adrian N. Bishop. Fast bayesian intensity estimation for the permanental process. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, 2017.
- Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017.
- Ang Yang, Cheng Li, Santu Rana, Sunil Gupta, and Svetha Venkatesh. Sparse approximation for gaussian process with derivative observations. In AI 2018: Advances in Artificial Intelligence, 2018.
- L Yeganova and W Wilbur. Isotonic regression under lipschitz constraint. Journal of Optimization Theory and Applications, 2009.
- Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Knowledge Discovery and Data Mining, 2002.
- J. Zhang. Divergence function, duality, and convex analysis. Neural Computation, 16:159–195, 2004.
