All your loss are belong to Bayes

NeurIPS 2020.

We introduce a trick on squared Gaussian Processes to obtain a random process whose paths are compliant source functions with many desirable properties in the context of link estimation

Abstract:

Loss functions are a cornerstone of machine learning and the starting point of most algorithms. Statistics and Bayesian decision theory have contributed, via properness, to elicit over the past decades a wide set of admissible losses in supervised learning, to which most popular choices belong (logistic, square, Matsushita, etc.). Rathe...
Introduction
  • The loss function is a cornerstone of supervised learning. A rich literature on admissible losses has been developed from the early seventies in statistical decision theory [Sav71], and still earlier in foundational philosophical work [dF49].
  • A significant body of work has focused on eliciting the set of admissible losses, yet in comparison with the vivid breakthroughs on models that have flourished during the past decade in machine learning, the decades-long picture of the loss resembles a still life — more often than not, it is fixed from the start, e.g. by assuming the popular logistic or square loss, or by assuming a restricted parametric form [Cza97, CM00, NDF00, CR02].
  • More recent work has aimed to provide machine learning with greater flexibility in the loss [HT92, KS09, KKSK11, NM20] — yet these works face significant technical challenges arising from (i) the joint estimation of a non-parametric loss function along with the remainder of the model, and (ii) the specific part of a proper loss which is learned, called a link function, which relates class probability estimation to real-valued prediction [RW10]; the worked example below recalls how a proper loss and its link fit together.
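As a brief, standard recap (textbook material in the spirit of [Sav71, RW10], not a reproduction of the paper's own construction): a proper loss for binary class probability estimation determines a concave conditional Bayes risk, whose negative derivative gives the canonical link relating probabilities to real-valued predictions. Instantiated for the logistic loss:

```latex
\underline{L}(\eta) \;=\; \inf_{\hat\eta \in (0,1)}\Big[\eta\,\ell(1,\hat\eta) + (1-\eta)\,\ell(-1,\hat\eta)\Big],
\qquad
\psi(\eta) \;=\; -\underline{L}'(\eta) \quad \text{(up to an additive constant).}

% The composite loss scores a real-valued prediction v through \ell(y, \psi^{-1}(v)).
% Logistic loss: \underline{L} is the binary entropy, so the canonical link is the logit:
\underline{L}(\eta) = -\eta\log\eta - (1-\eta)\log(1-\eta)
\;\Longrightarrow\;
\psi(\eta) = \log\tfrac{\eta}{1-\eta},
\qquad
\psi^{-1}(v) = \tfrac{1}{1+e^{-v}}.
```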
Highlights
  • The loss function is a cornerstone of supervised learning.
  • A significant body of work has focused on eliciting the set of admissible losses, yet in comparison with the vivid breakthroughs on models that have flourished during the past decade in machine learning, the decades-long picture of the loss resembles a still life — more often than not, it is fixed from the start, e.g. by assuming the popular logistic or square loss, or by assuming a restricted parametric form [Cza97, CM00, NDF00, CR02].
  • (Proof in Appendix C.) In the context of Bayesian inference this means that the prior uncertainty in ν induces a prior on losses whose expected loss upper bounds that of the canonical link. We exploit this property to initialise our Bayesian inference scheme with a fixed canonical link. This property follows from Theorem 4 if the Integrated Squared Gaussian Process (ISGP) is unbiased, which we show is trivial to guarantee; a toy sketch of the integrated-squared-GP construction appears after this list.
  • We have introduced a Bayesian approach to inferring a posterior distribution over loss functions for supervised learning that complies with the Bayesian notion of properness.
  • Our contribution thereby advances the seminal work of [KKSK11] and the more recent [NM20] in terms of modelling flexibility and — as a direct consequence — practical effectiveness as evidenced by our state-of-the-art performance. Our model is both highly general, and yet capable of outperforming even the most classic baseline, the logistic loss for binary classification. This represents an interesting step toward more flexible modelling in a wider machine learning context, which typically works with a loss function that is prescribed a priori.
  • Since the tricks we use essentially rely on the loss being expressible as a Bregman divergence and since Bregman divergences are a principled distortion measure for unsupervised learning — such as in the context of the popular k-means and Expectation Maximisation (EM) algorithms — an interesting avenue for future work is to investigate the potential of our approach for unsupervised learning.
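As referenced in the ISGP bullet above, the toy sketch below illustrates the general idea only (the grid, kernel and hyperparameters are hypothetical choices, not the paper's implementation): squaring a Gaussian-process path makes it non-negative, so integrating the square yields a non-decreasing path that can serve as a candidate monotone (inverse) link after rescaling.

```python
import numpy as np

def rbf_kernel(x, lengthscale=0.25, variance=1.0):
    """Squared-exponential covariance on a 1-D grid (illustrative hyperparameters)."""
    d = x[:, None] - x[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
K = rbf_kernel(x) + 1e-8 * np.eye(x.size)          # jitter for numerical stability

g = rng.multivariate_normal(np.zeros(x.size), K)   # one Gaussian-process sample path
dx = x[1] - x[0]
f = np.cumsum(g ** 2) * dx                         # integral of a squared path: non-decreasing

# Rescaled to [0, 1], f behaves like a monotone candidate (inverse) link on the grid.
u = (f - f.min()) / (f.max() - f.min() + 1e-12)
```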
Methods
  • The authors provide illustrative examples and quantitative comparisons of the ISGP prior in univariate regression/classification and in inferring proper losses; for orientation, a minimal classical isotonic-regression baseline of the kind compared against in Table 1 is sketched below.
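For orientation only: this is a minimal sketch of a classical point-estimate isotonic fit (pool-adjacent-violators, via scikit-learn) of the kind reported alongside the Bayesian GP/ISGP models in Table 1. The synthetic data and settings are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Toy monotone-regression problem: noisy observations of a non-decreasing signal.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, size=50))
y = np.clip(x + 0.1 * rng.normal(size=50), 0.0, 1.0)

# Classical pool-adjacent-violators estimate: a piecewise-constant, non-decreasing
# fit that returns a single point estimate, with no posterior uncertainty attached.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
y_hat = iso.fit_transform(x, y)
```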
Results
  • Experiments show that the approach tends to significantly beat the state of the art [KS09, KKSK11, NM20], and records better results than baselines informed with specific losses or links, which to the authors' knowledge is a first among approaches learning a loss or link.
Conclusion
  • The authors have introduced a Bayesian approach to inferring a posterior distribution over loss functions for supervised learning that complies with the Bayesian notion of properness.
  • The authors' contribution thereby advances the seminal work of [KKSK11] and the more recent [NM20] in terms of modelling flexibility and — as a direct consequence — practical effectiveness, as evidenced by the state-of-the-art performance.
  • The authors' model is both highly general, and yet capable of outperforming even the most classic baseline, the logistic loss for binary classification.
  • Since the tricks the authors use essentially rely on the loss being expressible as a Bregman divergence and since Bregman divergences are a principled distortion measure for unsupervised learning — such as in the context of the popular k-means and EM algorithms — an interesting avenue for future work is to investigate the potential of the approach for unsupervised learning (a brief recap of Bregman divergences follows below).
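For reference, the Bregman machinery invoked in this bullet is standard material (not specific to this paper): for a strictly convex, differentiable generator F, the divergence and two familiar special cases are

```latex
D_F(x, y) \;=\; F(x) - F(y) - \langle x - y,\, \nabla F(y)\rangle .

% F(x) = \|x\|_2^2           \Rightarrow  D_F(x,y) = \|x-y\|_2^2   (the distortion minimised by k-means);
% F(p) = \sum_i p_i \log p_i \Rightarrow  D_F(p,q) = \mathrm{KL}(p\,\|\,q)  on probability vectors.
% For a proper loss with differentiable conditional Bayes risk \underline{L}, the pointwise regret
% is itself a Bregman divergence of the convex generator -\underline{L}:
%   \eta\,\ell(1,\hat\eta) + (1-\eta)\,\ell(-1,\hat\eta) \;-\; \underline{L}(\eta) \;=\; D_{-\underline{L}}(\eta,\hat\eta).
```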
Tables
  • Table 1: As expected, the Bayesian GP and ISGP models — which have the advantage of making stronger regularity assumptions — perform relatively better with less data (the Small case); our ISGP is in turn superior to the GP in that case. Mean test set negative log likelihoods for various isotonic regression methods. See text for details.
  • Table 2: Test AUC for generalised linear models with various link methods (ordered by decreasing average). See text for details.
Study subjects and analysis
MNIST-like datasets: 3
Moreover, the monotonic ISGP-Linkgistic slightly outperforms GP-Linkgistic, and as far as we know records the first result beating logistic regression on this problem, by a reasonable margin on fmnist [NM20]. We further benchmarked ISGP-Linkgistic against GP-Linkgistic and logistic regression (as the latter was the strongest practical algorithm in the experiments of [NM20]) on a broader set of tasks, namely the three MNIST-like datasets of [LC10, XRV17, CBIK+18]. We found that ISGP-Linkgistic dominates on all three datasets as the training set size increases — see Figure 2 and the caption therein for more details. Figure 9 depicts an example of the learned (inverse) link functions; a sketch of how such a learned monotone inverse link slots into a generalised linear model follows below.
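To make the role of a learned link concrete, here is a minimal, hypothetical sketch (not the paper's ISGP-Linkgistic model; the helper names and the interpolated "learned" link are purely illustrative) of how any non-decreasing inverse link can replace the sigmoid in a generalised linear model's class-probability estimates.

```python
import numpy as np

def predict_proba(X, w, inverse_link):
    """Class-probability estimates p = u(X @ w) for a generalised linear model
    whose inverse link u maps real-valued scores into (0, 1)."""
    return np.clip(inverse_link(X @ w), 1e-6, 1.0 - 1e-6)

def sigmoid(s):
    # Fixed inverse link: plain logistic regression.
    return 1.0 / (1.0 + np.exp(-s))

# A "learned" inverse link only needs to be non-decreasing with values in (0, 1).
# Here it is faked by monotone interpolation over (score, probability) knots,
# standing in for a posterior-mean link like those shown in Figure 9.
rng = np.random.default_rng(1)
knot_scores = np.linspace(-4.0, 4.0, 9)
knot_probs = np.sort(np.clip(sigmoid(knot_scores) + 0.05 * rng.normal(size=9), 0.01, 0.99))

def learned_link(s):
    return np.interp(s, knot_scores, knot_probs)  # piecewise-linear, non-decreasing

# Both links are used identically downstream, e.g.
#   p_test = predict_proba(X_test, w, learned_link)
# and then scored with test AUC or negative log likelihood as in Tables 1-2.
```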

References
  • Miriam Ayer, H. D. Brunk, G. M. Ewing, W. T. Reid, and Edward Silverman. An empirical distribution function for sampling with incomplete information. Annals of Mathematical Statistics, (4), 1955.
  • S.-I. Amari. New developments of information geometry (17): Tsallis q-entropy, escort geometry, conformal geometry. In Mathematical Sciences (suurikagaku), number 592, pages 73–8. Science Company, October 2012. In Japanese.
  • S.-I. Amari. New developments of information geometry (26): Information geometry of convex programming and game theory. In Mathematical Sciences (suurikagaku), number 605, pages 65–74. Science Company, November 2013. In Japanese.
  • S.-I. Amari and H. Nagaoka. Methods of Information Geometry. Oxford University Press, 2000.
  • [ANARL19] Clement Abi Nader, Nicholas Ayache, Philippe Robert, and Marco Lorenzi. Monotonic Gaussian process for spatio-temporal disease progression modeling in brain imaging data. NeuroImage, 2019.
  • Pankaj K. Agarwal, Jeff M. Phillips, and Bardia Sadri. Lipschitz unimodal and isotonic regression on paths and trees. In Proc. of the 9th Latin American Symposium on Theoretical Informatics, pages 384–396, 2010.
  • Francis Bach. Efficient algorithms for non-convex isotonic regression through submodular optimization. In Advances in Neural Information Processing Systems 31, 2018.
  • [BGW05] A. Banerjee, X. Guo, and H. Wang. On the optimality of conditional expectation as a Bregman predictor. IEEE Trans. IT, 51:2664–2669, 2005.
  • [BJM06] P. Bartlett, M. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. J. of the Am. Stat. Assoc., 101:138–156, 2006.
  • Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 1995.
  • L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comp. Math. and Math. Phys., 7:200–217, 1967.
  • A. Buja, W. Stuetzle, and Y. Shen. Loss functions for binary class probability estimation and classification: structure and applications. Technical Report, University of Pennsylvania, 2005.
  • Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
  • [CBG+17] M. Cissé, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier. Parseval networks: improving robustness to adversarial examples. In 34th ICML, 2017.
  • [CBIK+18] Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical Japanese literature, 2018.
  • Claudia Czado and Axel Munk. Noncanonical links in generalized linear models - when is the effort justified? Journal of Statistical Planning and Inference, 87, 2000.
  • Claudia Czado and Adrian Raftery. Choosing the link function and accounting for link uncertainty in generalized linear models using Bayes factors. Statistical Papers, 47, 2002.
  • Claudia Czado. On selecting parametric link transformation families in generalized linear models. Journal of Statistical Planning and Inference, 61:125–139, 1997.
  • [dF49] B. de Finetti. Philosophy of Sciences, pages 67–82, 1949.
  • [DG17] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
  • S. Flaxman, Y.-W. Teh, and D. Sejdinovic. Poisson intensity estimation with reproducing kernels. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
  • [GBCC15] Shirin Golchi, D. Bingham, H. Chipman, and David Campbell. Monotone emulation of computer experiments. SIAM/ASA Journal on Uncertainty Quantification, 3:370–392, 2015.
  • [HHLK19] Pashupati Hegde, Markus Heinonen, Harri Lähdesmäki, and Samuel Kaski. Deep learning with differential Gaussian process flows. In Proceedings of Machine Learning Research, volume 89, pages 1812–1821, 2019.
  • W. K. Härdle and Berwin Turlach. Nonparametric approaches to generalized linear models, 1992.
  • Ieva Kazlauskaite, Carl Henrik Ek, and Neill D. F. Campbell. Gaussian process latent variable alignment learning. In The 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), Naha, Okinawa, Japan, 2019.
  • [KKSK11] Sham M. Kakade, Varun Kanade, Ohad Shamir, and Adam Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In Advances in Neural Information Processing Systems 24, 2011.
  • M.J. Kearns and Y. Mansour. On the boosting ability of top-down decision tree learning algorithms. J. Comp. Syst. Sc., 58:109–128, 1999.
  • J.B. Kruskal. Nonmetric multidimensional scaling: A numerical method. Psychometrika, 1964.
  • Adam Tauman Kalai and Ravi Sastry. The Isotron algorithm: High-dimensional isotonic regression. In COLT, 2009.
  • Yann LeCun and Corinna Cortes. MNIST handwritten digit database, 2010.
  • Chris Lloyd, Tom Gunter, Michael Osborne, and Stephen Roberts. Variational inference for Gaussian process modulated Poisson processes. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 1814–1822, Lille, France, 2015. PMLR.
  • Siqi Liu and Milos Hauskrecht. Nonparametric regressive point processes based on conditional Gaussian processes. In Advances in Neural Information Processing Systems, 2019.
  • Cong Han Lim. An efficient pruning algorithm for robust isotonic regression. In Advances in Neural Information Processing Systems 31, 2018.
  • Ronny Luss and Saharon Rosset. Generalized isotonic regression. Journal of Computational and Graphical Statistics, 23, 2014.
  • [LRG+17] Cheng Li, Santu Rana, Satyandra K. Gupta, Vu Nguyen, and Svetha Venkatesh. Bayesian optimization with monotonicity information. In NIPS Workshop on Bayesian Optimization, 2017.
  • Peter McCullagh and Jesper Møller. The permanental process. Advances in Applied Probability, 38(4), 2006.
  • Arak Mathai and Serge Provost. Quadratic Forms in Random Variables: Theory and Applications. Marcel Dekker, Inc., 1992.
  • [MXZ06] Charles A. Micchelli, Yuesheng Xu, and Haizhang Zhang. Universal kernels. J. Mach. Learn. Res., 7:2651–2667, 2006.
  • Ioannis Ntzoufras, Petros Dellaportas, and Jonathan Forster. Bayesian variable and link determination for generalised linear models. Journal of Statistical Planning and Inference, 111, 2000.
  • Richard Nock and Aditya Krishna Menon. Supervised learning: No loss no cry. In ICML’20, 2020.
  • Richard Nock and Frank Nielsen. On the efficient minimization of classification-calibrated surrogates. In NIPS*21, pages 1201–1208, 2008.
  • [NNA16] Richard Nock, Frank Nielsen, and Shun-ichi Amari. On conformal divergences and their population minimizers. IEEE Trans. IT, 62:1–12, 2016.
  • Evert Johannes Nyström. Über die praktische Auflösung von linearen Integralgleichungen mit Anwendungen auf Randwertaufgaben der Potentialtheorie. Commentationes Physico-Mathematicae, 4:1–52, 1928.
  • Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
  • Jaakko Riihimäki and Aki Vehtari. Gaussian processes with monotonicity information. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010.
  • Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2005.
  • Mark D. Reid and Robert C. Williamson. Composite binary losses. Journal of Machine Learning Research, 11:2387–2422, 2010.
  • L.-J. Savage. Elicitation of personal probabilities and expectations. J. of the Am. Stat. Assoc., 66:783–801, 1971.
  • M.-J. Schervish. A general method for comparing probability assessors. Ann. of Stat., 17(4):1856–1879, 1989.
  • Hans J. Skaug and David A. Fournier. Automatic approximation of the marginal likelihood in non-Gaussian hierarchical models. Computational Statistics & Data Analysis, 51, 2006.
  • Eero Siivola, Juho Piironen, and Aki Vehtari. Automatic monotonicity detection for Gaussian processes. arXiv:1610.05440, 2016.
  • Tomoyuki Shirai and Yoichiro Takahashi. Random point fields associated with certain Fredholm determinants II: fermion shifts and their ergodic and Gibbs properties. The Annals of Probability, (3), 2003.
  • M. Telgarsky. Boosting with the logistic loss is consistent. In 26th COLT, pages 911–965, 2013.
  • [UKEC19] Ivan Ustyuzhaninov, Ieva Kazlauskaite, Carl Henrik Ek, and Neill D. F. Campbell. Monotonic Gaussian process flow. arXiv:1905.12930, 2019.
  • Christian J. Walder and Adrian N. Bishop. Fast Bayesian intensity estimation for the permanental process. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.
  • Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017.
  • Ang Yang, Cheng Li, Santu Rana, Sunil Gupta, and Svetha Venkatesh. Sparse approximation for Gaussian process with derivative observations. In AI 2018: Advances in Artificial Intelligence, 2018.
  • L. Yeganova and W. Wilbur. Isotonic regression under Lipschitz constraint. Journal of Optimization Theory and Applications, 2009.
  • Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Knowledge Discovery and Data Mining, 2002.
  • J. Zhang. Divergence function, duality, and convex analysis. Neural Computation, 16:159–195, 2004.