# Supervised Learning: No Loss No Cry

ICML 2020, pp. 7370–7380.


Abstract

Supervised learning requires the specification of a loss function to minimise. While the theory of admissible losses from both a computational and statistical perspective is well-developed, these offer a panoply of different choices. In practice, this choice is typically made in an \emph{ad hoc} manner. In hopes of making this procedure…

Introduction

- Efficient supervised learning essentially started with the PAC framework of Valiant (1984), in which the goal was to learn, in polynomial time, a function able to predict labels for i.i.d. inputs.
- A less well-known subtlety follows: a proper loss as commonly used for real-valued prediction, such as the square and logistic losses, involves an implicit canonical link function (Reid & Williamson, 2010) that maps class probabilities to real values.
- This is exemplified by the sigmoid link in deep learning.
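The link/inverse-link pair can be made concrete. As a small illustrative sketch (not code from the paper), the logistic loss's canonical link is the logit, and its inverse is the sigmoid used to squash real-valued predictions in deep networks:

```python
import math

def logit(p):
    # canonical link of the logistic loss: class probability -> real value
    return math.log(p / (1 - p))

def sigmoid(z):
    # inverse link: real-valued prediction -> class probability
    return 1 / (1 + math.exp(-z))

# the link and its inverse compose to the identity on (0, 1)
for p in (0.1, 0.5, 0.8):
    assert abs(sigmoid(logit(p)) - p) < 1e-12
```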

Highlights

- Efficient supervised learning essentially started with the PAC framework of Valiant (1984), in which the goal was to learn, in polynomial time, a function able to predict labels for i.i.d. inputs.
- A zoo of losses started to be used for tractable machine learning (ML), the most popular ones built from the square loss and the logistic loss.
- Learning proper canonical losses: the authors focus on learning class probabilities over all losses having the expression in (13), but with the requirement that the canonical link be used, ψ ≐ −L′, thereby also learning the loss itself via its link.
- Fitting a loss that complies with Bayes decision theory means learning not just a classifier, but also the canonical link of a proper loss, and thus a proper canonical loss itself.
- In a seminal 2011 work, Kakade et al. made, with the SLIsotron algorithm, the first attempt at solving this bigger picture of supervised learning.
- The authors propose in this paper a more general approach, grounded in a general Bregman formulation of differentiable proper canonical losses.
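The connection between proper canonical losses and Bregman divergences can be checked numerically for the log loss (a standard textbook example, not code from the paper): the pointwise regret of predicting η̂ in place of the true class probability η equals the Bregman divergence whose generator is the negative pointwise Bayes risk, and the generator's derivative is exactly the canonical (logit) link.

```python
import math

def bayes_risk(p):
    # pointwise Bayes risk of the log loss: binary entropy of p
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def log_loss_regret(eta, eta_hat):
    # expected log loss of predicting eta_hat, minus the Bayes risk at eta
    pointwise = -eta * math.log(eta_hat) - (1 - eta) * math.log(1 - eta_hat)
    return pointwise - bayes_risk(eta)

def bregman(phi, dphi, a, b):
    # Bregman divergence D_phi(a, b) = phi(a) - phi(b) - phi'(b) (a - b)
    return phi(a) - phi(b) - dphi(b) * (a - b)

phi = lambda p: -bayes_risk(p)            # generator: negative Bayes risk
dphi = lambda p: math.log(p / (1 - p))    # its derivative: the canonical (logit) link

eta, eta_hat = 0.7, 0.4
assert abs(log_loss_regret(eta, eta_hat) - bregman(phi, dphi, eta, eta_hat)) < 1e-12
```

For the log loss this Bregman divergence is the binary KL divergence between η and η̂.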

Results

- The authors present experiments illustrating: (a) the viability of the BregmanTron as an alternative to classic GLM or SLIsotron learning; (b) the nature of the loss functions learned by the BregmanTron, which are potentially asymmetric; (c) the potential of using the loss function learned by the BregmanTron as input to some downstream learner.
- Predictive performance of the BregmanTron: the authors compare the BregmanTron, as a generic binary classification method, against the following baselines: logistic regression, GLMTron (Kakade et al., 2011) with u(·) the sigmoid, and SLIsotron.
- The Bayes-optimal solution for Pr(Y = 1 | X) can be derived in this case: it takes the form of a sigmoid, as assumed by logistic regression, composed with a linear model proportional to the expectation.
- In this case logistic regression works on a search space much smaller than the BregmanTron's, one guaranteed to contain the optimum.
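The sigmoid form of the Bayes-optimal posterior can be verified in a simplified one-dimensional setting (illustrative values, not the paper's exact experimental setup): for equal-prior Gaussian class conditionals with shared variance, Bayes' rule yields exactly a sigmoid of a linear function of x.

```python
import math

# 1-d example with equal priors and shared variance (illustrative values)
mu0, mu1, var = -1.0, 2.0, 1.5

def gauss(x, mu):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior_bayes(x):
    # Pr(Y = 1 | x) by direct application of Bayes' rule
    return gauss(x, mu1) / (gauss(x, mu1) + gauss(x, mu0))

def posterior_sigmoid(x):
    # sigmoid(w x + b), with w and b read off the log-density ratio
    w = (mu1 - mu0) / var
    b = (mu0 ** 2 - mu1 ** 2) / (2 * var)
    return 1 / (1 + math.exp(-(w * x + b)))

# the two computations agree everywhere
for x in (-2.0, 0.0, 0.5, 3.0):
    assert abs(posterior_bayes(x) - posterior_sigmoid(x)) < 1e-12
```

This is why, on such data, logistic regression's model class contains the Bayes-optimal predictor.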

Conclusion

- Bregman divergences have had a rich history outside of convex optimisation, where they were introduced (Bregman, 1967).
- They are the canonical distortions on the manifold of parameters of exponential families in information geometry (Amari & Nagaoka, 2000), and they have been introduced in normative economics in several contexts (Magdalou & Nock, 2011; Shorrocks, 1980).
- The authors propose in this paper a more general approach, grounded in a general Bregman formulation of differentiable proper canonical losses.

- Table 1: Test set AUC of various methods on binary classification datasets; see text for details.

Related Work

- Our problem of interest is learning not only a classifier, but also the loss function itself. A minimal requirement for the loss to be useful is that it is proper, i.e., that it preserves the Bayes classification rule. Constraining our loss to this set ensures standard guarantees on classification performance using this loss, e.g., via surrogate regret bounds.

Evidently, when choosing amongst losses, we must have a well-defined objective. We now reinterpret an algorithm of Kakade et al. (2011) as providing such an objective.

The SLIsotron algorithm. Kakade et al. (2011) considered the problem of learning a class-probability model of the form Pr(Y = 1 | x) = u(w∗ᵀx), where u(·) is a 1-Lipschitz, non-decreasing function and w∗ ∈ ℝᵈ is a fixed vector. They proposed SLIsotron, an iterative algorithm that alternates between gradient steps to estimate w∗ and nonparametric isotonic regression steps to estimate u. SLIsotron provably bounds the expected square loss ℓ^ψ_sq(S, h) = 𝔼_{x∼S} 𝔼_{y∼S}[(y − ψ⁻¹(h(x)))² | x].
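This alternation can be sketched as follows (an illustrative reimplementation under simplifying assumptions: plain isotonic regression via pool-adjacent-violators in place of the Lipschitz-constrained variant used by SLIsotron, and a fixed iteration count rather than the original stopping rule):

```python
import numpy as np

def pav(y):
    """Pool-adjacent-violators: best non-decreasing fit to y under squared error."""
    vals, wts = [], []
    for v in y:
        vals.append(float(v))
        wts.append(1.0)
        while len(vals) > 1 and vals[-2] > vals[-1]:
            w = wts[-1] + wts[-2]
            vals[-2] = (vals[-1] * wts[-1] + vals[-2] * wts[-2]) / w
            wts[-2] = w
            del vals[-1], wts[-1]
    out = np.empty(len(y))
    pos = 0
    for v, w in zip(vals, wts):
        k = int(round(w))
        out[pos:pos + k] = v
        pos += k
    return out

def slisotron_sketch(X, y, iters=50):
    """Alternate an additive update of w with an isotonic fit of u (cf. SLIsotron)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        z = X @ w
        order = np.argsort(z)
        u_of_z = np.empty(n)
        u_of_z[order] = pav(y[order])    # isotonic regression step: estimate u
        w = w + (y - u_of_z) @ X / n     # gradient-style step: estimate w
    z = X @ w
    order = np.argsort(z)
    u_sorted = pav(y[order])
    # predictor: interpolate the learned monotone transfer function u
    return lambda X_new: np.interp(X_new @ w, z[order], u_sorted)

# toy check: data drawn from the assumed model Pr(Y = 1 | x) = u(w* . x), u = sigmoid
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
w_star = np.array([1.0, -2.0, 0.5])
p_true = 1 / (1 + np.exp(-X @ w_star))
y = (rng.random(400) < p_true).astype(float)
predict = slisotron_sketch(X, y)
mse = float(np.mean((predict(X) - p_true) ** 2))
```

Because u is estimated nonparametrically from ranked scores, the predictor is automatically monotone in w·x, which is the structural assumption the guarantee rests on.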

References

- Amari, S.-I. and Nagaoka, H. Methods of Information Geometry. Oxford University Press, 2000.
- Auer, P., Herbster, M., and Warmuth, M. Exponentially many local minima for single neurons. In NIPS*8, pp. 316–322, 1995.
- Azoury, K. S. and Warmuth, M. K. Relative loss bounds for on-line density estimation with the exponential family of distributions. MLJ, 43(3):211–246, 2001.
- Banerjee, A., Merugu, S., Dhillon, I., and Ghosh, J. Clustering with Bregman divergences. In Proc. of the 4th SIAM International Conference on Data Mining, pp. 234–245, 2004.
- Banerjee, A., Guo, X., and Wang, H. On the optimality of conditional expectation as a Bregman predictor. IEEE Trans. IT, 51:2664–2669, 2005.
- Bartlett, P.-L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3:463–482, 2002.
- Boissonnat, J.-D., Nielsen, F., and Nock, R. Bregman voronoi diagrams. DCG, 44(2):281–307, 2010.
- Bousquet, O., Kane, D., and Moran, S. The optimal approximation factor in density estimation. In COLT’19, pp. 318–341, 2019.
- Boyd, S. and Vandenberghe, L. Convex optimization. Cambridge University Press, 2004.
- Bregman, L. M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comp. Math. and Math. Phys., 7:200–217, 1967.
- Buja, A., Stuetzle, W., and Shen, Y. Loss functions for binary class probability estimation and classification: structure and applications, 2005. Technical Report, University of Pennsylvania.
- Cranko, Z., Menon, A.-K., Nock, R., Ong, C. S., Shi, Z., and Walder, C.-J. Monge blunts Bayes: Hardness results for adversarial training. In 36th ICML, pp. 1406–1415, 2019.
- de Finetti, B. Rôle et domaine d’application du théorème de Bayes selon les différents points de vue sur les probabilités (in French). In 18th International Congress on the Philosophy of Sciences, pp. 67–82, 1949.
- Fleischner, H. The square of every two-connected graph is Hamiltonian. Journal of Combinatorial Theory, Series B, 16:29–34, 1974.
- Grabocka, J., Scholz, R., and Schmidt-Thieme, L. Learning surrogate losses. CoRR, abs/1905.10108, 2019.
- Gross, J.-L. and Yellen, J. Handbook of graph theory. CRC press, 2004. ISBN 1-58488-090-2.
- Helmbold, D.-P., Kivinen, J., and Warmuth, M.-K. Worst-case loss bounds for single neurons. In NIPS*8, pp. 309–315, 1995.
- Herbster, M. and Warmuth, M. Tracking the best regressor. In 9th COLT, pp. 24–31, 1998.
- Kakade, S., Kalai, A.-T., Kanade, V., and Shamir, O. Efficient learning of generalized linear and single index models with isotonic regression. In NIPS*24, pp. 927–935, 2011.
- Kearns, M. J. and Vazirani, U. V. An Introduction to Computational Learning Theory. M.I.T. Press, 1994.
- Liu, L., Wang, M., and Deng, J. UniLoss: Unified surrogate loss by adaptive interpolation. https://openreview.net/forum?id=ryegXAVKDB, 2019.
- Magdalou, B. and Nock, R. Income distributions and decomposable divergence measures. Journal of Economic Theory, 146(6):2440–2454, 2011.
- Mei, J. and Moura, J.-M.-F. SILVar: Single index latent variable models. IEEE Trans. Signal Processing, 66(11):2790–2803, 2018.
- Nock, R. and Nielsen, F. On the efficient minimization of classification-calibrated surrogates. In NIPS*21, pp. 1201–1208, 2008.
- Nock, R. and Nielsen, F. Bregman divergences and surrogates for learning. IEEE Trans.PAMI, 31:2048–2059, 2009.
- Nock, R. and Williamson, R.-C. Lossless or quantized boosting with integer arithmetic. In 36th ICML, pp. 4829–4838, 2019.
- Nock, R., Luosto, P., and Kivinen, J. Mixed Bregman clustering with approximation guarantees. In Proc. of the 19th ECML, pp. 154–169, 2008.
- Nock, R., Menon, A.-K., and Ong, C.-S. A scaled Bregman theorem with applications. In NIPS*29, pp. 19–27, 2016.
- Patrini, G., Nock, R., Rivera, P., and Caetano, T. (Almost) no label no cry. In NIPS*27, 2014.
- Reid, M.-D. and Williamson, R.-C. Composite binary losses. JMLR, 11:2387–2422, 2010.
- Savage, L.-J. Elicitation of personal probabilities and expectations. J. of the Am. Stat. Assoc., 66:783–801, 1971.
- Shorrocks, A.-F. The class of additively decomposable inequality measures. Econometrica, 48:613–625, 1980.
- Shuford, E., Albert, A., and Massengil, H.-E. Admissible probability measurement procedures. Psychometrika, pp. 125–145, 1966.
- Siahkamari, A., Saligrama, V., Castanon, D., and Kulis, B. Learning Bregman divergences. CoRR, abs/1905.11545, 2019.
- Streeter, M. Learning effective loss functions efficiently. CoRR, abs/1907.00103, 2019.
- Sypherd, T., Diaz, M., Laddha, H., Sankar, L., Kairouz, P., and Dasarathy, G. A tunable loss function for classification. CoRR, abs/1906.02314, 2019.
- Valiant, L. G. A theory of the learnable. Communications of the ACM, 27:1134–1142, 1984.
- Zhang, T. Statistical behaviour and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32:56–134, 2004.
