Supervised Learning: No Loss No Cry

ICML, pp. 7370-7380, 2020.


Abstract

Supervised learning requires the specification of a loss function to minimise. While the theory of admissible losses from both a computational and statistical perspective is well-developed, these offer a panoply of different choices. In practice, this choice is typically made in an ad hoc manner. In hopes of making this procedure …

Introduction
  • Efficient supervised learning essentially started with the PAC framework of Valiant (1984), in which the goal was to learn, in polynomial time, a function able to predict labels for i.i.d. inputs.
  • A less-known subtlety follows: a proper loss as commonly used for real-valued prediction, such as the square or logistic loss, involves an implicit canonical link (Reid & Williamson, 2010), a function that maps class probabilities to real values.
  • This is exemplified by the sigmoid link used in deep learning (a minimal sketch follows this list).
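To make the link/inverse-link relationship concrete, here is a minimal Python sketch, not taken from the paper: the canonical link of the logistic loss is the logit, and its inverse is the sigmoid used to squash real-valued scores into class probabilities in deep networks. The function names `logit_link` and `sigmoid` are illustrative.

```python
import numpy as np

def logit_link(p):
    """Canonical link of the logistic loss: maps a class probability
    p in (0, 1) to a real-valued prediction (a logit)."""
    return np.log(p / (1.0 - p))

def sigmoid(v):
    """Inverse canonical link: maps a real-valued prediction back to a
    class probability, as the sigmoid output does in deep learning."""
    return 1.0 / (1.0 + np.exp(-v))

p = 0.8
v = logit_link(p)                  # real-valued score, log(4) ≈ 1.386
assert np.isclose(sigmoid(v), p)   # the inverse link recovers the probability
```

Other proper losses induce other links; the square loss, for instance, has (up to scaling) the identity as its canonical link.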
Highlights
  • Efficient supervised learning essentially started with the PAC framework of Valiant (1984), in which the goal was to learn, in polynomial time, a function able to predict labels for i.i.d. inputs.
  • A zoo of losses has come into use for tractable machine learning (ML), the most popular being built from the square loss and the logistic loss.
  • Learning proper canonical losses: we focus on learning class probabilities over the full class of losses having the expression in (13), but with the requirement that we use the canonical link ψ ≐ −L′ (minus the derivative of the conditional Bayes risk), thereby imposing that we learn the loss as well via its link (see the worked example after this list).
  • Fitting a loss that complies with Bayes decision theory implies learning not just a classifier, but also the canonical link of a proper loss, and hence a proper canonical loss.
  • In a seminal 2011 work, Kakade et al. made, with the SlIsotron algorithm, the first attempt at solving this bigger picture of supervised learning.
  • We propose in this paper a more general approach, grounded in a general Bregman formulation of differentiable proper canonical losses.
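As background for this Bregman formulation (a standard identity for proper losses, e.g., Savage (1971); Reid & Williamson (2010), and not a reproduction of the paper's equation (13)): if L denotes the concave conditional Bayes risk of a proper loss, the pointwise regret of predicting a class probability η̂ when the true one is η equals the Bregman divergence D_{−L}(η, η̂). The Python sketch below checks this numerically for the log loss, whose generator −L is the negative binary entropy; all function names are illustrative.

```python
import numpy as np

def neg_entropy(u):
    """Convex generator F = -L for the log loss, where L is the
    (concave) conditional Bayes risk, i.e. the binary Shannon entropy."""
    return u * np.log(u) + (1.0 - u) * np.log(1.0 - u)

def neg_entropy_grad(u):
    """F'(u) = log(u / (1 - u)): also the canonical link of the log loss."""
    return np.log(u / (1.0 - u))

def bregman(F, gradF, p, q):
    """Bregman divergence D_F(p, q) = F(p) - F(q) - F'(q) (p - q)."""
    return F(p) - F(q) - gradF(q) * (p - q)

def log_loss_regret(eta, eta_hat):
    """Pointwise regret of the log loss: conditional risk at eta_hat
    minus the Bayes risk at the true class probability eta."""
    risk = -eta * np.log(eta_hat) - (1.0 - eta) * np.log(1.0 - eta_hat)
    bayes = -eta * np.log(eta) - (1.0 - eta) * np.log(1.0 - eta)
    return risk - bayes

eta, eta_hat = 0.7, 0.4
assert np.isclose(bregman(neg_entropy, neg_entropy_grad, eta, eta_hat),
                  log_loss_regret(eta, eta_hat))
```

This correspondence is what makes it possible to parameterise a proper canonical loss by a convex generator and learn it, rather than fixing it a priori.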
Results
  • The authors present experiments illustrating: (a) the viability of the BregmanTron as an alternative to classic GLM or SlIsotron learning.

    (b) the nature of the loss functions learned by the BregmanTron, which are potentially asymmetric.

    (c) the potential of using the loss function learned by the BregmanTron as input to some downstream learner.

    Predictive performance of the BregmanTron: the authors compare the BregmanTron, as a generic binary classification method, against the following baselines: logistic regression, GLMTron (Kakade et al., 2011) with u(·) the sigmoid, and SlIsotron.
  • The Bayes-optimal solution for Pr(Y = 1 | X) can be derived in this case: it takes the form of a sigmoid, as assumed by logistic regression, composed with a linear model whose weights are proportional to the expectation (class mean); see the derivation after this list.
  • In this case, logistic regression works on a search space that is much smaller than the BregmanTron's and that is guaranteed to contain the optimum.
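For concreteness, here is the standard derivation under one common synthetic setup, assumed here and possibly differing in detail from the paper's experiment: two equiprobable Gaussian classes with means ±μ and shared isotropic covariance σ²I.

```latex
% Assume X | Y = +1 ~ N(+\mu, \sigma^2 I), X | Y = -1 ~ N(-\mu, \sigma^2 I),
% with equal priors Pr(Y = +1) = Pr(Y = -1) = 1/2. By Bayes' rule,
\Pr(Y = 1 \mid X = x)
  = \frac{e^{-\|x - \mu\|^2 / (2\sigma^2)}}
         {e^{-\|x - \mu\|^2 / (2\sigma^2)} + e^{-\|x + \mu\|^2 / (2\sigma^2)}}
  = \frac{1}{1 + e^{-\frac{2}{\sigma^2}\,\mu^\top x}}
  = \mathrm{sigmoid}\!\left(\frac{2}{\sigma^2}\,\mu^\top x\right),
% i.e. a sigmoid composed with a linear model whose weight vector is
% proportional to the class mean \mu -- exactly the model class searched
% by logistic regression.
```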
Conclusion
  • Bregman divergences have had a rich history outside of convex optimisation, where they were introduced (Bregman, 1967).
  • They are the canonical distortions on the manifold of parameters of exponential families in information geometry (Amari & Nagaoka, 2000), and they have been introduced in normative economics in several contexts (Magdalou & Nock, 2011; Shorrocks, 1980).
  • The authors propose in this paper a more general approach, grounded in a general Bregman formulation of differentiable proper canonical losses.
Tables
  • Table 1: Test set AUC of various methods on binary classification datasets. See text for details.
Related Work
  • Our problem of interest is learning not only a classifier, but also the loss function itself. A minimal requirement for the loss to be useful is that it is proper, i.e., that it preserves the Bayes classification rule. Constraining our loss to this set ensures standard guarantees on the classification performance obtained using this loss, e.g., via surrogate regret bounds.

    Evidently, when choosing amongst losses, we must have a well-defined objective. We now reinterpret an algorithm of Kakade et al. (2011) as providing such an objective.

    The SlIsotron algorithm. Kakade et al. (2011) considered the problem of learning a class-probability model of the form Pr(Y = 1 | x) = u(w∗ · x), where u(·) is a 1-Lipschitz, non-decreasing function and w∗ ∈ R^d is a fixed vector. They proposed SlIsotron, an iterative algorithm that alternates between gradient steps to estimate w∗ and nonparametric isotonic regression steps to estimate u. SlIsotron provably bounds the expected square loss, i.e., ℓ^ψ_sq(S, h) = E_{x∼S} E_{y∼S}[(y − ψ⁻¹(h(x)))² | x]. (A simplified sketch of the alternating scheme follows.)
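To make the alternating structure concrete, here is a simplified, hypothetical sketch of a SlIsotron-style update loop. It is not the algorithm of Kakade et al. (2011) verbatim: the original uses Lipschitz isotonic regression and its own step and normalisation choices, whereas plain isotonic regression from scikit-learn is substituted here, and the function name `slisotron_like` is illustrative.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def slisotron_like(X, y, n_iters=20):
    """Alternate between a gradient-style update of the weight vector w
    and a nonparametric, non-decreasing refit of the transfer function u.

    Simplified sketch only: the actual SlIsotron uses Lipschitz isotonic
    regression and its own normalisation of w.
    """
    n, d = X.shape
    w = np.zeros(d)
    u_hat = lambda s: np.clip(s, 0.0, 1.0)  # crude initial guess for u
    for _ in range(n_iters):
        scores = X @ w
        # Gradient-style step on w, driven by the residuals y - u(w.x).
        w = w + (X.T @ (y - u_hat(scores))) / n
        # Refit u as a non-decreasing function of the new scores.
        scores = X @ w
        iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
        iso.fit(scores, y)
        u_hat = iso.predict
    return w, u_hat

# Tiny usage example on synthetic data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
p = 1.0 / (1.0 + np.exp(-X @ np.array([1.0, -2.0, 0.5])))
y = (rng.random(200) < p).astype(float)
w, u_hat = slisotron_like(X, y)
```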
References
  • Amari, S.-I. and Nagaoka, H. Methods of Information Geometry. Oxford University Press, 2000.
  • Auer, P., Herbster, M., and Warmuth, M. Exponentially many local minima for single neurons. In NIPS*8, pp. 316–322, 1995.
  • Azoury, K. S. and Warmuth, M. K. Relative loss bounds for on-line density estimation with the exponential family of distributions. MLJ, 43(3):211–246, 2001.
  • Banerjee, A., Merugu, S., Dhillon, I., and Ghosh, J. Clustering with Bregman divergences. In Proc. of the 4th SIAM International Conference on Data Mining, pp. 234–245, 2004.
  • Banerjee, A., Guo, X., and Wang, H. On the optimality of conditional expectation as a Bregman predictor. IEEE Trans. IT, 51:2664–2669, 2005.
  • Bartlett, P.-L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3:463–482, 2002.
  • Boissonnat, J.-D., Nielsen, F., and Nock, R. Bregman Voronoi diagrams. DCG, 44(2):281–307, 2010.
  • Bousquet, O., Kane, D., and Moran, S. The optimal approximation factor in density estimation. In COLT'19, pp. 318–341, 2019.
  • Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.
  • Bregman, L. M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comp. Math. and Math. Phys., 7:200–217, 1967.
  • Buja, A., Stuetzle, W., and Shen, Y. Loss functions for binary class probability estimation and classification: structure and applications. Technical Report, University of Pennsylvania, 2005.
  • Cranko, Z., Menon, A.-K., Nock, R., Ong, C. S., Shi, Z., and Walder, C.-J. Monge blunts Bayes: Hardness results for adversarial training. In 36th ICML, pp. 1406–1415, 2019.
  • de Finetti, B. Rôle et domaine d'application du théorème de Bayes selon les différents points de vue sur les probabilités (in French). In 18th International Congress on the Philosophy of Sciences, pp. 67–82, 1949.
  • Fleischner, H. The square of every two-connected graph is Hamiltonian. Journal of Combinatorial Theory, Series B, 16:29–34, 1974.
  • Grabocka, J., Scholz, R., and Schmidt-Thieme, L. Learning surrogate losses. CoRR, abs/1905.10108, 2019.
  • Gross, J.-L. and Yellen, J. Handbook of Graph Theory. CRC Press, 2004. ISBN 1-58488-090-2.
  • Helmbold, D.-P., Kivinen, J., and Warmuth, M.-K. Worst-case loss bounds for single neurons. In NIPS*8, pp. 309–315, 1995.
  • Herbster, M. and Warmuth, M. Tracking the best regressor. In 9th COLT, pp. 24–31, 1998.
  • Kakade, S., Kalai, A.-T., Kanade, V., and Shamir, O. Efficient learning of generalized linear and single index models with isotonic regression. In NIPS*24, pp. 927–935, 2011.
  • Kearns, M. J. and Vazirani, U. V. An Introduction to Computational Learning Theory. M.I.T. Press, 1994.
  • Liu, L., Wang, M., and Deng, J. UniLoss: Unified surrogate loss by adaptive interpolation. https://openreview.net/forum?id=ryegXAVKDB, 2019.
  • Magdalou, B. and Nock, R. Income distributions and decomposable divergence measures. Journal of Economic Theory, 146(6):2440–2454, 2011.
  • Mei, J. and Moura, J.-M.-F. SILVar: Single index latent variable models. IEEE Trans. Signal Processing, 66(11):2790–2803, 2018.
  • Nock, R. and Nielsen, F. On the efficient minimization of classification-calibrated surrogates. In NIPS*21, pp. 1201–1208, 2008.
  • Nock, R. and Nielsen, F. Bregman divergences and surrogates for learning. IEEE Trans. PAMI, 31:2048–2059, 2009.
  • Nock, R. and Williamson, R.-C. Lossless or quantized boosting with integer arithmetic. In 36th ICML, pp. 4829–4838, 2019.
  • Nock, R., Luosto, P., and Kivinen, J. Mixed Bregman clustering with approximation guarantees. In Proc. of the 19th ECML, pp. 154–169, 2008.
  • Nock, R., Menon, A.-K., and Ong, C.-S. A scaled Bregman theorem with applications. In NIPS*29, pp. 19–27, 2016.
  • Patrini, G., Nock, R., Rivera, P., and Caetano, T. (Almost) no label no cry. In NIPS*27, 2014.
  • Reid, M.-D. and Williamson, R.-C. Composite binary losses. JMLR, 11:2387–2422, 2010.
  • Savage, L.-J. Elicitation of personal probabilities and expectations. J. of the Am. Stat. Assoc., 66:783–801, 1971.
  • Shorrocks, A.-F. The class of additively decomposable inequality measures. Econometrica, 48:613–625, 1980.
  • Shuford, E., Albert, A., and Massengil, H.-E. Admissible probability measurement procedures. Psychometrika, pp. 125–145, 1966.
  • Siahkamari, A., Saligrama, V., Castanon, D., and Kulis, B. Learning Bregman divergences. CoRR, abs/1905.11545, 2019.
  • Streeter, M. Learning effective loss functions efficiently. CoRR, abs/1907.00103, 2019.
  • Sypherd, T., Diaz, M., Laddha, H., Sankar, L., Kairouz, P., and Dasarathy, G. A tunable loss function for classification. CoRR, abs/1906.02314, 2019.
  • Valiant, L. G. A theory of the learnable. Communications of the ACM, 27:1134–1142, 1984.
  • Zhang, T. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32:56–134, 2004.