Rankmax: An Adaptive Projection Alternative to the Softmax Function

Weiwei Kong
Nicolas E. Mayoraz

NeurIPS 2020.


Abstract:

Many machine learning models involve mapping a score vector to a probability vector. Usually, this is done by projecting the score vector onto a probability simplex, and such projections are often characterized as Lipschitz continuous approximations of the argmax function, whose Lipschitz constant is controlled by a parameter that is similar…

Introduction
  • The goal of many machine learning models, such as multi-class classification or retrieval, is to learn a conditional probability distribution
  • Such models often involve projecting a vector onto the probability simplex, and a general form of such a projection is given by pα(z) = argmin_{x ∈ ∆^{n−1}} g(x) − α⟨x, z⟩, where g is a convex regularizer.
  • Problem (1) has been studied extensively
  • It reduces to the Euclidean projection [27, 12] when g is the squared Euclidean norm, and the entropy projection [10, 5, 3] when g is the negative entropy.
  • When g is the negative entropy, pα is the widely used softmax function (a minimal sketch of both special cases follows this list)
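
To make the two special cases concrete, here is a minimal NumPy sketch of the α-parameterized entropy projection (softmax) and the Euclidean projection onto the simplex (sparsemax, computed with the standard sort-based algorithm of [27, 12]). The function names and the convention of scaling the scores by α are illustrative assumptions, not the authors' reference implementation.

    import numpy as np

    def softmax(z, alpha=1.0):
        # Entropy projection: p_alpha(z) with g = negative entropy.
        s = alpha * np.asarray(z, dtype=float)
        s -= s.max()                           # shift for numerical stability
        e = np.exp(s)
        return e / e.sum()

    def sparsemax(z, alpha=1.0):
        # Euclidean projection of alpha*z onto the simplex (g = squared Euclidean norm).
        v = alpha * np.asarray(z, dtype=float)
        u = np.sort(v)[::-1]                   # scores in decreasing order
        css = np.cumsum(u)
        k = np.arange(1, v.size + 1)
        rho = k[u + (1.0 - css) / k > 0][-1]   # size of the support
        tau = (css[rho - 1] - 1.0) / rho       # threshold subtracted from the scores
        return np.maximum(v - tau, 0.0)

    z = np.array([2.0, 1.0, 0.1])
    print(softmax(z, alpha=5.0))    # dense; approaches argmax as alpha grows
    print(sparsemax(z, alpha=5.0))  # sparse: low-scoring entries are exactly zero

Both projections approach the argmax vector as α grows; the Euclidean projection additionally produces exact zeros, which is the sparsity of support highlighted below.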
Highlights
  • The goal of many machine learning models, such as multi-class classification or retrieval, is to learn a conditional probability distribution
  • We derived an adaptive Euclidean projection method motivated by multi-label classification problems
  • Under cross-entropy loss, Rankmax is closely related to the pairwise losses as discussed in Section 3.3
  • While pairwise losses do not immediately fit into the projection framework of equation (1), this connection suggests that they may be closely related, and we believe this merits further investigation
  • The resulting method exhibits desirable properties, such as sparsity of its support and numerically efficient implementation, and we find that it significantly outperforms competing non-adaptive projection methods
  • While we focused our discussion on the cross-entropy loss, the Rankmax projection can be used with other losses
Results
  • The resulting method exhibits desirable properties, such as sparsity of its support and numerically efficient implementation, and the authors find that it significantly outperforms competing non-adaptive projection methods.
  • On MovieLens 20M, Rankmax produces a 15% improvement over Softmax and an 8% improvement over Sparsemax across a range of learning rates
Conclusion
  • The authors derived an adaptive Euclidean projection method motivated by multi-label classification problems.
  • The method adapts the parameter α to individual training examples and shows good empirical performance (a purely illustrative sketch of per-example adaptivity follows this list).
  • Under cross-entropy loss, Rankmax is closely related to the pairwise losses as discussed in Section 3.3.
  • While pairwise losses do not immediately fit into the projection framework of equation (1), this connection suggests that they may be closely related, and the authors believe this merits further investigation.
  • Combining the adaptivity of Rankmax with the Fenchel-Young losses [7, 8] is an interesting direction for future work
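
As a purely hypothetical illustration of what adapting α per training example can look like (reusing the sparsemax sketch from the Introduction), the snippet below scans a small grid of α values from sharpest to smoothest and keeps the first projection whose support contains the example's positive label. The grid and the selection rule are assumptions made for illustration only; they are not the adaptive rule derived in the paper.

    def adaptive_projection(z, positive, alphas=(8.0, 4.0, 2.0, 1.0, 0.5)):
        # Hypothetical per-example adaptivity: prefer the sharpest alpha that still
        # assigns nonzero probability to this example's positive label.
        # NOT the Rankmax rule from the paper; for illustration only.
        for alpha in alphas:
            p = sparsemax(z, alpha)            # sparsemax as sketched above
            if p[positive] > 0:
                return p, alpha
        return p, alphas[-1]                   # fall back to the smoothest projection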
Tables
  • Table 1: Characteristics of datasets used
  • Table 2: Qualitative comparison of loss functions
References
  • [1] Brandon Amos, Vladlen Koltun, and J. Zico Kolter. The limited multi-label projection layer. arXiv preprint arXiv:1906.08707, 2019.
  • [2] Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31–45, 2014.
  • [3] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
  • [4] François Belletti, Karthik Lakshmanan, Walid Krichene, Nicolas Mayoraz, Yi-Fan Chen, John Anderson, Taylor Robie, Tayo Oguntebi, Dan Shirron, and Amit Bleiwess. Scaling up collaborative filtering data sets through randomized fractal expansions. arXiv preprint arXiv:1905.09874, 2019.
  • [5] Aharon Ben-Tal and Arkadi Nemirovski. Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications, volume 2. SIAM, 2001.
  • [6] Dimitris Bertsimas, John Tsitsiklis, et al. Simulated annealing. Statistical Science, 8(1):10–15, 1993.
  • [7] Mathieu Blondel, Andre Martins, and Vlad Niculae. Learning classifiers with Fenchel-Young losses: Generalized entropies, margins, and algorithms. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 606–615. PMLR, 2019.
  • [8] Mathieu Blondel, André F. T. Martins, and Vlad Niculae. Learning with Fenchel-Young losses. Journal of Machine Learning Research, 21(35):1–69, 2020.
  • [9] Tzuu-Shuh Chiang, Chii-Ruey Hwang, and Shuenn Jyi Sheu. Diffusion for global optimization in R^n. SIAM Journal on Control and Optimization, 25(3):737–753, 1987.
  • [10] Imre Csiszár. I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, pages 146–158, 1975.
  • [11] Marco Cuturi, Olivier Teboul, and Jean-Philippe Vert. Differentiable ranking and sorting using optimal transport. In Advances in Neural Information Processing Systems, pages 6861–6871, 2019.
  • [12] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272–279, 2008.
  • [13] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1321–1330, 2017.
  • [14] Bruce Hajek. Cooling schedules for optimal annealing. Mathematics of Operations Research, 13(2):311–329, 1988.
  • [15] F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2016.
  • [16] Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ICDM '08, pages 263–272, 2008.
  • [17] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. In International Conference on Learning Representations, 2017.
  • [18] Satyen Kale, Lev Reyzin, and Robert E. Schapire. Non-stochastic bandit slate problems. In Advances in Neural Information Processing Systems, pages 1054–1062, 2010.
  • [19] Scott Kirkpatrick, C. Daniel Gelatt, and Mario P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.
  • [20] Wouter M. Koolen, Manfred K. Warmuth, Jyrki Kivinen, et al. Hedging structured concepts. In COLT, pages 93–105, 2010.
  • [21] Walid Krichene, Syrine Krichene, and Alexandre Bayen. Efficient Bregman projections onto the simplex. In 2015 54th IEEE Conference on Decision and Control (CDC), pages 3291–3298. IEEE, 2015.
  • [22] Harold J. Kushner. Asymptotic global behavior for stochastic approximation and diffusions with slowly decreasing noise effects: Global minimization via Monte Carlo. SIAM Journal on Applied Mathematics, 47(1):169–185, 1987.
  • [23] Maksim Lapin, Matthias Hein, and Bernt Schiele. Top-k multiclass SVM. In Advances in Neural Information Processing Systems, pages 325–333, 2015.
  • [24] Maksim Lapin, Matthias Hein, and Bernt Schiele. Analysis and optimization of loss functions for multiclass, top-k, and multilabel classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(7):1533–1554, 2017.
  • [25] Cong Han Lim and Stephen J. Wright. Efficient Bregman projections onto the permutahedron and related polytopes. In Artificial Intelligence and Statistics, pages 1205–1213, 2016.
  • [26] Andre Martins and Ramon Astudillo. From softmax to sparsemax: A sparse model of attention and multi-label classification. In International Conference on Machine Learning, pages 1614–1623, 2016.
  • [27] Christian Michelot. A finite algorithm for finding the projection of a point onto the canonical simplex of R^n. Journal of Optimization Theory and Applications, 50(1):195–200, 1986.
  • [28] Debasis Mitra, Fabio Romeo, and Alberto Sangiovanni-Vincentelli. Convergence and finite-time behavior of simulated annealing. Advances in Applied Probability, 18(3):747–771, 1986.
  • [29] Arkadi Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.
  • [30] Ankit Singh Rawat, Jiecao Chen, Felix Xinnan X. Yu, Ananda Theertha Suresh, and Sanjiv Kumar. Sampled softmax with random Fourier features. In Advances in Neural Information Processing Systems, pages 13834–13844, 2019.
  • [31] Kenneth Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86(11):2210–2239, 1998.
  • [32] Andrew Rouditchenko, Hang Zhao, Chuang Gan, Josh McDermott, and Antonio Torralba. Self-supervised audio-visual co-segmentation. In ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2357–2361. IEEE, 2019.
  • [33] Zhao Song, Ron Parr, and Lawrence Carin. Revisiting the softmax Bellman operator: New benefits and new perspective. In International Conference on Machine Learning, pages 5916–5925, 2019.
  • [34] Michael J. Todd. On max-k-sums. Mathematical Programming, 171(1-2):489–517, 2018.
  • [35] Nicolas Usunier, David Buffoni, and Patrick Gallinari. Ranking with ordered weighted pairwise classification. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1057–1064, 2009.
  • [36] Manfred K. Warmuth and Dima Kuzmin. Randomized online PCA algorithms with regret bounds that are logarithmic in the dimension. Journal of Machine Learning Research, 9(Oct):2287–2320, 2008.
  • [37] Weiran Wang and Canyi Lu. Projection onto the capped simplex. arXiv preprint arXiv:1503.01002, 2015.
  • [38] Jason Weston, Samy Bengio, and Nicolas Usunier. WSABIE: Scaling up to large vocabulary image annotation. In Twenty-Second International Joint Conference on Artificial Intelligence, 2011.