Rankmax: An Adaptive Projection Alternative to the Softmax Function
NeurIPS 2020.
Abstract:
Many machine learning models involve mapping a score vector to a probability vector. Usually, this is done by projecting the score vector onto a probability simplex, and such projections are often characterized as Lipschitz continuous approximations of the argmax function, whose Lipschitz constant is controlled by a parameter that is similar…
Introduction
- The goal of many machine learning models, such as multi-class classification or retrieval, is to learn a conditional probability distribution
- Such models often involve projecting a vector onto the probability simplex; a general form of such a projection is pα(z) = argmin_{x ∈ ∆n−1} [ g(x) − α⟨z, x⟩ ]   (1), where g is a convex regularizer
- Problem (1) has been studied extensively
- It reduces to the Euclidean projection [27, 12] when g is the squared Euclidean norm, and the entropy projection [10, 5, 3] when g is the negative entropy.
- In the latter case, pα is the widely used softmax function (a minimal sketch of both instances follows this list)
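For concreteness, here is a minimal NumPy sketch of the two standard instances of equation (1) mentioned above: the entropy projection (softmax) and the Euclidean projection of αz onto the simplex (sparsemax). This is an illustration, not the paper's code; the function names and the scalar `alpha` argument are ours.

```python
import numpy as np

def softmax(z, alpha=1.0):
    """Entropy projection: g(x) = sum_i x_i log x_i gives the softmax of alpha*z."""
    s = alpha * z - np.max(alpha * z)   # shift scores for numerical stability
    e = np.exp(s)
    return e / e.sum()

def sparsemax(z, alpha=1.0):
    """Euclidean projection of alpha*z onto the probability simplex."""
    v = alpha * z
    u = np.sort(v)[::-1]                 # scores in decreasing order
    cssv = np.cumsum(u)
    k = np.arange(1, v.size + 1)
    support = u + (1.0 - cssv) / k > 0   # coordinates that stay in the support
    rho = k[support][-1]
    tau = (cssv[support][-1] - 1.0) / rho
    return np.maximum(v - tau, 0.0)

z = np.array([2.0, 1.0, -1.0])
print(softmax(z))    # dense: every class receives positive probability
print(sparsemax(z))  # sparse: [1., 0., 0.] for this score vector
```

Larger α pushes both outputs toward the argmax vertex; sparsemax reaches it at a finite α, while softmax only approaches it asymptotically.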
Highlights
- The goal of many machine learning models, such as multi-class classification or retrieval, is to learn a conditional probability distribution
- We derived an adaptive Euclidean projection method motivated by multi-label classification problems (see the illustrative sketch after this list)
- Under cross-entropy loss, Rankmax is closely related to the pairwise losses as discussed in Section 3.3
- While pairwise losses do not immediately fit into the projection framework of equation (1), this connection suggests that they may be closely related, and we believe this merits further investigation
- The resulting method exhibits desirable properties, such as sparsity of its support and numerically efficient implementation, and we find that it significantly outperforms competing non-adaptive projection methods
- While we focused our discussion on the cross-entropy loss, the Rankmax projection can be used with other losses
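The highlights above describe adapting α to each training example and obtaining a sparse support, but this summary does not state Rankmax's closed form. The sketch below is therefore only an illustration of a per-example adaptive, sparsemax-style projection (hinge weights relative to the positive label i, then normalization), not the paper's exact definition; the function name `adaptive_projection` is ours.

```python
import numpy as np

def adaptive_projection(z, i):
    """Illustrative adaptive projection for positive label i (an assumption,
    not the paper's exact Rankmax formula): classes scoring below z[i] - 1
    receive exactly zero probability, so the support adapts to the example."""
    w = np.maximum(1.0 + z - z[i], 0.0)  # pairwise hinge weights; w[i] == 1
    return w / w.sum()

z = np.array([1.2, 0.9, 0.4, -2.0])
p = adaptive_projection(z, i=1)
print(p)     # ~[0.464, 0.357, 0.179, 0.0]; the last class falls outside the support
print(p[1])  # positive-class probability, never larger than 1/rank of z[1]
```

Whatever the exact form, the qualitative behavior matches the bullets above: the output is sparse, it is computed in a single vectorized pass over the scores, and the effective temperature differs from example to example.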
Results
- The resulting method exhibits desirable properties, such as sparsity of its support and numerically efficient implementation, and the authors find that it significantly outperforms competing non-adaptive projection methods.
- On MovieLens 20M, Rankmax produces a 15% improvement over Softmax and an 8% improvement over Sparsemax across a range of learning rates
Conclusion
- The authors derived an adaptive Euclidean projection method motivated by multi-label classification problems.
- The method adapts the parameter α to individual training examples, and shows good empirical performance.
- Under cross-entropy loss, Rankmax is closely related to the pairwise losses as discussed in Section 3.3.
- While pairwise losses do not immediately fit into the projection framework of equation (1), this connection suggests that they may be closely related, and the authors believe this merits further investigation (a worked version of this connection is sketched after this list).
- Combining the adaptivity of Rankmax with the Fenchel-Young losses [7, 8] is an interesting direction for future work
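To make the pairwise-loss remark concrete, here is a short worked equation. It assumes the illustrative adaptive form sketched earlier, with weights proportional to (1 + z_j − z_i)_+; this is an assumption of this summary rather than the paper's stated formula. Under that form, cross-entropy on the projection becomes a log-tempered version of a pairwise hinge loss of the kind studied by Usunier et al.

```latex
% Assumption (not the paper's stated formula): p_i(z)_j \propto (1 + z_j - z_i)_+ .
% Since the weight of the positive class i equals 1, the cross-entropy loss is
\[
  -\log\,[p_i(z)]_i
  = \log \sum_{j} (1 + z_j - z_i)_+
  = \log\Big(1 + \sum_{j \neq i} (1 + z_j - z_i)_+\Big)
  \le \sum_{j \neq i} (1 + z_j - z_i)_+ ,
\]
% i.e. a damped (logarithmic) version of the pairwise hinge loss, using log(1 + t) <= t.
```

This is one way to read the claim that Rankmax under cross-entropy is closely related to pairwise losses; the precise correspondence is developed in Section 3.3 of the paper.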
Tables
- Table 1: Characteristics of datasets used
- Table 2: Qualitative comparison of loss functions
Reference
- Brandon Amos, Vladlen Koltun, and J. Zico Kolter. The limited multi-label projection layer. arXiv preprint arXiv:1906.08707, 2019.
- Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31–45, 2014.
- Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
- François Belletti, Karthik Lakshmanan, Walid Krichene, Nicolas Mayoraz, Yi-fan Chen, John Anderson, Taylor Robie, Tayo Oguntebi, Dan Shirron, and Amit Bleiwess. Scaling up collaborative filtering data sets through randomized fractal expansions. arXiv preprint arXiv:1905.09874, 2019.
- Aharon Ben-Tal and Arkadi Nemirovski. Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications. SIAM, 2001.
- Dimitris Bertsimas, John Tsitsiklis, et al. Simulated annealing. Statistical Science, 8(1):10–15, 1993.
- Mathieu Blondel, André Martins, and Vlad Niculae. Learning classifiers with Fenchel-Young losses: Generalized entropies, margins, and algorithms. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 606–615. PMLR, 2019.
- Mathieu Blondel, André F. T. Martins, and Vlad Niculae. Learning with Fenchel-Young losses. Journal of Machine Learning Research, 21(35):1–69, 2020.
- Tzuu-Shuh Chiang, Chii-Ruey Hwang, and Shuenn Jyi Sheu. Diffusion for global optimization in R^n. SIAM Journal on Control and Optimization, 25(3):737–753, 1987.
- Imre Csiszár. I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, pages 146–158, 1975.
- Marco Cuturi, Olivier Teboul, and Jean-Philippe Vert. Differentiable ranking and sorting using optimal transport. In Advances in Neural Information Processing Systems, pages 6861–6871, 2019.
- John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272–279, 2008.
- Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1321–1330. JMLR.org, 2017.
- Bruce Hajek. Cooling schedules for optimal annealing. Mathematics of Operations Research, 13(2):311–329, 1988.
- F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2016.
- Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ICDM ’08, pages 263–272, 2008.
- Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. In International Conference on Learning Representations, 2017.
- Satyen Kale, Lev Reyzin, and Robert E. Schapire. Non-stochastic bandit slate problems. In Advances in Neural Information Processing Systems, pages 1054–1062, 2010.
- Scott Kirkpatrick, C. Daniel Gelatt, and Mario P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.
- Wouter M. Koolen, Manfred K. Warmuth, Jyrki Kivinen, et al. Hedging structured concepts. In COLT, pages 93–105, 2010.
- Walid Krichene, Syrine Krichene, and Alexandre Bayen. Efficient Bregman projections onto the simplex. In 2015 54th IEEE Conference on Decision and Control (CDC), pages 3291–3298. IEEE, 2015.
- Harold J. Kushner. Asymptotic global behavior for stochastic approximation and diffusions with slowly decreasing noise effects: Global minimization via Monte Carlo. SIAM Journal on Applied Mathematics, 47(1):169–185, 1987.
- Maksim Lapin, Matthias Hein, and Bernt Schiele. Top-k multiclass SVM. In Advances in Neural Information Processing Systems, pages 325–333, 2015.
- Maksim Lapin, Matthias Hein, and Bernt Schiele. Analysis and optimization of loss functions for multiclass, top-k, and multilabel classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(7):1533–1554, 2017.
- Cong Han Lim and Stephen J. Wright. Efficient Bregman projections onto the permutahedron and related polytopes. In Artificial Intelligence and Statistics, pages 1205–1213, 2016.
- André Martins and Ramon Astudillo. From softmax to sparsemax: A sparse model of attention and multi-label classification. In International Conference on Machine Learning, pages 1614–1623, 2016.
- Christian Michelot. A finite algorithm for finding the projection of a point onto the canonical simplex of R^n. Journal of Optimization Theory and Applications, 50(1):195–200, 1986.
- Debasis Mitra, Fabio Romeo, and Alberto Sangiovanni-Vincentelli. Convergence and finite-time behavior of simulated annealing. Advances in Applied Probability, 18(3):747–771, 1986.
- Arkadi Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.
- Ankit Singh Rawat, Jiecao Chen, Felix Xinnan X. Yu, Ananda Theertha Suresh, and Sanjiv Kumar. Sampled softmax with random Fourier features. In Advances in Neural Information Processing Systems, pages 13834–13844, 2019.
- Kenneth Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86(11):2210–2239, 1998.
- Andrew Rouditchenko, Hang Zhao, Chuang Gan, Josh McDermott, and Antonio Torralba. Self-supervised audio-visual co-segmentation. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2357–2361. IEEE, 2019.
- Zhao Song, Ron Parr, and Lawrence Carin. Revisiting the softmax Bellman operator: New benefits and new perspective. In International Conference on Machine Learning, pages 5916–5925, 2019.
- Michael J. Todd. On max-k-sums. Mathematical Programming, 171(1-2):489–517, 2018.
- Nicolas Usunier, David Buffoni, and Patrick Gallinari. Ranking with ordered weighted pairwise classification. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1057–1064, 2009.
- Manfred K. Warmuth and Dima Kuzmin. Randomized online PCA algorithms with regret bounds that are logarithmic in the dimension. Journal of Machine Learning Research, 9(Oct):2287–2320, 2008.
- Weiran Wang and Canyi Lu. Projection onto the capped simplex. arXiv preprint arXiv:1503.01002, 2015.
- Jason Weston, Samy Bengio, and Nicolas Usunier. WSABIE: Scaling up to large vocabulary image annotation. In Twenty-Second International Joint Conference on Artificial Intelligence, 2011.