# Gradient Estimation with Stochastic Softmax Tricks

NeurIPS 2020.

Abstract:

The Gumbel-Max trick is the basis of many relaxed gradient estimators. These estimators are easy to implement and low variance, but the goal of scaling them comprehensively to large combinatorial distributions is still outstanding. Working within the perturbation model framework, we introduce stochastic softmax tricks, which generalize ...

Introduction

- Gradient computation is the methodological backbone of deep learning, but computing gradients is not always easy.
- Among relaxed gradient estimators, the Gumbel-Softmax estimator is the simplest; it continuously approximates the Gumbel-Max trick to admit a reparameterization gradient [37, 67, 71].
- This relaxation is used to optimize a “soft” approximation of the loss as a surrogate for the “hard” discrete objective.
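
The relationship between the hard Gumbel-Max sample and its Gumbel-Softmax relaxation can be sketched in a few lines. This is only an illustrative sketch using the Python standard library, not the authors' implementation; the function names, the example logits, and the temperature value are assumptions made for the example.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) sample by inverse transform sampling."""
    u = rng.random()
    while u == 0.0:  # guard against log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_max(logits, rng):
    """Hard sample: one-hot argmax of the Gumbel-perturbed logits."""
    perturbed = [l + sample_gumbel(rng) for l in logits]
    winner = max(range(len(logits)), key=lambda i: perturbed[i])
    return [1.0 if i == winner else 0.0 for i in range(len(logits))]


def gumbel_softmax(logits, temperature, rng):
    """Soft sample: softmax of the perturbed logits divided by a temperature."""
    perturbed = [(l + sample_gumbel(rng)) / temperature for l in logits]
    shift = max(perturbed)  # subtract the max for numerical stability
    weights = [math.exp(p - shift) for p in perturbed]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
logits = [1.0, 0.5, -1.0]  # hypothetical unnormalized log-probabilities
hard = gumbel_max(logits, rng)                            # one-hot vector
soft = gumbel_softmax(logits, temperature=0.5, rng=rng)   # point in the simplex
```

The hard sample is piecewise constant in the logits, so it carries no useful gradient, whereas the soft sample is a differentiable function of the logits for a fixed noise draw. Lower temperatures push the soft sample closer to a one-hot vector.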

Highlights

- Gradient computation is the methodological backbone of deep learning, but computing gradients is not always easy
- We address gradient estimation for discrete distributions with an emphasis on latent variable models
- We introduce stochastic softmax tricks (SSTs), which are a unified framework for designing structured relaxations of combinatorial distributions
- Relaxed gradient estimators assume that L is differentiable and use a change of variables to remove the dependence of pθ on θ, known as the reparameterization trick [37, 67]
- The Gumbel-Softmax trick (GST) [52, 35] is a simple relaxed gradient estimator for one-hot embeddings, which is based on the Gumbel-Max trick (GMT) [51, 53]
- We introduced stochastic softmax tricks, which are random convex programs that capture a large class of relaxed distributions over structured, combinatorial spaces
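
For orientation, the step from a hard combinatorial sample to a "random convex program" can be written schematically as below. The notation is an assumption chosen for this sketch and may differ from the paper's: $\mathcal{X}$ is a finite set of structured embeddings, $U$ is a random utility vector whose distribution depends on $\theta$, $t > 0$ is a temperature, and $f$ is a strongly convex regularizer.

```latex
\begin{align}
X   &= \arg\max_{x \in \mathcal{X}} \; U^\top x
    && \text{(hard sample: stochastic argmax trick)} \\
X_t &= \arg\max_{x \in \operatorname{conv}(\mathcal{X})} \; U^\top x - t\, f(x)
    && \text{(relaxed sample: stochastic softmax trick)}
\end{align}
```

When $\mathcal{X}$ is the set of one-hot vectors, $U_i = \theta_i + G_i$ with i.i.d. Gumbel noise $G_i$, and $f(x) = \sum_i x_i \log x_i$, the second program has the closed form $X_t = \operatorname{softmax}(U / t)$, which is the Gumbel-Softmax trick.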

Methods

- The authors' goal in these experiments was to evaluate the use of SSTs for learning distributions over structured latent spaces in deep structured models.
- For NRI, the authors implemented the standard single-loss-evaluation score function estimators (REINFORCE [82] and NVIL [59]), but struggled to achieve competitive results, see App. C.
- All SST models were trained with the “soft” SST and evaluated with the “hard” SMT (stochastic argmax trick).
- The authors selected models on a validation set according to the best objective value obtained during training.
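
As a point of reference for the score function baselines mentioned above, a single-sample REINFORCE estimate for a small categorical distribution might look like the following. This is a minimal sketch, not the authors' experimental setup: the logits and the loss are hypothetical, and NVIL's learned baseline is omitted.

```python
import math
import random


def softmax(logits):
    shift = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - shift) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def reinforce_gradient(logits, loss, rng):
    """Single-sample score function estimate of d/d(logits) E[loss(X)]."""
    probs = softmax(logits)
    # Sample a category index from the categorical distribution.
    x = rng.choices(range(len(logits)), weights=probs)[0]
    # For a softmax parameterization, grad log p(x) = onehot(x) - probs.
    return [loss(x) * ((1.0 if i == x else 0.0) - probs[i])
            for i in range(len(logits))]


def exact_gradient(logits, loss):
    """Closed-form gradient, available here only because the space is tiny."""
    probs = softmax(logits)
    expected = sum(p * loss(i) for i, p in enumerate(probs))
    return [probs[i] * (loss(i) - expected) for i in range(len(logits))]


rng = random.Random(0)
logits = [0.2, -0.3, 0.1]        # hypothetical parameters
loss = lambda x: float(x) ** 2   # hypothetical loss on the category index

num_samples = 50000
average = [0.0] * len(logits)
for _ in range(num_samples):
    estimate = reinforce_gradient(logits, loss, rng)
    average = [a + e / num_samples for a, e in zip(average, estimate)]
exact = exact_gradient(logits, loss)
```

The estimator only needs loss evaluations at sampled points, so it does not require the loss to be differentiable. Averaging many single-sample estimates approaches the exact gradient, but each individual estimate can be noisy, which is the usual motivation for baselines and relaxed alternatives.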

Conclusion

- The authors introduced stochastic softmax tricks, which are random convex programs that capture a large class of relaxed distributions over structured, combinatorial spaces.
- The authors designed stochastic softmax tricks for subset selection and a variety of spanning tree distributions.
- The authors tested their use in deep latent variable models, and found that they can be used to improve performance and to encourage the unsupervised discovery of true latent structure.
- Some combinatorial objects might benefit from a more careful design of the utility distribution, while others, e.g., matchings, are still waiting to have their tricks designed.

Summary


- Table 1
- Table 2: Matching ground truth structure (non-tree → tree) improves performance on ListOps. The utility distribution impacts performance. Test task accuracy and structure recovery metrics are shown for models selected on validation task accuracy.
- Table 3
- Table 4: For k-subset selection on the appearance aspect, SSTs select subsets with high precision and outperform baseline relaxations. Test set MSE and subset precision are shown for models selected on validation MSE.
- Table 5: For k-subset selection on the palate aspect, SSTs tend to outperform baseline relaxations. Test set MSE and subset precision are shown for models selected on validation MSE.
- Table 6: For k-subset selection on the taste aspect, MSE and subset precision tend to be lower for all methods. This is because the taste rating is highly correlated with other ratings, making it difficult to identify subsets with high precision. SSTs achieve small improvements. Test set MSE and subset precision are shown for models selected on validation MSE.
- Table 7: NVIL and REINFORCE fail to achieve results competitive with their SST counterparts. Top |V| − 1 and Spanning Tree fail to learn edge structure for both REINFORCE and NVIL.

Related Work

- Here we review perturbation models (PMs) and relaxation methods more generally. SSTs are a subclass of PMs, which draw samples by optimizing a random objective. Perhaps the earliest example comes from Thurstonian ranking models [78], where a distribution over rankings is formed by sorting a vector of noisy scores. Perturb & MAP models [63, 33] were designed to approximate the Gibbs distribution over a combinatorial output space using low-order, additive Gumbel noise. Randomized Optimum models [76, 27] are the most general class, which includes non-additive noise distributions and non-linear objectives. Recent work [50] uses PMs to construct finite difference approximations of the gradient of the expected loss. It requires optimizing a non-linear objective over X, and making this applicable to our settings would require significant innovation.
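
The idea of forming a distribution over rankings by sorting noisy scores can be illustrated as follows. This is a sketch under stated assumptions: the scores are hypothetical, and Gumbel noise is chosen for the illustration (with Gumbel noise the induced ranking distribution is the Plackett-Luce model), whereas Thurstone's original formulation used Gaussian noise.

```python
import math
import random


def sample_ranking(scores, rng):
    """Sample a ranking by sorting scores perturbed with Gumbel noise."""
    noisy = []
    for s in scores:
        u = rng.random()
        while u == 0.0:  # guard against log(0)
            u = rng.random()
        noisy.append(s - math.log(-math.log(u)))
    # Item indices listed from the highest to the lowest noisy score.
    return sorted(range(len(scores)), key=lambda i: noisy[i], reverse=True)


rng = random.Random(0)
scores = [2.0, 0.0, -1.0]  # hypothetical item scores
ranking = sample_ranking(scores, rng)  # a permutation of [0, 1, 2]
```

Sampling here reduces to a single call to a deterministic optimizer (a sort) applied to a random objective, which is the defining pattern of a perturbation model.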

Funding

- MBP gratefully acknowledges support from the Max Planck ETH Center for Learning Systems.

References

- Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mane, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viegas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv e-prints, page arXiv:1603.04467, March 2016.
- Ryan Prescott Adams and Richard S Zemel. Ranking via sinkhorn propagation. arXiv preprint arXiv:1106.1925, 2011.
- A. Agrawal, B. Amos, S. Barratt, S. Boyd, S. Diamond, and Z. Kolter. Differentiable convex optimization layers. In Advances in Neural Information Processing Systems, 2019.
- Akshay Agrawal, Shane Barratt, Stephen Boyd, Enzo Busseti, and Walaa M Moursi. Differentiating through a conic program. arXiv preprint arXiv:1904.09043, 2019.
- Brandon Amos. Differentiable optimization-based modeling for machine learning. PhD thesis, Carnegie Mellon University, 2019.
- Brandon Amos and J Zico Kolter. Optnet: Differentiable optimization as a layer in neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 136–145. JMLR.org, 2017.
- Brandon Amos, Vladlen Koltun, and J. Zico Kolter. The Limited Multi-Label Projection Layer. arXiv e-prints, page arXiv:1906.08707, June 2019.
- Søren Asmussen and Peter W Glynn. Stochastic simulation: algorithms and analysis, volume 57. Springer Science & Business Media, 2007.
- Michalis K. Titsias and Miguel Lázaro-Gredilla. Local expectation gradients for black box variational inference. In Advances in Neural Information Processing Systems, pages 2638–2646, 2015.
- Amir Beck. First-Order Methods in Optimization. SIAM, 2017.
- Quentin Berthet, Mathieu Blondel, Olivier Teboul, Marco Cuturi, Jean-Philippe Vert, and Francis Bach. Learning with Differentiable Perturbed Optimizers. arXiv e-prints, page arXiv:2002.08676, February 2020.
- Dimitris Bertsimas and John N Tsitsiklis. Introduction to linear optimization, volume 6. Athena Scientific Belmont, MA, 1997.
- Mathieu Blondel. Structured prediction with projection oracles. In Advances in Neural Information Processing Systems, pages 12145–12156, 2019.
- Mathieu Blondel, André FT Martins, and Vlad Niculae. Learning with fenchel-young losses. Journal of Machine Learning Research, 21(35):1–69, 2020.
- Mathieu Blondel, Olivier Teboul, Quentin Berthet, and Josip Djolonga. Fast differentiable sorting and ranking. arXiv preprint arXiv:2002.08871, 2020.
- James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. JAX: composable transformations of Python+NumPy programs, 2018.
- Jianbo Chen, Le Song, Martin Wainwright, and Michael Jordan. Learning to explain: An information-theoretic perspective on model interpretation. In International Conference on Machine Learning, 2018.
- Y.J. Chu and T. H. Liu. On the shortest arborescence of a directed graph. Scientia Sinica, 14:1396–1400, 1965.
- Caio Corro and Ivan Titov. Differentiable perturb-and-parse: Semi-supervised parsing with a structured variational autoencoder. In International Conference on Learning Representations, 2019.
- Josip Djolonga and Andreas Krause. Differentiable learning of submodular models. In Advances in Neural Information Processing Systems, pages 1013–1023, 2017.
- Justin Domke. Implicit differentiation by perturbation. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 523–531. Curran Associates, Inc., 2010.
- Justin Domke. Learning graphical model parameters with approximate marginal inference. IEEE transactions on pattern analysis and machine intelligence, 35(10):2454–2467, 2013.
- John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272–279, 2008.
- Jack Edmonds. Optimum branchings. Journal of Research of the National Bureau of Standards: Mathematics and mathematical physics. B, 71:233, 1967.
- Thomas MJ Fruchterman and Edward M Reingold. Graph drawing by force-directed placement. Software: Practice and experience, 21(11):1129–1164, 1991.
- Yarin Gal. Uncertainty in deep learning. PhD thesis, University of Cambridge, 2016.
- Andreea Gane, Tamir Hazan, and Tommi Jaakkola. Learning with maximum a-posteriori perturbation models. In Artificial Intelligence and Statistics, pages 247–256, 2014.
- Peter W Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84, 1990.
- Will Grathwohl, Dami Choi, Yuhuai Wu, Geoff Roeder, and David Duvenaud. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. In International Conference on Learning Representations, 2018.
- Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.
- Aditya Grover, Eric Wang, Aaron Zweig, and Stefano Ermon. Stochastic optimization of sorting networks via continuous relaxations. In International Conference on Learning Representations, 2019.
- Shixiang Gu, Sergey Levine, Ilya Sutskever, and Andriy Mnih. Muprop: Unbiased backpropagation for stochastic neural networks. In ICLR, 2016.
- Tamir Hazan and Tommi Jaakkola. On the partition function and random maximum a-posteriori perturbations. In International Conference on Machine Learning, 2012.
- Tamir Hazan, Subhransu Maji, and Tommi Jaakkola. On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations. In Advances in Neural Information Processing Systems, 2013.
- Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In International Conference on Learning Representations, 2016.
- Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.
- Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In International Conference on Learning Representations, 2014.
- Thomas Kipf, Ethan Fetaya, Kuan-Chieh Wang, Max Welling, and Richard Zemel. Neural relational inference for interacting systems. In International Conference on Machine Learning, 2018.
- Jon Kleinberg and Éva Tardos. Algorithm Design. Pearson Education, 2006.
- Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. 2009.
- Vladimir Kolmogorov. Convergent tree-reweighted message passing for energy minimization. IEEE transactions on pattern analysis and machine intelligence, 28(10):1568–1583, 2006.
- Terry Koo, Amir Globerson, Xavier Carreras, and Michael Collins. Structured prediction models via the matrix-tree theorem. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 141–150, Prague, Czech Republic, June 2007. Association for Computational Linguistics.
- Wouter Kool, Herke van Hoof, and Max Welling. Ancestral gumbel-top-k sampling for sampling without replacement. Journal of Machine Learning Research, 21(47):1–36, 2020.
- Wouter Kool, Herke van Hoof, and Max Welling. Estimating gradients for discrete random variables by sampling without replacement. In International Conference on Learning Representations, 2020.
- Joseph B Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical society, 7(1):48–50, 1956.
- Harold W Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83–97, 1955.
- Wonyeol Lee, Hangyeol Yu, and Hongseok Yang. Reparameterization gradient for nondifferentiable models. In Advances in Neural Information Processing Systems, pages 5553–5563, 2018.
- Tao Lei, Regina Barzilay, and Tommi Jaakkola. Rationalizing neural predictions. arXiv preprint arXiv:1606.04155, 2016.
- Jun Liu and Jieping Ye. Efficient euclidean projections in linear time. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 657–664, 2009.
- Guy Lorberbom, Andreea Gane, Tommi Jaakkola, and Tamir Hazan. Direct optimization through argmax for discrete variational auto-encoder. In Advances in Neural Information Processing Systems, pages 6200–6211, 2019.
- R Duncan Luce. Individual Choice Behavior: A Theoretical Analysis. New York: Wiley, 1959.
- Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations, 2017.
- Chris J Maddison, Daniel Tarlow, and Tom Minka. A* Sampling. In Advances in Neural Information Processing Systems, 2014.
- Andre Martins and Ramon Astudillo. From softmax to sparsemax: A sparse model of attention and multi-label classification. In International Conference on Machine Learning, pages 1614–1623, 2016.
- André FT Martins and Julia Kreutzer. Learning what’s easy: Fully differentiable neural easy-first taggers. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 349–362, 2017.
- Julian McAuley, Jure Leskovec, and Dan Jurafsky. Learning attitudes and attributes from multi-aspect reviews. In 2012 IEEE 12th International Conference on Data Mining, pages 1020–1025. IEEE, 2012.
- Gonzalo Mena, David Belanger, Scott Linderman, and Jasper Snoek. Learning latent permutations with gumbel-sinkhorn networks. In International Conference on Learning Representations, 2018.
- Elad Mezuman, Daniel Tarlow, Amir Globerson, and Yair Weiss. Tighter linear program relaxations for high order graphical models. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, pages 421–430, 2013.
- Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In International Conference on Machine Learning, 2014.
- Shakir Mohamed, Mihaela Rosca, Michael Figurnov, and Andriy Mnih. Monte Carlo Gradient Estimation in Machine Learning. arXiv e-prints, page arXiv:1906.10652, June 2019.
- Nikita Nangia and Samuel R Bowman. Listops: A diagnostic dataset for latent tree learning. arXiv preprint arXiv:1804.06028, 2018.
- Vlad Niculae, André FT Martins, Mathieu Blondel, and Claire Cardie. Sparsemap: Differentiable sparse structured inference. arXiv preprint arXiv:1802.04223, 2018.
- G. Papandreou and A. Yuille. Perturb-and-MAP Random Fields: Using Discrete Optimization to Learn and Sample from Energy Models. In International Conference on Computer Vision, 2011.
- Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
- Robin L Plackett. The analysis of permutations. Journal of the Royal Statistical Society: Series C (Applied Statistics), 24(2):193–202, 1975.
- Hoifung Poon and Pedro Domingos. Sum-product networks: A new deep architecture. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pages 689–690. IEEE, 2011.
- Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.
- R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970.
- R Tyrrell Rockafellar. Second-order convex analysis. J. Nonlinear Convex Anal, 1(1-16):84, 1999.
- Stephane Ross, Daniel Munoz, Martial Hebert, and J. Andrew Bagnell. Learning message-passing inference machines for structured prediction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
- Francisco JR Ruiz, Michalis K Titsias, and David M Blei. The generalized reparameterization gradient. In Advances in Neural Information Processing Systems, 2016.
- Alexander Schrijver. Combinatorial optimization: polyhedra and efficiency, volume 24. Springer Science & Business Media, 2003.
- Richard Sinkhorn and Paul Knopp. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2):343–348, 1967.
- Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
- Kevin Swersky, Ilya Sutskever, Daniel Tarlow, Richard S Zemel, Russ R Salakhutdinov, and Ryan P Adams. Cardinality restricted boltzmann machines. In Advances in neural information processing systems, pages 3293–3301, 2012.
- Daniel Tarlow, Ryan Adams, and Richard Zemel. Randomized optimum models for structured prediction. In Neil D. Lawrence and Mark Girolami, editors, Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22 of Proceedings of Machine Learning Research, pages 1221–1229, La Palma, Canary Islands, 21–23 Apr 2012. PMLR.
- Daniel Tarlow, Kevin Swersky, Richard S Zemel, Ryan P Adams, and Brendan J Frey. Fast exact inference for recursive cardinality models. In 28th Conference on Uncertainty in Artificial Intelligence, UAI 2012, pages 825–834, 2012.
- Louis L Thurstone. A law of comparative judgment. Psychological review, 34(4):273, 1927.
- George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. Rebar: Low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems, pages 2627–2636, 2017.
- William T. Tutte. Graph Theory. Addison-Wesley, 1984.
- Martin J Wainwright and Michael I Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.
- Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
- Philip Wolfe. Finding the nearest point in a polytope. Mathematical Programming, 11(1):128–149, 1976.
- Sang Michael Xie and Stefano Ermon. Reparameterizable subset sampling via continuous relaxations. In International Joint Conference on Artificial Intelligence, 2019.
- Mingzhang Yin and Mingyuan Zhou. ARM: Augment-REINFORCE-merge gradient for stochastic binary networks. In International Conference on Learning Representations, 2019.
