# Categorical Reparameterization with Gumbel-Softmax

ICLR (2017)

Abstract

Categorical variables are a natural choice for representing discrete structure in the world. However, stochastic neural networks rarely use categorical latent variables due to the inability to backpropagate through samples. In this work, we present an efficient gradient estimator that replaces the non-differentiable sample from a categori...

Introduction

- Stochastic neural networks with discrete random variables are a powerful technique for representing distributions encountered in unsupervised learning, language modeling, attention mechanisms, and reinforcement learning domains.
- Stochastic networks with discrete variables are difficult to train because the backpropagation algorithm — while permitting efficient computation of parameter gradients — cannot be applied to non-differentiable layers.
- The authors introduce Gumbel-Softmax, a continuous distribution on the simplex that can approximate categorical samples, and whose parameter gradients can be computed via the reparameterization trick.
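
As a rough illustration of the last bullet, the sketch below draws one Gumbel-Softmax sample with NumPy. It assumes the standard construction: add Gumbel(0, 1) noise to the log-probabilities, then apply a temperature-scaled softmax. The function name `sample_gumbel_softmax`, the argument name `tau`, and the example probabilities are illustrative choices, not taken from the authors' code.

```python
import numpy as np

def sample_gumbel_softmax(log_pi, tau, rng):
    """Draw one relaxed categorical sample on the probability simplex.

    log_pi: log-probabilities (may be unnormalized), shape (k,)
    tau:    softmax temperature, must be > 0 (hypothetical parameter name)
    rng:    numpy.random.Generator
    """
    g = rng.gumbel(loc=0.0, scale=1.0, size=log_pi.shape)  # Gumbel(0, 1) noise
    z = (log_pi + g) / tau
    z = z - z.max()                # subtract the max for numerical stability
    y = np.exp(z)
    return y / y.sum()

rng = np.random.default_rng(0)
pi = np.array([0.1, 0.6, 0.3])     # example class probabilities (assumed)
y = sample_gumbel_softmax(np.log(pi), tau=0.5, rng=rng)
print(y, y.sum())                  # non-negative entries that sum to 1
```

Every operation in the function is differentiable with respect to `log_pi`, which is what allows gradients to flow through the sample when the same computation is written in an automatic-differentiation framework.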

Highlights

- Prior work on stochastic gradient estimation has traditionally focused on either score function estimators augmented with Monte Carlo variance reduction techniques (Paisley et al, 2012; Mnih & Gregor, 2014; Gu et al, 2016; Gregor et al, 2013), or biased path derivative estimators for Bernoulli variables (Bengio et al, 2013)
- Each estimator is evaluated on two tasks: (1) structured output prediction and (2) variational training of generative models
- We show that Gumbel-Softmax and Straight-Through Gumbel-Softmax are effective on structured output prediction and variational autoencoder tasks, outperforming existing stochastic gradient estimators for both Bernoulli and categorical latent variables

Results

- In the first set of experiments, the authors compare Gumbel-Softmax and ST Gumbel-Softmax to other stochastic gradient estimators: Score-Function (SF), DARN, MuProp, Straight-Through (ST), and Slope-Annealed ST.
- Each estimator is evaluated on two tasks: (1) structured output prediction and (2) variational training of generative models.
- The authors use the MNIST dataset with fixed binarization for training and evaluation, which is common practice for evaluating stochastic gradient estimators (Salakhutdinov & Murray, 2008; Larochelle & Murray, 2011).
- Models were trained using stochastic gradient descent with momentum 0.9.

Conclusion

- The primary contribution of this work is the reparameterizable Gumbel-Softmax distribution, whose corresponding estimator affords low-variance path derivative gradients for the categorical distribution.
- The authors show that Gumbel-Softmax and Straight-Through Gumbel-Softmax are effective on structured output prediction and variational autoencoder tasks, outperforming existing stochastic gradient estimators for both Bernoulli and categorical latent variables.
- Gumbel-Softmax enables dramatic speedups in inference over discrete latent variables.
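
The Straight-Through variant named above can be sketched as follows, assuming the usual construction: the forward pass uses the one-hot arg max of a Gumbel-Softmax sample, while the backward pass substitutes the gradient of the continuous sample. NumPy has no automatic differentiation, so the softmax Jacobian is written out analytically here. The names `straight_through_step` and `grad_fn`, and the toy squared-error loss, are hypothetical and serve only as illustration.

```python
import numpy as np

def gumbel_softmax_and_jacobian(log_pi, tau, rng):
    """Return a relaxed sample y and the Jacobian dy_i / dlog_pi_j."""
    g = rng.gumbel(size=log_pi.shape)           # Gumbel(0, 1) noise
    z = (log_pi + g) / tau
    z = z - z.max()                             # numerical stability
    y = np.exp(z)
    y = y / y.sum()
    jac = (np.diag(y) - np.outer(y, y)) / tau   # softmax Jacobian scaled by 1/tau
    return y, jac

def straight_through_step(log_pi, tau, grad_fn, rng):
    """Forward with a one-hot sample, backward through the relaxed sample."""
    y, jac = gumbel_softmax_and_jacobian(log_pi, tau, rng)
    z_hard = np.zeros_like(y)
    z_hard[np.argmax(y)] = 1.0                  # discrete sample for the forward pass
    upstream = grad_fn(z_hard)                  # dLoss/dz evaluated at the one-hot sample
    grad_log_pi = jac.T @ upstream              # treat dz/dlog_pi as dy/dlog_pi
    return z_hard, grad_log_pi

rng = np.random.default_rng(0)
target = np.array([0.0, 1.0, 0.0])
# Toy loss f(z) = ||z - target||^2, so df/dz = 2 * (z - target).
z_hard, grad = straight_through_step(
    np.log(np.array([0.2, 0.5, 0.3])), 1.0, lambda z: 2.0 * (z - target), rng
)
print(z_hard, grad)
```

The resulting gradient is biased, because the forward value and the differentiated value differ, but the downstream computation only ever sees discrete one-hot vectors.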

Summary

- We introduce Gumbel-Softmax, a continuous distribution on the simplex that can approximate categorical samples, and whose parameter gradients can be computed via the reparameterization trick.
- We show experimentally that Gumbel-Softmax outperforms all single-sample gradient estimators on both Bernoulli variables and categorical variables.
- We show that this estimator can be used to efficiently train semi-supervised models (e.g. Kingma et al, 2014) without costly marginalization over unobserved categorical latent variables.
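- We use the softmax function as a continuous, differentiable approximation to arg max, and generate k-dimensional sample vectors $y \in \Delta^{k-1}$ where $y_i = \frac{\exp((\log \pi_i + g_i)/\tau)}{\sum_{j=1}^{k} \exp((\log \pi_j + g_j)/\tau)}$ for $i = 1, \dots, k$, with class probabilities $\pi_i$, i.i.d. Gumbel(0, 1) noise $g_i$, and softmax temperature $\tau$.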
- We refer to this procedure of replacing non-differentiable categorical samples with a differentiable approximation during training as the Gumbel-Softmax estimator.
- While Gumbel-Softmax samples are differentiable, they are not identical to samples from the corresponding categorical distribution for non-zero temperature.
- Gumbel-Softmax is a path derivative estimator for a continuous distribution y that approximates z.
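- Gumbel-Softmax allows us to backpropagate through y ∼ qφ(y|x) for single sample gradient estimation, and achieves a cost of O(D + I + G) per training step, where D, I, and G are the computational costs of sampling from qφ(y|x), qφ(z|x, y), and pθ(x|y, z), respectively. Marginalizing over all k values of y instead costs O(D + k(I + G)).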
- In our first set of experiments, we compare Gumbel-Softmax and ST Gumbel-Softmax to other stochastic gradient estimators: Score-Function (SF), DARN, MuProp, Straight-Through (ST), and Slope-Annealed ST.
- Samples drawn from the Gumbel-Softmax distribution are continuous during training, but are discretized to one-hot vectors during evaluation.
- As shown in Figure 3, ST Gumbel-Softmax is on par with the other estimators for Bernoulli variables and outperforms them on categorical variables.
- Gumbel-Softmax outperforms other estimators on both Bernoulli and Categorical variables.
- We train variational autoencoders (Kingma & Welling, 2013), where the objective is to learn a generative model of binary MNIST images.
- We use a learned categorical prior rather than a Gumbel-Softmax prior in the training objective.
- Gumbel-Softmax allows us to backpropagate directly through single samples from the joint qφ(y, z|x), achieving drastic speedups in training without compromising generative or classification performance.
- The primary contribution of this work is the reparameterizable Gumbel-Softmax distribution, whose corresponding estimator affords low-variance path derivative gradients for the categorical distribution.
- We show that Gumbel-Softmax and Straight-Through Gumbel-Softmax are effective on structured output prediction and variational autoencoder tasks, outperforming existing stochastic gradient estimators for both Bernoulli and categorical latent variables.
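
Two statements in the list above lend themselves to a quick numerical check: samples are not one-hot at non-zero temperature, and they are discretized to one-hot vectors at evaluation time. The sketch below assumes the standard Gumbel-Softmax construction (Gumbel(0, 1) noise added to log-probabilities, followed by a temperature-scaled softmax). The helper name `sample_gumbel_softmax_batch`, the example probabilities, and the chosen temperatures are illustrative assumptions.

```python
import numpy as np

def sample_gumbel_softmax_batch(log_pi, tau, n, rng):
    """Draw n relaxed categorical samples, returned with shape (n, k)."""
    g = rng.gumbel(size=(n, log_pi.shape[0]))   # Gumbel(0, 1) noise
    z = (log_pi + g) / tau
    z = z - z.max(axis=1, keepdims=True)        # numerical stability
    y = np.exp(z)
    return y / y.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
pi = np.array([0.1, 0.6, 0.3])                  # example class probabilities (assumed)
n = 20000

# Lower temperatures push samples toward the corners of the simplex.
mean_max = {}
for tau in (0.1, 1.0, 5.0):
    y = sample_gumbel_softmax_batch(np.log(pi), tau, n, rng)
    mean_max[tau] = y.max(axis=1).mean()
    print(f"tau={tau}: mean largest component = {mean_max[tau]:.3f}")

# Discretize to one-hot vectors, as described for evaluation time.
y = sample_gumbel_softmax_batch(np.log(pi), 1.0, n, rng)
one_hot = np.eye(len(pi))[y.argmax(axis=1)]
freq = one_hot.mean(axis=0)
print("arg max frequencies:", freq, "target:", pi)
```

Because the softmax preserves the ordering of its inputs, the arg max of a relaxed sample equals the arg max of the noisy log-probabilities, so the discretized samples should follow the categorical distribution `pi` at any temperature (the Gumbel-Max trick).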

- Table 1: The Gumbel-Softmax estimator outperforms other estimators on Bernoulli and Categorical latent variables. For the structured output prediction (SBN) task, numbers correspond to negative log-likelihoods (nats) of input images (lower is better). For the VAE task, numbers correspond to negative variational lower bounds (nats) on the log-likelihood (lower is better).
- Table 2: Marginalizing over y and single-sample variational inference perform equally well when applied to image classification on the binarized MNIST dataset (Larochelle & Murray, 2011). We report variational lower bounds and image classification accuracy for unlabeled data in the test set.

Related work

- In this section we review existing stochastic gradient estimation techniques for discrete variables (illustrated in Figure 2). Consider a stochastic computation graph (Schulman et al, 2015) with discrete random variable $z$ whose distribution depends on parameter $\theta$, and cost function $f(z)$. The objective is to minimize the expected cost $L(\theta) = \mathbb{E}_{z \sim p_\theta(z)}[f(z)]$ via gradient descent, which requires us to estimate $\nabla_\theta \mathbb{E}_{z \sim p_\theta(z)}[f(z)]$.
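
The objective above can be made concrete with a toy categorical variable. The sketch below uses the score-function (REINFORCE) identity, $\nabla_\theta \mathbb{E}[f(z)] = \mathbb{E}[f(z) \nabla_\theta \log p_\theta(z)]$, which is a standard result rather than something quoted in this summary, and compares the Monte Carlo estimate with the exact gradient. The softmax parameterization of $p_\theta$, the cost values, and all function names are assumptions made for the example.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def exact_gradient(theta, f_values):
    """Gradient of sum_i p_i * f_i with respect to theta, where p = softmax(theta)."""
    p = softmax(theta)
    return p * (f_values - p @ f_values)

def score_function_gradient(theta, f_values, n, rng):
    """Monte Carlo estimate of the same gradient from n discrete samples."""
    p = softmax(theta)
    z = rng.choice(len(p), size=n, p=p)          # z ~ p_theta
    score = np.eye(len(p))[z] - p                # grad of log p_theta(z), shape (n, k)
    return (f_values[z][:, None] * score).mean(axis=0)

rng = np.random.default_rng(0)
theta = np.array([0.5, -0.3, 0.1])               # example parameters (assumed)
f_values = np.array([1.0, 4.0, 2.0])             # cost f(z) for each category (assumed)
exact = exact_gradient(theta, f_values)
estimate = score_function_gradient(theta, f_values, 200000, rng)
print("exact:   ", exact)
print("estimate:", estimate)
```

The estimator is unbiased and never differentiates through the sample itself, but a large number of samples is needed here to get close to the exact gradient, which is the variance problem that motivates variance reduction techniques.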

3.1 PATH DERIVATIVE GRADIENT ESTIMATORS

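For distributions that are reparameterizable, we can compute the sample $z$ as a deterministic function $g$ of the parameters $\theta$ and an independent random variable $\epsilon$, so that $z = g(\theta, \epsilon)$. The path-wise gradients from $f$ to $\theta$ can then be computed without encountering any stochastic nodes:

$$\frac{\partial}{\partial\theta} \mathbb{E}_{z \sim p_\theta}[f(z)] = \frac{\partial}{\partial\theta} \mathbb{E}_{\epsilon}[f(g(\theta, \epsilon))] = \mathbb{E}_{\epsilon \sim p_\epsilon}\left[\frac{\partial f}{\partial g} \frac{\partial g}{\partial\theta}\right] \tag{4}$$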

For example, the normal distribution $z \sim \mathcal{N}(\mu, \sigma)$ can be rewritten as $\mu + \sigma \cdot \mathcal{N}(0, 1)$, making it trivial to compute $\partial z/\partial\mu$ and $\partial z/\partial\sigma$. This reparameterization trick is commonly applied to training variational autoencoders with continuous latent variables using backpropagation (Kingma & Welling, 2013; Rezende et al, 2014b). As shown in Figure 2, we exploit such a trick in the construction of the Gumbel-Softmax estimator.
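
The normal-distribution example can be checked numerically. The sketch below assumes a toy cost $f(z) = z^2$, chosen only because $\mathbb{E}[z^2] = \mu^2 + \sigma^2$ gives closed-form gradients $2\mu$ and $2\sigma$ to compare against. The function name `pathwise_gradients` and the values of `mu` and `sigma` are illustrative.

```python
import numpy as np

def pathwise_gradients(mu, sigma, n, rng):
    """Estimate d/dmu and d/dsigma of E[z^2] for z = mu + sigma * eps."""
    eps = rng.standard_normal(n)        # eps ~ N(0, 1), independent of mu and sigma
    z = mu + sigma * eps                # z = g(theta, eps)
    df_dz = 2.0 * z                     # derivative of the toy cost f(z) = z^2
    grad_mu = (df_dz * 1.0).mean()      # chain rule with dz/dmu = 1
    grad_sigma = (df_dz * eps).mean()   # chain rule with dz/dsigma = eps
    return grad_mu, grad_sigma

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8                    # example parameters (assumed)
grad_mu, grad_sigma = pathwise_gradients(mu, sigma, 200000, rng)
print("pathwise estimate:", grad_mu, grad_sigma)
print("closed form:      ", 2 * mu, 2 * sigma)
```

All randomness lives in `eps`, so the sample `z` is a deterministic, differentiable function of `mu` and `sigma`.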

Contributions

- Presents an efficient gradient estimator that replaces the non-differentiable sample from a categorical distribution with a differentiable sample from a novel Gumbel-Softmax distribution
- Introduces Gumbel-Softmax, a continuous distribution on the simplex that can approximate categorical samples, and whose parameter gradients can be computed via the reparameterization trick
- Shows experimentally that Gumbel-Softmax outperforms all single-sample gradient estimators on both Bernoulli variables and categorical variables
- Shows that this estimator can be used to efficiently train semi-supervised models (e.g. Kingma et al, 2014) without costly marginalization over unobserved categorical latent variables

Reference

- Y. Bengio, N. Leonard, and A. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
- Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. CoRR, abs/1606.03657, 2016.
- J. Chung, S. Ahn, and Y. Bengio. Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704, 2016.
- P. W Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84, 1990.
- A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwinska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, 2016.
- Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. CoRR, abs/1410.5401, 2014.
- K. Gregor, I. Danihelka, A. Mnih, C. Blundell, and D. Wierstra. Deep autoregressive networks. arXiv preprint arXiv:1310.8499, 2013.
- S. Gu, S. Levine, I. Sutskever, and A Mnih. MuProp: Unbiased Backpropagation for Stochastic Neural Networks. ICLR, 2016.
- E. J. Gumbel. Statistical theory of extreme values and some practical applications: a series of lectures. Number 33. US Govt. Print. Office, 1954.
- D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581–3589, 2014.
- H. Larochelle and I. Murray. The neural autoregressive distribution estimator. In AISTATS, volume 1, pp. 2, 2011.
- C. J. Maddison, D. Tarlow, and T. Minka. A* sampling. In Advances in Neural Information Processing Systems, pp. 3086–3094, 2014.
- C. J. Maddison, A. Mnih, and Y. Whye Teh. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. ArXiv e-prints, November 2016.
- A. Mnih and K. Gregor. Neural variational inference and learning in belief networks. ICML, 31, 2014.
- A. Mnih and D. J. Rezende. Variational inference for monte carlo objectives. arXiv preprint arXiv:1602.06725, 2016.
- J. Paisley, D. Blei, and M. Jordan. Variational Bayesian Inference with Stochastic Search. ArXiv e-prints, June 2012.
- Gabriel Pereyra, Geoffrey Hinton, George Tucker, and Lukasz Kaiser. Regularizing neural networks by penalizing confident output distributions. 2016.
- J. W Rae, J. J Hunt, T. Harley, I. Danihelka, A. Senior, G. Wayne, A. Graves, and T. P Lillicrap. Scaling Memory-Augmented Neural Networks with Sparse Reads and Writes. ArXiv e-prints, October 2016.
- T. Raiko, M. Berglund, G. Alain, and L. Dinh. Techniques for learning binary stochastic feedforward neural networks. arXiv preprint arXiv:1406.2989, 2014.
- D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014a.
- D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of The 31st International Conference on Machine Learning, pp. 1278–1286, 2014b.
- J. T. Rolfe. Discrete Variational Autoencoders. ArXiv e-prints, September 2016.
- R. Salakhutdinov and I. Murray. On the quantitative analysis of deep belief networks. In Proceedings of the 25th international conference on Machine learning, pp. 872–879. ACM, 2008.
- J. Schulman, N. Heess, T. Weber, and P. Abbeel. Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems, pp. 3528–3536, 2015.
- C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.
- R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
- K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. CoRR, abs/1502.03044, 2015.
