
Categorical Reparameterization with Gumbel-Softmax.

ICLR, 2017

Abstract

Categorical variables are a natural choice for representing discrete structure in the world. However, stochastic neural networks rarely use categorical latent variables due to the inability to backpropagate through samples. In this work, we present an efficient gradient estimator that replaces the non-differentiable sample from a categorical distribution with a differentiable sample from a novel Gumbel-Softmax distribution.

Introduction
  • Stochastic neural networks with discrete random variables are a powerful technique for representing distributions encountered in unsupervised learning, language modeling, attention mechanisms, and reinforcement learning domains.
  • Stochastic networks with discrete variables are difficult to train because the backpropagation algorithm — while permitting efficient computation of parameter gradients — cannot be applied to non-differentiable layers.
  • The authors introduce Gumbel-Softmax, a continuous distribution on the simplex that can approximate categorical samples, and whose parameter gradients can be computed via the reparameterization trick (a short sampling sketch follows below).
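
To ground the bullets above, here is a minimal NumPy sketch of the kind of non-differentiable sampling layer the estimator is designed to replace: a hard one-hot categorical sample drawn with the Gumbel-Max trick (add i.i.d. Gumbel(0, 1) noise to the log class probabilities and take the arg max). The arg max is the step that blocks backpropagation; the Gumbel-Softmax relaxation replaces it with a softmax. Function names and the example probabilities are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gumbel(shape, eps=1e-20):
    """i.i.d. Gumbel(0, 1) noise via -log(-log(U)), U ~ Uniform(0, 1)."""
    u = rng.uniform(size=shape)
    return -np.log(-np.log(u + eps) + eps)

def categorical_sample_gumbel_max(log_pi):
    """Hard one-hot sample: one_hot(argmax(log_pi + g)). The arg max has zero
    gradient almost everywhere, so backpropagation cannot pass through it."""
    g = sample_gumbel(log_pi.shape)
    z = np.zeros_like(log_pi)
    z[np.argmax(log_pi + g)] = 1.0
    return z

log_pi = np.log(np.array([0.1, 0.6, 0.3]))    # class log-probabilities
print(categorical_sample_gumbel_max(log_pi))  # e.g. [0. 1. 0.]
```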
Highlights
  • Stochastic neural networks with discrete random variables are a powerful technique for representing distributions encountered in unsupervised learning, language modeling, attention mechanisms, and reinforcement learning domains
  • Stochastic networks with discrete variables are difficult to train because the backpropagation algorithm — while permitting efficient computation of parameter gradients — cannot be applied to non-differentiable layers
  • Prior work on stochastic gradient estimation has traditionally focused on either score function estimators augmented with Monte Carlo variance reduction techniques (Paisley et al, 2012; Mnih & Gregor, 2014; Gu et al, 2016; Gregor et al, 2013), or biased path derivative estimators for Bernoulli variables (Bengio et al, 2013)
  • Each estimator is evaluated on two tasks: (1) structured output prediction and (2) variational training of generative models
  • We show that Gumbel-Softmax and Straight-Through Gumbel-Softmax are effective on structured output prediction and variational autoencoder tasks, outperforming existing stochastic gradient estimators for both Bernoulli and categorical latent variables
Results
  • In the first set of experiments, the authors compare Gumbel-Softmax and ST Gumbel-Softmax to other stochastic gradient estimators: Score-Function (SF), DARN, MuProp, Straight-Through (ST), and Slope-Annealed ST.
  • Each estimator is evaluated on two tasks: (1) structured output prediction and (2) variational training of generative models.
  • The authors use the MNIST dataset with fixed binarization for training and evaluation, which is common practice for evaluating stochastic gradient estimators (Salakhutdinov & Murray, 2008; Larochelle & Murray, 2011).
  • Models were trained using stochastic gradient descent with momentum 0.9 (a minimal sketch of this update rule follows below).
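
For concreteness, the sketch below shows the classical SGD-with-momentum update referenced in the last bullet. Only the momentum value 0.9 comes from the text above; the toy quadratic objective, the learning rate, and the number of steps are illustrative assumptions.

```python
import numpy as np

def loss_and_grad(theta):
    """Toy quadratic objective (a stand-in for the model's training loss)."""
    return 0.5 * np.sum(theta ** 2), theta

theta = np.array([3.0, -2.0])
velocity = np.zeros_like(theta)
lr, momentum = 0.1, 0.9                          # momentum 0.9 as in the experiments

for step in range(200):
    _, grad = loss_and_grad(theta)
    velocity = momentum * velocity - lr * grad   # classical momentum update
    theta = theta + velocity

print(theta)  # approaches the minimizer at the origin
```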
Conclusion
  • The primary contribution of this work is the reparameterizable Gumbel-Softmax distribution, whose corresponding estimator affords low-variance path derivative gradients for the categorical distribution.
  • The authors show that Gumbel-Softmax and Straight-Through Gumbel-Softmax are effective on structured output prediction and variational autoencoder tasks, outperforming existing stochastic gradient estimators for both Bernoulli and categorical latent variables.
  • Gumbel-Softmax enables dramatic speedups in inference over discrete latent variables.
Summary
  • Stochastic neural networks with discrete random variables are a powerful technique for representing distributions encountered in unsupervised learning, language modeling, attention mechanisms, and reinforcement learning domains.
  • Stochastic networks with discrete variables are difficult to train because the backpropagation algorithm — while permitting efficient computation of parameter gradients — cannot be applied to non-differentiable layers.
  • 1. We introduce Gumbel-Softmax, a continuous distribution on the simplex that can approximate categorical samples, and whose parameter gradients can be computed via the reparameterization trick.
  • 2. We show experimentally that Gumbel-Softmax outperforms all single-sample gradient estimators on both Bernoulli variables and categorical variables.
  • 3. We show that this estimator can be used to efficiently train semi-supervised models (e.g. Kingma et al (2014)) without costly marginalization over unobserved categorical latent variables.
  • We use the softmax function as a continuous, differentiable approximation to arg max, and generate k-dimensional sample vectors $y \in \Delta^{k-1}$ with components $y_i = \exp((\log \pi_i + g_i)/\tau) \,/\, \sum_{j=1}^{k} \exp((\log \pi_j + g_j)/\tau)$, where $g_1, \ldots, g_k$ are i.i.d. Gumbel(0, 1) samples and $\tau$ is a temperature parameter (see the sketch after this list).
  • We refer to this procedure of replacing non-differentiable categorical samples with a differentiable approximation during training as the Gumbel-Softmax estimator.
  • While Gumbel-Softmax samples are differentiable, they are not identical to samples from the corresponding categorical distribution for non-zero temperature.
  • Gumbel-Softmax is a path derivative estimator for a continuous distribution y that approximates z.
  • Gumbel-Softmax allows us to backpropagate through y ∼ qφ(y|x) for single sample gradient estimation, and achieves a cost of O(D + I + G) per training step.
  • In our first set of experiments, we compare Gumbel-Softmax and ST Gumbel-Softmax to other stochastic gradient estimators: Score-Function (SF), DARN, MuProp, Straight-Through (ST), and Slope-Annealed ST.
  • Samples drawn from the Gumbel-Softmax distribution are continuous during training, but are discretized to one-hot vectors during evaluation.
  • As shown in Figure 3, ST Gumbel-Softmax is on par with the other estimators for Bernoulli variables and outperforms them on categorical variables.
  • Gumbel-Softmax outperforms other estimators on both Bernoulli and Categorical variables.
  • We train variational autoencoders (Kingma & Welling, 2013), where the objective is to learn a generative model of binary MNIST images.
  • We use a learned categorical prior rather than a Gumbel-Softmax prior in the training objective.
  • Gumbel-Softmax allows us to backpropagate directly through single samples from the joint qφ(y, z|x), achieving drastic speedups in training without compromising generative or classification performance.
  • The primary contribution of this work is the reparameterizable Gumbel-Softmax distribution, whose corresponding estimator affords low-variance path derivative gradients for the categorical distribution.
  • We show that Gumbel-Softmax and Straight-Through Gumbel-Softmax are effective on structured output prediction and variational autoencoder tasks, outperforming existing stochastic gradient estimators for both Bernoulli and categorical latent variables.
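
The following is a minimal NumPy sketch of the sampling procedure summarized above: Gumbel(0, 1) noise is added to the log class probabilities, the result is scaled by a temperature τ, and a softmax replaces the hard arg max; the Straight-Through (ST) variant additionally discretizes the soft sample to a one-hot vector for the forward pass, while an autodiff framework would route the gradient through the soft sample. Function names are ours and the example probabilities are illustrative; this is a sketch of the described estimator, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gumbel(shape, eps=1e-20):
    """i.i.d. Gumbel(0, 1) noise via -log(-log(U)), U ~ Uniform(0, 1)."""
    u = rng.uniform(size=shape)
    return -np.log(-np.log(u + eps) + eps)

def gumbel_softmax_sample(log_pi, tau):
    """Soft sample on the simplex: softmax((log_pi + g) / tau)."""
    logits = (log_pi + sample_gumbel(log_pi.shape)) / tau
    logits = logits - logits.max()        # shift for numerical stability
    e = np.exp(logits)
    return e / e.sum()

def st_gumbel_softmax_sample(log_pi, tau):
    """Straight-Through variant: a one-hot value is used in the forward pass,
    while the soft sample y is what gradients would flow through."""
    y = gumbel_softmax_sample(log_pi, tau)
    z = np.zeros_like(y)
    z[np.argmax(y)] = 1.0
    return z, y

log_pi = np.log(np.array([0.1, 0.6, 0.3]))
print(gumbel_softmax_sample(log_pi, tau=0.1))        # nearly one-hot at low temperature
print(gumbel_softmax_sample(log_pi, tau=5.0))        # nearly uniform at high temperature
print(st_gumbel_softmax_sample(log_pi, tau=1.0)[0])  # discretized one-hot sample
```

At low temperature the soft samples approach one-hot vectors, consistent with the bullet above noting that Gumbel-Softmax samples match the corresponding categorical distribution only in the zero-temperature limit.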
Tables
  • Table1: The Gumbel-Softmax estimator outperforms other estimators on Bernoulli and Categorical latent variables. For the structured output prediction (SBN) task, numbers correspond to negative log-likelihoods (nats) of input images (lower is better). For the VAE task, numbers correspond to negative variational lower bounds (nats) on the log-likelihood (lower is better)
  • Table2: Marginalizing over y and single-sample variational inference perform equally well when applied to image classification on the binarized MNIST dataset (Larochelle & Murray, 2011). We report variational lower bounds and image classification accuracy for unlabeled data in the test set.
Related work
  • In this section we review existing stochastic gradient estimation techniques for discrete variables (illustrated in Figure 2). Consider a stochastic computation graph (Schulman et al, 2015) with a discrete random variable z whose distribution depends on a parameter θ, and a cost function f(z). The objective is to minimize the expected cost L(θ) = E_{z∼p_θ(z)}[f(z)] via gradient descent, which requires us to estimate ∇_θ E_{z∼p_θ(z)}[f(z)].

    3.1 PATH DERIVATIVE GRADIENT ESTIMATORS

    For distributions that are reparameterizable, the sample z can be computed as a deterministic function g of the parameters θ and an independent random variable ε, so that z = g(θ, ε). The path-wise gradients from f to θ can then be computed without encountering any stochastic nodes:

    $$\frac{\partial}{\partial \theta}\,\mathbb{E}_{z \sim p_\theta}[f(z)] \;=\; \frac{\partial}{\partial \theta}\,\mathbb{E}_{\varepsilon}\big[f(g(\theta, \varepsilon))\big] \;=\; \mathbb{E}_{\varepsilon \sim p_\varepsilon}\!\left[\frac{\partial f}{\partial g}\,\frac{\partial g}{\partial \theta}\right] \tag{4}$$

    For example, the normal distribution z ∼ N(μ, σ) can be re-written as μ + σ · ε with ε ∼ N(0, 1), making it trivial to compute ∂z/∂μ and ∂z/∂σ. This reparameterization trick is commonly applied to training variational autoencoders with continuous latent variables using backpropagation (Kingma & Welling, 2013; Rezende et al, 2014b). As shown in Figure 2, we exploit such a trick in the construction of the Gumbel-Softmax estimator.
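
To make Equation (4) concrete, here is a small Monte Carlo check of the path-derivative estimator for the Gaussian example above. The toy cost f(z) = z² and all constants are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8

f  = lambda z: z ** 2        # toy cost function
df = lambda z: 2.0 * z       # its derivative, df/dz

eps = rng.standard_normal(100_000)
z = mu + sigma * eps         # reparameterized sample: z = g(theta, eps)

# Path-derivative (reparameterization) estimates of the gradient of E[f(z)]:
grad_mu    = np.mean(df(z))          # dz/dmu    = 1
grad_sigma = np.mean(df(z) * eps)    # dz/dsigma = eps

# Analytic check: E[z^2] = mu^2 + sigma^2, so d/dmu = 2*mu and d/dsigma = 2*sigma.
print(grad_mu,    2 * mu)      # both approximately 3.0
print(grad_sigma, 2 * sigma)   # both approximately 1.6
```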
Contributions
  • Presents an efficient gradient estimator that replaces the non-differentiable sample from a categorical distribution with a differentiable sample from a novel Gumbel-Softmax distribution
  • Introduces Gumbel-Softmax, a continuous distribution on the simplex that can approximate categorical samples, and whose parameter gradients can be computed via the reparameterization trick
  • Shows experimentally that Gumbel-Softmax outperforms all single-sample gradient estimators on both Bernoulli variables and categorical variables
  • Shows that this estimator can be used to efficiently train semi-supervised models (e.g. Kingma et al, 2014) without costly marginalization over unobserved categorical latent variables
References
  • Y. Bengio, N. Leonard, and A. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
  • X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. CoRR, abs/1606.03657, 2016.
  • J. Chung, S. Ahn, and Y. Bengio. Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704, 2016.
  • P. W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84, 1990.
  • A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwinska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, 2016.
  • A. Graves, G. Wayne, and I. Danihelka. Neural Turing machines. CoRR, abs/1410.5401, 2014.
  • K. Gregor, I. Danihelka, A. Mnih, C. Blundell, and D. Wierstra. Deep autoregressive networks. arXiv preprint arXiv:1310.8499, 2013.
  • S. Gu, S. Levine, I. Sutskever, and A. Mnih. MuProp: Unbiased backpropagation for stochastic neural networks. ICLR, 2016.
  • E. J. Gumbel. Statistical theory of extreme values and some practical applications: a series of lectures. Number 33. US Govt. Print. Office, 1954.
  • D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
  • D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581–3589, 2014.
  • H. Larochelle and I. Murray. The neural autoregressive distribution estimator. In AISTATS, volume 1, pp. 2, 2011.
  • C. J. Maddison, D. Tarlow, and T. Minka. A* sampling. In Advances in Neural Information Processing Systems, pp. 3086–3094, 2014.
  • C. J. Maddison, A. Mnih, and Y. W. Teh. The Concrete distribution: A continuous relaxation of discrete random variables. arXiv e-prints, November 2016.
  • A. Mnih and K. Gregor. Neural variational inference and learning in belief networks. ICML, 31, 2014.
  • A. Mnih and D. J. Rezende. Variational inference for Monte Carlo objectives. arXiv preprint arXiv:1602.06725, 2016.
  • J. Paisley, D. Blei, and M. Jordan. Variational Bayesian inference with stochastic search. arXiv e-prints, June 2012.
  • G. Pereyra, G. Hinton, G. Tucker, and L. Kaiser. Regularizing neural networks by penalizing confident output distributions. 2016.
  • J. W. Rae, J. J. Hunt, T. Harley, I. Danihelka, A. Senior, G. Wayne, A. Graves, and T. P. Lillicrap. Scaling memory-augmented neural networks with sparse reads and writes. arXiv e-prints, October 2016.
  • T. Raiko, M. Berglund, G. Alain, and L. Dinh. Techniques for learning binary stochastic feedforward neural networks. arXiv preprint arXiv:1406.2989, 2014.
  • D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014a.
  • D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of The 31st International Conference on Machine Learning, pp. 1278–1286, 2014b.
  • J. T. Rolfe. Discrete variational autoencoders. arXiv e-prints, September 2016.
  • R. Salakhutdinov and I. Murray. On the quantitative analysis of deep belief networks. In Proceedings of the 25th International Conference on Machine Learning, pp. 872–879. ACM, 2008.
  • J. Schulman, N. Heess, T. Weber, and P. Abbeel. Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems, pp. 3528–3536, 2015.
  • C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.
  • R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. CoRR, abs/1502.03044, 2015.