# DisARM: An Antithetic Gradient Estimator for Binary Latent Variables

NeurIPS 2020.

Abstract:

Training models with discrete latent variables is challenging due to the difficulty of estimating the gradients accurately. Much of the recent progress has been achieved by taking advantage of continuous relaxations of the system, which are not always available or even possible. The Augment-REINFORCE-Merge (ARM) estimator provides an alte...

Introduction

- One often requires the gradient of an expectation with respect to the parameters of the distribution over which the expectation is taken.
- In all but the simplest settings, the expectation is analytically intractable and the gradient is estimated using Monte Carlo sampling.
- This problem is encountered, for example, in modern variational inference, where one would like to maximize a variational lower bound with respect to the parameters of the variational posterior.
- The Bernoulli distribution is not a location-scale distribution, so results derived for location-scale distributions are not immediately applicable.
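
As a minimal sketch of the Monte Carlo setting described above, the following NumPy snippet estimates the gradient of E[f(b)] for a single Bernoulli variable with the score-function (REINFORCE) estimator and compares it with the exact value. The toy function `f`, the logit value, and all function names are assumptions chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def exact_grad(f, logit):
    """Exact d/d(logit) of E[f(b)] for one Bernoulli(sigmoid(logit)) variable."""
    p = sigmoid(logit)
    # E[f(b)] = p * f(1) + (1 - p) * f(0), and dp/d(logit) = p * (1 - p).
    return p * (1.0 - p) * (f(1.0) - f(0.0))

def score_function_grad(f, logit, num_samples, rng):
    """Monte Carlo score-function estimate of the same gradient."""
    p = sigmoid(logit)
    b = (rng.uniform(size=num_samples) < p).astype(np.float64)
    # With a logit parameterization, d/d(logit) log q(b) = b - p.
    score = b - p
    return np.mean(f(b) * score)

# Toy objective (an assumption for illustration only).
f = lambda b: (b - 0.45) ** 2

rng = np.random.default_rng(0)
estimate = score_function_grad(f, 0.3, 200_000, rng)
exact = exact_grad(f, 0.3)
```

In this one-dimensional case the exact gradient is available in closed form, which is what makes the comparison possible; with many latent variables the sum over configurations grows exponentially and only the sampled estimate remains practical.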

Highlights

- We often require the gradient of an expectation with respect to the parameters of the distribution
- We have introduced DisARM, an unbiased, low-variance gradient estimator for Bernoulli random variables based on antithetic sampling
- Our starting point was the ARM estimator (Yin and Zhou, 2019), which reparameterizes Bernoulli variables in terms of Logistic variables and estimates the REINFORCE gradient over the Logistic variables using antithetic sampling
- Our key insight is that the ARM estimator involves unnecessary randomness because it operates on the augmenting Logistic variables instead of the original Bernoulli ones
- ARM is competitive despite rather than because of the Logistic augmentation step, and its low variance is completely due to the use of antithetic sampling
- Although we could naïvely apply DisARM to the multi-sample objective, our preliminary experiments did not suggest that this improved performance over VIMCO
- We derive DisARM by integrating out the augmenting variables from ARM using a variance reduction technique known as conditioning
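
The highlights above can be made concrete with a small NumPy sketch of a DisARM-style estimator for independent Bernoulli variables parameterized by logits. It draws Logistic noise, forms an antithetic pair of binary samples, and weights the difference in function values by a term that depends only on the two binary samples. The function names, the toy objective `f`, and the logit values are assumptions for illustration; the brute-force `exact_grad` helper is included only to check the estimate on a tiny problem.

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def disarm_grad(f, logits, eps):
    """Per-sample DisARM-style estimates of d/d(logits) E[f(b)].

    eps: Logistic(0, 1) noise of shape (num_samples, dim).
    f:   maps a (num_samples, dim) binary array to (num_samples,) values.
    """
    b = (logits + eps > 0).astype(np.float64)
    b_anti = (logits - eps > 0).astype(np.float64)   # antithetic sample
    half_diff = 0.5 * (f(b) - f(b_anti))[:, None]
    sign = 1.0 - 2.0 * b_anti                        # equals (-1) ** b_anti
    # Only coordinates where the pair disagrees contribute.
    weight = sign * (b != b_anti) * sigmoid(np.abs(logits))
    return half_diff * weight

def exact_grad(f, logits):
    """Exact gradient by enumerating all 2 ** dim configurations."""
    p = sigmoid(logits)
    grad = np.zeros_like(logits)
    for bits in itertools.product([0.0, 1.0], repeat=logits.shape[0]):
        b = np.array(bits)
        prob = np.prod(np.where(b == 1.0, p, 1.0 - p))
        # d/d(logits) log q(b) = b - p for independent Bernoullis.
        grad += prob * f(b[None, :])[0] * (b - p)
    return grad

rng = np.random.default_rng(0)
logits = np.array([0.5, -0.3, 0.1])               # hypothetical parameters
f = lambda b: (b.sum(axis=1) - 1.2) ** 2          # toy, non-separable objective
eps = rng.logistic(size=(200_000, logits.shape[0]))

estimate = disarm_grad(f, logits, eps).mean(axis=0)
exact = exact_grad(f, logits)
```

Note that the returned estimate depends on the noise only through the pair of binary samples, which is the sense in which the augmenting Logistic variables have been integrated out.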

Results

- The goal was variance reduction to improve optimization, so the authors compare DisARM to the state-of-the-art methods: ARM (Yin and Zhou, 2019) and RELAX (Grathwohl et al., 2018) for the general case, and VIMCO (Mnih and Rezende, 2016) for the multi-sample variational bound.
- The authors also include a two-independent-sample REINFORCE estimator with a leave-one-out baseline (REINFORCE LOO, Kool et al., 2019).
- This is a simple but competitive method that has been omitted from previous works.
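
A rough sketch of the two-sample leave-one-out idea, assuming independent Bernoulli variables parameterized by logits: each of the two independent samples uses the other sample's function value as its baseline. The names and the toy objective are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reinforce_loo_grad(f, logits, b1, b2):
    """Two-sample leave-one-out REINFORCE estimates of d/d(logits) E[f(b)].

    b1, b2: independent binary samples of shape (num_samples, dim).
    """
    p = sigmoid(logits)
    f1, f2 = f(b1), f(b2)
    score1 = b1 - p          # d/d(logits) log q(b1)
    score2 = b2 - p          # d/d(logits) log q(b2)
    # Each sample is baselined by the other sample's function value.
    return 0.5 * ((f1 - f2)[:, None] * score1 + (f2 - f1)[:, None] * score2)

rng = np.random.default_rng(0)
logits = np.array([0.3])                      # hypothetical parameter
f = lambda b: (b[:, 0] - 0.45) ** 2           # toy objective
p = sigmoid(logits)
n = 200_000
b1 = (rng.uniform(size=(n, 1)) < p).astype(np.float64)
b2 = (rng.uniform(size=(n, 1)) < p).astype(np.float64)

estimate = reinforce_loo_grad(f, logits, b1, b2).mean(axis=0)
# Closed-form gradient for a single Bernoulli variable.
exact = p * (1.0 - p) * ((1.0 - 0.45) ** 2 - (0.0 - 0.45) ** 2)
```

Unlike an antithetic pair, the two samples here are drawn independently, so the estimator costs the same two function evaluations but does not exploit negative correlation between them.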

Conclusion

- The authors have introduced DisARM, an unbiased, low-variance gradient estimator for Bernoulli random variables based on antithetic sampling.
- The authors' key insight is that the ARM estimator involves unnecessary randomness because it operates on the augmenting Logistic variables instead of the original Bernoulli ones.
- ARM is competitive despite rather than because of the Logistic augmentation step, and its low variance is completely due to the use of antithetic sampling.
- Given DisARM’s generality and simplicity, the authors expect it to be widely useful.
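
The claim that ARM carries extra randomness can be checked empirically on a toy problem. The sketch below, under the assumption of independent Bernoulli variables with logits near zero, evaluates an ARM-style and a DisARM-style estimator on the same noise and compares their empirical variances. All names, the objective, and the parameter values are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def antithetic_pair(logits, eps):
    """Binary sample and its antithetic counterpart from Logistic noise."""
    b = (logits + eps > 0).astype(np.float64)
    b_anti = (logits - eps > 0).astype(np.float64)
    return b, b_anti

def arm_grad(f, logits, eps):
    """ARM-style estimate: depends on the continuous noise through u."""
    b, b_anti = antithetic_pair(logits, eps)
    u = sigmoid(eps)                     # uniform variable behind the noise
    return (f(b) - f(b_anti))[:, None] * (u - 0.5)

def disarm_grad(f, logits, eps):
    """DisARM-style estimate: depends on the noise only through (b, b_anti)."""
    b, b_anti = antithetic_pair(logits, eps)
    weight = (1.0 - 2.0 * b_anti) * (b != b_anti) * sigmoid(np.abs(logits))
    return 0.5 * (f(b) - f(b_anti))[:, None] * weight

rng = np.random.default_rng(0)
logits = np.array([0.0, 0.2, -0.2])          # hypothetical parameters
f = lambda b: (b.sum(axis=1) - 1.2) ** 2     # toy objective
eps = rng.logistic(size=(200_000, 3))

arm = arm_grad(f, logits, eps)
disarm = disarm_grad(f, logits, eps)
arm_var = arm.var(axis=0)
disarm_var = disarm.var(axis=0)
```

Both estimators target the same gradient, so their sample means should agree closely; the difference shows up in `arm_var` versus `disarm_var`. The size of the gap depends on the logits and on `f`, so this toy comparison says nothing about the benchmarks reported in the paper.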

Summary

## Objectives:

- To fit the parameters of a discrete latent variable model $p_\theta(x, b)$, one can lower bound the log marginal likelihood, $\log p_\theta(x) \geq \mathbb{E}_{q_\theta(b|x)}[\log p_\theta(x, b) - \log q_\theta(b|x)]$, where $q_\theta(b|x)$ is a variational distribution.
- For the multi-sample bound, the dependence of the weight $w$ on $\theta$ is omitted because it is straightforward to account for.
- In this case, Mnih and Rezende (2016) introduced a gradient estimator, VIMCO, that uses specialized control variates that take advantage of the structure of the objective $\log \frac{1}{K} \sum_{j} w(b_j)$.
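
For intuition about the multi-sample objective, here is a small sketch that evaluates the log of an average of K weights from their logarithms in a numerically stable way. It assumes the weights are supplied as log-values, one per sample; the function name and the example numbers are made up for illustration.

```python
import numpy as np

def multi_sample_bound(log_w):
    """Compute log((1/K) * sum_k w_k) from log-weights of shape (K,)."""
    log_w = np.asarray(log_w, dtype=np.float64)
    m = np.max(log_w)
    # Subtract the maximum before exponentiating to avoid underflow.
    return m + np.log(np.mean(np.exp(log_w - m)))

# Hypothetical log-weights for K = 3 samples.
log_w = np.array([-90.0, -92.0, -95.0])
bound = multi_sample_bound(log_w)
average_log_weight = log_w.mean()    # single-sample style average
```

By Jensen's inequality `bound` is never smaller than `average_log_weight`, and the two coincide when all weights are equal.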

- Table 1: Mean variational lower bounds and the standard error of the mean, computed based on 5 runs from different random initializations. The best performing method (up to the standard error) for each task is in bold. To provide a computationally fair comparison between VIMCO 2K-samples and DisARM K-pairs, we report the 2K-sample bound for both, even though DisARM optimizes the K-sample bound.
- Table 2: Results for models trained by maximizing the ELBO. We report the mean and the standard error of the mean for the ELBO on the training set and for the 100-sample bound on the test set. The results were computed based on 5 runs from different random initializations. The best performing method (up to the standard error) for each task is in bold.
- Table 3: Train and test variational lower bounds for models trained using the multi-sample objective. We report the mean and the standard error of the mean, computed based on 5 runs from different random initializations. The best performing method (up to the standard error) for each task is in bold. To provide a computationally fair comparison between VIMCO 2K-samples and DisARM K-pairs, we report the 2K-sample bound for both, even though DisARM optimizes the K-sample bound.

Related work

- Virtually all unbiased gradient estimators for discrete variables in machine learning are variants of the score function (SF) estimator (Fu, 2006), also known as REINFORCE or the likelihood-ratio estimator. As the naive SF estimator tends to have high variance, these estimators differ in the variance reduction techniques they employ. The most widely used of these techniques are control variates (Owen, 2013). Constant multiples of the score function itself are the most widely used control variates, known as baselines. The original formulation of REINFORCE (Williams, 1992) already included a baseline, as did its earliest specializations to variational inference (Paisley et al., 2012; Wingate and Weber, 2013; Ranganath et al., 2014; Mnih and Gregor, 2014). When the function f(b) is differentiable, more sophisticated control variates can be obtained by incorporating the gradient of f. MuProp (Gu et al., 2016) takes the “mean field” approach by evaluating the gradient at the means of the latent variables, while REBAR (Tucker et al., 2017) obtains the gradient by applying the Gumbel-Softmax / Concrete relaxation (Jang et al., 2017; Maddison et al., 2017) to the latent variables and then using the reparameterization trick. RELAX (Grathwohl et al., 2018) extends REBAR by augmenting it with a free-form control variate. In principle, RELAX does not require a continuous relaxation; in practice, however, the strong performance previously reported relies on the continuous relaxation.
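
The reason a baseline leaves the score function estimator unbiased can be written out in one line. The identity below assumes that f does not depend on θ and that c is any constant that does not depend on b; the symbol c is introduced here for illustration.

```latex
\nabla_\theta \mathbb{E}_{q_\theta(b)}[f(b)]
  = \mathbb{E}_{q_\theta(b)}\big[(f(b) - c)\,\nabla_\theta \log q_\theta(b)\big],
\qquad\text{since}\qquad
\mathbb{E}_{q_\theta(b)}\big[\nabla_\theta \log q_\theta(b)\big] = 0 .
```

Subtracting c changes the variance of the Monte Carlo estimate but not its expectation, which is why the choice of baseline is purely a variance reduction decision.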

Reference

- Bengio, Y., Léonard, N., and Courville, A. (2013). Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
- Buesing, L., Weber, T., and Mohamed, S. (2016). Stochastic gradient estimation with finite differences. In NIPS 2016 Workshop on Advances in Approximate Inference.
- Burda, Y., Grosse, R., and Salakhutdinov, R. (2016). Importance weighted autoencoders. In Proceedings of the 4th International Conference on Learning Representations.
- Fu, M. C. (2006). Gradient estimation. Handbooks in operations research and management science, 13:575–616.
- Glynn, P. W. (1990). Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84.
- Grathwohl, W., Choi, D., Wu, Y., Roeder, G., and Duvenaud, D. (2018). Backpropagation through the void: Optimizing control variates for black-box gradient estimation. In International Conference on Learning Representations.
- Gu, S., Levine, S., Sutskever, I., and Mnih, A. (2016). MuProp: Unbiased backpropagation for stochastic neural networks. In International Conference on Learning Representations.
- Jang, E., Gu, S., and Poole, B. (2017). Categorical reparameterization with gumbel-softmax. In International Conference on Learning Representations.
- Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine learning, 37(2):183–233.
- Kingma, D. and Ba, J. (2015). Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations.
- Kingma, D. P. and Welling, M. (2014). Auto-encoding variational bayes. In International Conference on Learning Representations.
- Kool, W., van Hoof, H., and Welling, M. (2019). Buy 4 reinforce samples, get a baseline for free! In Deep RL Meets Structured Prediction ICLR Workshop.
- Maas, A. L., Hannun, A. Y., and Ng, A. Y. (2013). Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing.
- Maddison, C. J., Mnih, A., and Teh, Y. W. (2017). The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In International Conference on Learning Representations.
- Mnih, A. and Gregor, K. (2014). Neural variational inference and learning in belief networks. In Proceedings of The 31st International Conference on Machine Learning, pages 1791–1799.
- Mnih, A. and Rezende, D. (2016). Variational inference for monte carlo objectives. In Proceedings of The 33rd International Conference on Machine Learning, pages 2188–2196.
- Owen, A. B. (2013). Monte Carlo theory, methods and examples.
- Paisley, J., Blei, D. M., and Jordan, M. I. (2012). Variational Bayesian inference with stochastic search. In Proceedings of the 29th International Conference on Machine Learning, pages 1363–1370.
- Ranganath, R., Gerrish, S., and Blei, D. M. (2014). Black box variational inference. In AISTATS, pages 814–822.
- Ren, H., Zhao, S., and Ermon, S. (2019). Adaptive antithetic sampling for variance reduction. In Proceedings of the 36th International Conference on Machine Learning.
- Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of The 31st International Conference on Machine Learning, pages 1278–1286.
- Titsias, M. K. and Lázaro-Gredilla, M. (2015). Local expectation gradients for black box variational inference. In Advances in Neural Information Processing Systems, pages 2638–2646.
- Tucker, G., Mnih, A., Maddison, C. J., Lawson, J., and Sohl-Dickstein, J. (2017). REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems 30.
- Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256.
- Wingate, D. and Weber, T. (2013). Automated variational inference in probabilistic programming. arXiv preprint arXiv:1301.1299.
- Wu, M., Goodman, N., and Ermon, S. (2019). Differentiable antithetic sampling for variance reduction in stochastic variational inference. In AISTATS.
- Yin, M., Yue, Y., and Zhou, M. (2019). ARSM: Augment-REINFORCE-swap-merge estimator for gradient backpropagation through categorical variables. In Proceedings of the 36th International Conference on Machine Learning.
- Yin, M. and Zhou, M. (2019). ARM: Augment-REINFORCE-merge gradient for stochastic binary networks. In International Conference on Learning Representations.

- Input images to the networks were centered with the global mean of the training dataset. For the nonlinear network activations, we used leaky rectified linear units (LeakyReLU, Maas et al., 2013) with a negative slope coefficient of 0.3, as in Yin and Zhou (2019). The parameters of the inference and generation networks were optimized with Adam (Kingma and Ba, 2015) using learning rate $10^{-4}$. The logits for the prior distribution $p(b)$ were optimized using SGD with learning rate $10^{-2}$, as in Yin and Zhou (2019). For RELAX, we initialize the trainable temperature and scaling factor of the control variate to 0.1 and 1.0, respectively. The learned control variate in RELAX was a single-hidden-layer neural network with 137 LeakyReLU units. The control variate parameters were also optimized with Adam using learning rate $10^{-4}$.
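
The training settings quoted above can be collected in a small configuration sketch. The dictionary keys and the `leaky_relu` helper are hypothetical names chosen for illustration; only the numeric values and the optimizer choices come from the text.

```python
import numpy as np

def leaky_relu(x, negative_slope=0.3):
    """LeakyReLU with the negative slope coefficient quoted in the text."""
    x = np.asarray(x, dtype=np.float64)
    return np.where(x >= 0, x, negative_slope * x)

# Hypothetical key names; values are the ones stated in the text.
HPARAMS = {
    "leaky_relu_negative_slope": 0.3,
    "model_optimizer": "adam",              # inference and generation networks
    "model_learning_rate": 1e-4,
    "prior_optimizer": "sgd",               # logits of the prior p(b)
    "prior_learning_rate": 1e-2,
    "relax_init_temperature": 0.1,
    "relax_init_scale": 1.0,
    "relax_control_variate_hidden_units": 137,
    "relax_control_variate_learning_rate": 1e-4,
}

activations = leaky_relu([-1.0, 0.0, 2.0])
```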
