Gaussian Process Bandit Optimization of the Thermodynamic Variational Objective
NeurIPS 2020
Achieving the full promise of the Thermodynamic Variational Objective (TVO), a recently proposed variational lower bound on the log evidence involving a one-dimensional Riemann integral approximation, requires choosing a "schedule" of sorted discretization points. This paper introduces a bespoke Gaussian process bandit optimization method ...
- The Variational Autoencoder (VAE) framework has formed the basis for a number of recent advances in unsupervised representation learning [17, 35, 41].
- The VAE framework introduces an inference network, which seeks to approximate the true posterior over latent variables.
- The authors build upon the recent Thermodynamic Variational Objective (TVO), which frames log-likelihood estimation as a one-dimensional integral over the unit interval.
- The integral is estimated using a Riemann sum approximation, as visualized in Figure 1, yielding a natural family of variational inference objectives that generalize and tighten the ELBO.
- In Appendix D, we explore learning and inference in a discrete probabilistic context-free grammar, showing that the TVO objective and our bandit optimization translate to other learning settings.
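The Riemann-sum construction described in the bullets above can be sketched in a few lines, given per-sample log importance weights. This is a minimal illustrative sketch, not the authors' implementation; `tvo_lower_bound` and its arguments are hypothetical names:

```python
import numpy as np

def tvo_lower_bound(log_w, betas):
    """Left-Riemann-sum TVO lower bound on log p(x).

    log_w : array [S] of log importance weights log p(x, z_s) - log q(z_s | x)
    betas : sorted schedule 0 = beta_0 < beta_1 < ... < beta_{d-1} < 1
    """
    edges = np.append(betas, 1.0)   # append the right endpoint beta_d = 1
    widths = np.diff(edges)         # partition widths beta_{k+1} - beta_k
    bound = 0.0
    for beta, width in zip(betas, widths):
        # E_{pi_beta}[log w] via self-normalized weights proportional to w^beta
        tilted = beta * log_w
        tilted = tilted - tilted.max()          # stabilize the exponentials
        snis = np.exp(tilted) / np.exp(tilted).sum()
        bound += width * np.sum(snis * log_w)
    return bound
```

With the single point `betas = [0.0]` this recovers the ELBO estimate (a uniform average of log weights); adding points tightens the bound toward the log evidence, which is why the placement of the schedule matters.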
- We have presented a new approach for automated selection of the integration schedule for the Thermodynamic Variational Objective
- Our bandit framework optimizes a reward function that is directly linked to improvements in the generative model evidence over the course of training the model parameters
- We show theoretically that this procedure asymptotically minimizes the regret as a function of the choice of schedule
- Our algorithm, like all other existing schedules, still relies on the number of partitions d as a hyperparameter that is fixed over the course of training.
- The authors demonstrate the effectiveness of the method for training VAEs on MNIST and Fashion MNIST, and a Sigmoid Belief Network on binarized MNIST and binarized Omniglot, using the TVO objective.
- The authors' code is available at http://github.com/ntienvu/tvo_gp_bandit
- The authors demonstrated that the proposed approach empirically outperforms existing schedules in both model learning and inference for discrete and continuous generative models.
- The authors' GP bandit optimization offers a general solution to choosing the integration schedule in the TVO.
- Incorporating the adaptive selection of d into the bandit optimization remains an interesting direction for future work
- Table 1: Supporting notation in regret analysis. We use notation similar to Appendix C of Bogunovic et al. [5] when possible
- Table 2: Wall-clock time of the GP-bandit schedule compared to the grid search of [26] for the log schedule. The GP-bandit approach achieves a competitive test log likelihood and lower KL divergence compared with the grid-searched log schedule, while requiring significantly lower cumulative run-time
- Table 3: Comparison between permutation-invariant and non-permutation-invariant kernels on the MNIST dataset using S = 10 (top) and S = 50 (bottom). The best scores are in bold. Given T training epochs, the number of bandit updates, and thus the number of samples for fitting the GP, is T/w, where w = 10 is the update frequency. The permutation-invariant kernel is more favorable when there are fewer samples for fitting the GP, as seen at smaller epoch budgets T = 1000, 2000. Performance is comparable once a sufficiently large number of samples is collected, e.g., when T/w = 1000
- VM acknowledges the support of the Natural Sciences and Engineering Research Council of Canada (NSERC) under award number PGSD3-535575-2019 and the British Columbia Graduate Scholarship, award number 6768
- VM/FW acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), the Canada CIFAR AI Chairs Program, and the Intel Parallel Computing Centers program
- RB acknowledges support from the Defense Advanced Research Projects Agency (DARPA) under award FA8750-17-C-0106. This material is based upon work supported by the United States Air Force Research Laboratory (AFRL) under the Defense Advanced Research Projects Agency (DARPA) Data Driven Discovery Models (D3M) program (Contract No. FA8750-19-2-0222) and Learning with Less Labels (LwLL) program (Contract No. FA875019C0515)
- Additional support was provided by UBC’s Composites Research Network (CRN), Data Science Institute (DSI) and Support for Teams to Advance Interdisciplinary Research (STAIR) Grants
Study subjects and analysis
Continuous VAE: We present results of training a continuous VAE on the MNIST and Fashion MNIST datasets in Figure 4. We measure model learning performance using the test log evidence, as estimated by the IWAE bound with 5000 samples per data point. We also compare inference performance using DKL[qφ(z | x) || pθ(z | x)], which we calculate by subtracting the test ELBO from our estimate of log pθ(x)
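Both evaluation quantities in this paragraph can be computed from the same per-sample log importance weights. A minimal sketch (the function names are illustrative, not from the paper's code):

```python
import numpy as np

def iwae_bound(log_w):
    """IWAE estimate of log p(x): the log of the mean importance weight,
    computed stably via log-sum-exp (the paper uses 5000 samples)."""
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))

def kl_estimate(log_w):
    """DKL[q(z|x) || p(z|x)] approximated as the gap between the IWAE
    estimate of log p(x) and the ELBO (the mean log weight)."""
    return iwae_bound(log_w) - log_w.mean()
```

The KL estimate is the inference diagnostic used above: the better the approximate posterior, the smaller the gap between the estimated log evidence and the ELBO.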
- Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002.
- Atilim Gunes Baydin, Lei Shao, Wahid Bhimji, Lukas Heinrich, Saeid Naderiparizi, Andreas Munk, Jialin Liu, Bradley Gram-Hansen, Gilles Louppe, Lawrence Meadows, et al. Efficient probabilistic inference in the quest for physics beyond the standard model. In Advances in Neural Information Processing Systems, pages 5460–5473, 2019.
- Christopher M Bishop. Pattern recognition and machine learning. Springer, New York, 2006.
- David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518):859–877, 2017.
- Ilija Bogunovic, Jonathan Scarlett, and Volkan Cevher. Time-varying Gaussian process bandit optimization. In Artificial Intelligence and Statistics, pages 314–323, 2016.
- Rob Brekelmans, Vaden Masrani, Frank Wood, Greg Ver Steeg, and Aram Galstyan. All in the exponential family: Bregman duality in thermodynamic variational inference. In International Conference on Machine Learning, 2020.
- Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. In International Conference on Learning Representations, 2016.
- Chris Cremer, Xuechen Li, and David Duvenaud. Inference suboptimality in variational autoencoders. In International Conference on Machine Learning, pages 1078–1086, 2018.
- Roger Fletcher. Practical methods of optimization. John Wiley & Sons, 2013.
- Daan Frenkel and Berend Smit. Understanding molecular simulation: from algorithms to applications, volume 1.
- Andrew Gelman and Xiao-Li Meng. Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Statistical Science, pages 163–185, 1998.
- Junxian He, Daniel Spokoyny, Graham Neubig, and Taylor Berg-Kirkpatrick. Lagging inference networks and posterior collapse in variational autoencoders. In International Conference on Learning Representations, 2019.
- Philipp Hennig and Christian J Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13:1809–1837, 2012.
- José Miguel Hernández-Lobato, Matthew W Hoffman, and Zoubin Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems, pages 918–926, 2014.
- Donald R Jones, Matthias Schonlau, and William J Welch. Efficient global optimization of expensive black-box functions. Journal of Global optimization, 13(4):455–492, 1998.
- Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2014.
- Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
- Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.
- Aaron Klein, Stefan Falkner, Simon Bartels, Philipp Hennig, and Frank Hutter. Fast Bayesian optimization of machine learning hyperparameters on large datasets. In Artificial Intelligence and Statistics, pages 528–536, 2017.
- Andreas Krause and Cheng S Ong. Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems, pages 2447–2455, 2011.
- Brenden M Lake, Russ R Salakhutdinov, and Josh Tenenbaum. One-shot learning by inverting a compositional causal process. In Advances in Neural Information Processing Systems, pages 2526–2534, 2013.
- Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
- Tuan Anh Le, Adam R Kosiorek, N Siddharth, Yee Whye Teh, and Frank Wood. Revisiting reweighted wake-sleep for models with stochastic control flow. In Uncertainty in Artificial Intelligence, pages 1039–1049. PMLR, 2020.
- Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Yucen Luo, Alex Beatson, Mohammad Norouzi, Jun Zhu, David Duvenaud, Ryan P Adams, and Ricky TQ Chen. Sumo: Unbiased estimation of log marginal probability for latent variable models. arXiv preprint arXiv:2004.00353, 2020.
- Vaden Masrani, Tuan Anh Le, and Frank Wood. The thermodynamic variational objective. In Advances in Neural Information Processing Systems, pages 11521–11530, 2019.
- Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In International Conference on Machine Learning, pages 1791–1799, 2014.
- Andriy Mnih and Danilo J Rezende. Variational inference for Monte Carlo objectives. In International Conference on Machine Learning, pages 2188–2196, 2016.
- Vu Nguyen, Sebastian Schulze, and Michael A Osborne. Bayesian optimization for iterative learning. In Advances in Neural Information Processing Systems, 2020.
- Sebastian Nowozin. Debiasing evidence approximations: On importance-weighted autoencoders and jackknife variational inference. International Conference on Learning Representations, 2018.
- Favour Mandanji Nyikosa. Adaptive Bayesian optimization for dynamic problems. PhD thesis, University of Oxford, 2018.
- Yosihiko Ogata. A Monte Carlo method for high dimensional integration. Numerische Mathematik, 55(2):137–157, 1989.
- Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian processes for machine learning. MIT Press, 2006.
- Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538, 2015.
- Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286, 2014.
- Ruslan Salakhutdinov and Iain Murray. On the quantitative analysis of deep belief networks. In International Conference on Machine Learning, pages 872–879, 2008.
- Jasper Snoek, Kevin Swersky, Rich Zemel, and Ryan Adams. Input warping for Bayesian optimization of non-stationary functions. In International Conference on Machine Learning, pages 1674–1682, 2014.
- Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. In Advances in Neural Information Processing Systems 29, pages 3738–3746. 2016.
- Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning, pages 1015–1022, 2010.
- Kevin Swersky, Jasper Snoek, and Ryan P Adams. Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems, pages 2004–2012, 2013.
- Michael Tschannen, Olivier Bachem, and Mario Lucic. Recent advances in autoencoder-based representation learning. arXiv preprint arXiv:1812.05069, 2018.
- Mark van der Wilk, Matthias Bauer, ST John, and James Hensman. Learning invariances using the marginal likelihood. In Advances in Neural Information Processing Systems, pages 9938–9948, 2018.
- Martin J Wainwright and Michael I Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1-2):1–305, 2008.
- Frank Wood, Andrew Warrington, Saeid Naderiparizi, Christian Weilbach, Vaden Masrani, William Harvey, Adam Scibior, Boyan Beronov, and Ali Nasseri. Planning as inference in epidemiological models. arXiv preprint arXiv:2003.13221, 2020.
- Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
-  App. C: With time kernel kT(i, j) = (1 − ω)^(|i−j|/2), their proof at a high level proceeds by partitioning the T random functions into blocks of length N and bounding each using Mirsky's theorem. Referring to Table 1 for notation, this results in a bound on the maximum mutual information γN. Combining this with Eq. (58) and Eq. (60) (using N^(5/2) ≤ N^3, the latter obtained via a simple constrained optimization argument) and (26), Theorem 1 follows by arguments identical to those of Bogunovic et al. [5] (Appendix G.1).