
A Contour Stochastic Gradient Langevin Dynamics Algorithm For Simulations Of Multi-Modal Distributions

Advances in Neural Information Processing Systems (NeurIPS 2020), pp. 15725–15736


Abstract

We propose an adaptively weighted stochastic gradient Langevin dynamics (SGLD) algorithm, called contour stochastic gradient Langevin dynamics (CSGLD), for Bayesian learning in big data statistics. The proposed algorithm is essentially a scalable dynamic importance sampler, which automatically flattens the target distribution such that...

Introduction
  • AI safety has long been an important issue in the deep learning community. A promising solution to the problem is Markov chain Monte Carlo (MCMC), which leads to asymptotically correct uncertainty quantification for deep neural network (DNN) models.
  • Theoretical studies [Lelièvre et al., 2008, Liang, 2010, Fort et al., 2015] support the efficiency of the flat histogram algorithms in Monte Carlo computing for small data problems.
  • The use of the stochastic index J(x) avoids the evaluation of U(x) on the full data and significantly accelerates the computation of the algorithm; however, it introduces a small bias, depending on the mini-batch size n, into parameter estimation (see the sketch after this list).
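To make the role of J(x) concrete, here is a minimal sketch of a mini-batch energy estimate and the induced random partition index. The toy model, constants, and partition below are our own illustrative choices, not the paper's code.

```python
# Minimal sketch (our own toy setup, not the paper's code): a mini-batch
# estimate of the energy U(x) and the induced stochastic partition index J(x).
import numpy as np

rng = np.random.default_rng(0)
N, n = 10_000, 100                       # full data size, mini-batch size
data = rng.normal(loc=1.0, scale=2.0, size=N)

def U_i(x, d):
    """Per-datum energy of a toy Gaussian-mean model."""
    return 0.5 * (d - x) ** 2

def U_hat(x):
    """Unbiased mini-batch estimate: (N / n) * sum of per-datum energies."""
    batch = rng.choice(data, size=n, replace=False)
    return (N / n) * U_i(x, batch).sum()

# Partition the energy range into m subregions of width du; J(x) is the index
# of the subregion the *estimated* energy falls into, hence itself random.
u_min, du, m = 0.0, 1_000.0, 50
def J(x):
    return int(np.clip((U_hat(x) - u_min) // du, 0, m - 1))

print(J(0.0), J(0.0))   # repeated calls can differ: mini-batch noise moves the index
```

Because J(x) is computed from a noisy energy estimate, the index can land in a neighboring subregion; shrinking that noise by enlarging n is exactly the source of the bias reduction discussed under Results.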
Highlights
  • AI safety has long been an important issue in the deep learning community
  • This paper proposes the so-called contour stochastic gradient Langevin dynamics (CSGLD) algorithm, which successfully extends the flat histogram idea to stochastic gradient Markov chain Monte Carlo (SGMCMC)
  • The performance of the algorithm was evaluated by averaging over 50 models, where the averaging estimator was used for stochastic gradient descent (SGD) and SGLD and the weighted averaging estimator was used for CSGLD
  • We have proposed CSGLD as a general scalable Monte Carlo algorithm for both simulation and optimization tasks
  • CSGLD automatically adjusts the invariant distribution during simulations to facilitate escaping from local traps and traversing over the entire energy landscape
  • We proved a stability condition for the mean-field system induced by CSGLD together with the convergence of its self-adapting parameter θ to a unique fixed point θ⋆ (a sketch of the CSGLD update follows this list)
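The CSGLD recursion highlighted above is compact enough to sketch. The following is our hedged reading of the two-step update, a Langevin move whose gradient is rescaled by estimated log-density-of-states differences, followed by a stochastic-approximation update of θ, run on a toy double-well energy. The partition, step sizes, and constants ζ, τ, Δu are illustrative assumptions, not the paper's settings.

```python
# Hedged sketch of the CSGLD recursion as we read it from the paper's
# description: (i) a Langevin move whose gradient is rescaled by estimated
# log-density-of-states differences, then (ii) a stochastic-approximation
# update of the subregion weights theta. Toy double-well energy; every
# constant below is an illustrative choice of ours.
import numpy as np

rng = np.random.default_rng(1)

def U(x):                 # toy full-data energy with two modes at x = -1, +1
    return (x**2 - 1.0) ** 2

def grad_U(x):
    return 4.0 * x * (x**2 - 1.0)

m, du, u_min = 100, 0.1, 0.0        # energy-space partition: m bins of width du
zeta, tau, eps = 0.75, 1.0, 1e-3    # weight exponent, temperature, learning rate
theta = np.full(m, 1.0 / m)         # self-adapting weights on the simplex

def J(x):                           # index of the subregion containing U(x)
    return int(np.clip((U(x) - u_min) // du, 1, m - 1))

x = 0.0
for k in range(1, 50_001):
    j = J(x)
    # Gradient multiplier: 1 + (zeta * tau / du) * (log theta_j - log theta_{j-1}).
    mult = 1.0 + zeta * tau * (np.log(theta[j]) - np.log(theta[j - 1])) / du
    x = x - eps * mult * grad_U(x) + np.sqrt(2.0 * eps * tau) * rng.normal()
    # Stochastic approximation: theta <- theta + omega * theta_j^zeta * (e_j - theta).
    omega = 10.0 / (k + 100.0)      # decaying step size (our schedule)
    j = J(x)
    e_j = np.zeros(m)
    e_j[j] = 1.0
    theta = theta + omega * theta[j] ** zeta * (e_j - theta)
    theta = np.clip(theta, 1e-10, None)   # numerical guard (our addition)

print(x, theta.argmax())            # final state and the most-visited energy bin
```

When θ decreases steeply across neighboring subregions the multiplier turns negative, so the dynamics can move uphill in energy; this is the mechanism the paper credits for escaping local traps.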
Results
  • The authors study the convergence of CSGLD algorithm under the framework of stochastic approximation and show the ergodicity property based on weighted averaging estimators.
  • The bias of the weighted averaging estimator decreases if one applies a larger batch size, a finer sample-space partition, a smaller learning rate ϵ, and smaller step sizes {ω_k}_{k≥0}.
  • As shown in Fig. 1(c), the estimation error of SGLD decays quite slowly and rarely converges due to the high energy barrier.
  • CSGLD first simulates importance samples and then recovers the original distribution according to the importance weights.
  • The performance of the algorithm was evaluated by averaging over 50 models, where the averaging estimator was used for SGD and SGLD and the weighted averaging estimator was used for CSGLD.
  • As shown in Table 1, SGLD outperforms the stochastic gradient descent (SGD) algorithm for most datasets due to the advantage of a sampling algorithm in obtaining more informative modes.
  • In the first set of experiments, all the algorithms utilized a fixed learning rate ϵ = 2e-7 and a fixed temperature τ = 0.01 under the Bayesian setting. SGHMC performs quite similarly to M-SGD, both obtaining around 90% accuracy in BPE and 92% in BMA.
  • Instead of simulating from π(x) directly, CSGHMC adaptively simulates from a flattened distribution ϖ_θ and adjusts the sampling bias by dynamic importance weights (a reweighting sketch follows this list).
  • CSGLD automatically adjusts the invariant distribution during simulations to facilitate escaping from local traps and traversing over the entire energy landscape.
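Because CSGLD samples from a flattened surrogate, expectations under the original π(x) must be recovered by reweighting. Below is a minimal self-normalized estimator sketch, assuming weights proportional to θ(J(x_k))^ζ as described above; the function and variable names are our own.

```python
# Hedged sketch: recover an expectation under the original pi(x) from CSGLD
# samples by self-normalized importance weighting, with weights proportional
# to theta(J(x_k)) ** zeta. Function and variable names are our own.
import numpy as np

def weighted_average(f, xs, theta_at_samples, zeta=0.75):
    """sum_k w_k f(x_k) / sum_k w_k, with w_k = theta(J(x_k)) ** zeta."""
    w = np.asarray(theta_at_samples, dtype=float) ** zeta
    fx = np.array([f(x) for x in xs])
    return float((w * fx).sum() / w.sum())

# Usage with the toy CSGLD run sketched earlier: collect (x_k, theta_k[J(x_k)])
# pairs along the trajectory, then e.g.
# est = weighted_average(lambda x: x**2, xs, theta_js)
```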
Conclusion
  • The bias of the estimator decreases as the authors employ a finer partition, a larger mini-batch size, and smaller learning rates and step sizes.
  • It is an extension of the flat histogram algorithms from the Metropolis kernel to the Langevin kernel and paves the way for future research in various dynamic importance samplers and adaptive biasing force (ABF) techniques for big data problems.
  • The authors tested CSGLD and its variants on a few examples, which show their great potential in deep learning and big data computing.
Tables
  • Table 1: Algorithm evaluation using average root-mean-square error and its standard deviation
  • Table 2: Experiments on CIFAR10 & CIFAR100 using ResNet20, where BPE and BMA are short for best point estimate and Bayesian model average, respectively
Funding
  • Liang’s research was supported in part by the grants DMS-2015498, R01-GM117597, and R01-GM126089
  • Lin acknowledges the support from NSF (DMS-1555072, DMS-1736364), BNL Subcontract 382247, W911NF-15-1-0562, and DE-SC0021142
References
  • Sungjin Ahn, Anoop Korattikara, and Max Welling. Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring. In Proc. of the International Conference on Machine Learning (ICML), 2012.
  • Christophe Andrieu and Éric Moulines. On the Ergodicity Properties of Some Adaptive MCMC Algorithms. Annals of Applied Probability, 16:1462–1505, 2006.
  • Christophe Andrieu, Éric Moulines, and Pierre Priouret. Stability of Stochastic Approximation under Verifiable Conditions. SIAM J. Control Optim., 44(1):283–312, 2005.
  • Albert Benveniste, Michael Métivier, and Pierre Priouret. Adaptive Algorithms and Stochastic Approximations. Berlin: Springer, 1990.
  • Bernd A. Berg and T. Neuhaus. Multicanonical Algorithms for First Order Phase Transitions. Physics Letters B, 267(2):249–253, 1991.
  • Changyou Chen, Nan Ding, and Lawrence Carin. On the Convergence of Stochastic Gradient MCMC Algorithms with High-order Integrators. In Advances in Neural Information Processing Systems (NeurIPS), pages 2278–2286, 2015.
  • Tianqi Chen, Emily B. Fox, and Carlos Guestrin. Stochastic Gradient Hamiltonian Monte Carlo. In Proc. of the International Conference on Machine Learning (ICML), 2014.
  • Umut Simsekli, Roland Badeau, A. Taylan Cemgil, and Gaël Richard. Stochastic Quasi-Newton Langevin Monte Carlo. In Proc. of the International Conference on Machine Learning (ICML), pages 642–651, 2016.
  • Wei Deng, Xiao Zhang, Faming Liang, and Guang Lin. An Adaptive Empirical Bayesian Method for Sparse Deep Learning. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • Wei Deng, Qi Feng, Liyao Gao, Faming Liang, and Guang Lin. Non-Convex Learning via Replica Exchange Stochastic Gradient MCMC. In Proc. of the International Conference on Machine Learning (ICML), 2020a.
  • Wei Deng, Qi Feng, Georgios Karagiannis, Guang Lin, and Faming Liang. Accelerating Convergence of Replica Exchange Stochastic Gradient MCMC via Variance Reduction. arXiv:2010.01084, 2020b.
  • Nan Ding, Youhan Fang, Ryan Babbush, Changyou Chen, Robert D. Skeel, and Hartmut Neven. Bayesian Sampling using Stochastic Gradient Thermostats. In Advances in Neural Information Processing Systems (NeurIPS), pages 3203–3211, 2014.
  • G. Fort, E. Moulines, and P. Priouret. Convergence of Adaptive and Interacting Markov Chain Monte Carlo Algorithms. Annals of Statistics, 39:3262–3289, 2011.
  • G. Fort, B. Jourdain, E. Kuhn, T. Lelièvre, and G. Stoltz. Convergence of the Wang-Landau Algorithm. Math. Comput., 84(295):2297–2327, 2015.
  • Charles J. Geyer. Markov Chain Monte Carlo Maximum Likelihood. Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, pages 156–163, 1991.
  • W.K. Hastings. Monte Carlo Sampling Methods using Markov Chains and Their Applications. Biometrika, 57:97–109, 1970.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • Jose Miguel Hernandez-Lobato and Ryan Adams. Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks. In Proc. of the International Conference on Machine Learning (ICML), volume 37, pages 1861–1869, 2015.
  • Scott Kirkpatrick, C. D. Gelatt Jr., and Mario P. Vecchi. Optimization by Simulated Annealing. Science, 220(4598):671–680, 1983.
  • T. Lelièvre, M. Rousset, and G. Stoltz. Long-time Convergence of an Adaptive Biasing Force Method. Nonlinearity, 21:1155–1181, 2008.
  • Chunyuan Li, Changyou Chen, David Carlson, and Lawrence Carin. Preconditioned Stochastic Gradient Langevin Dynamics for Deep Neural Networks. In Proc. of the National Conference on Artificial Intelligence (AAAI), pages 1788–1794, 2016.
  • Xuechen Li, Denny Wu, Lester Mackey, and Murat A. Erdogdu. Stochastic Runge-Kutta Accelerates Langevin Monte Carlo and Beyond. In Advances in Neural Information Processing Systems (NeurIPS), pages 7746–7758, 2019.
  • Faming Liang. A Generalized Wang–Landau Algorithm for Monte Carlo Computation. Journal of the American Statistical Association, 100(472):1311–1327, 2005.
  • Faming Liang. On the Use of Stochastic Approximation Monte Carlo for Monte Carlo Integration. Statistics and Probability Letters, 79:581–587, 2009.
  • Faming Liang. Trajectory Averaging for Stochastic Approximation MCMC Algorithms. The Annals of Statistics, 38:2823–2856, 2010.
  • Faming Liang, Chuanhai Liu, and Raymond J. Carroll. Stochastic Approximation in Monte Carlo Computation. Journal of the American Statistical Association, 102:305–320, 2007.
  • Yi-An Ma, Tianqi Chen, and Emily B. Fox. A Complete Recipe for Stochastic Gradient MCMC. In Advances in Neural Information Processing Systems (NeurIPS), 2015.
  • Oren Mangoubi and Nisheeth K. Vishnoi. Convex Optimization with Unbounded Nonconvex Oracles using Simulated Annealing. In Proc. of Conference on Learning Theory (COLT), 2018.
  • J.C. Mattingly, A.M. Stuart, and D.J. Higham. Ergodicity for SDEs and Approximations: Locally Lipschitz Vector Fields and Degenerate Noise. Stochastic Processes and their Applications, 101:185–232, 2002.
  • Jonathan C. Mattingly, Andrew M. Stuart, and M.V. Tretyakov. Convergence of Numerical Time-Averaging and Stationary Measures via Poisson Equations. SIAM Journal on Numerical Analysis, 48:552–577, 2010.
  • N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller. Equation of State Calculations by Fast Computing Machines. Journal of Chemical Physics, 21:1087–1091, 1953.
  • Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NeurIPS Autodiff Workshop, 2017.
  • Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex Learning via Stochastic Gradient Langevin Dynamics: a Nonasymptotic Analysis. In Proc. of Conference on Learning Theory (COLT), June 2017.
  • Herbert Robbins and Sutton Monro. A Stochastic Approximation Method. Annals of Mathematical Statistics, 22:400–407, 1951.
  • Gareth O. Roberts and Jeffrey S. Rosenthal. Coupling and Ergodicity of Adaptive Markov Chain Monte Carlo Algorithms. Journal of Applied Probability, 44:458–475, 2007.
  • Yunus Saatci and Andrew G. Wilson. Bayesian GAN. In Advances in Neural Information Processing Systems (NeurIPS), pages 3622–3631, 2017.
  • Issei Sato and Hiroshi Nakagawa. Approximation Analysis of Stochastic Gradient Langevin Dynamics by Using Fokker-Planck Equation and Ito Process. In Proc. of the International Conference on Machine Learning (ICML), 2014.
  • Robert H. Swendsen and Jian-Sheng Wang. Replica Monte Carlo Simulation of Spin-Glasses. Phys. Rev. Lett., 57:2607–2609, 1986.
  • Yee Whye Teh, Alexandre Thiéry, and Sebastian Vollmer. Consistency and Fluctuations for Stochastic Gradient Langevin Dynamics. Journal of Machine Learning Research, 17:1–33, 2016.
  • Eric Vanden-Eijnden. Introduction to Regular Perturbation Theory. Slides, 2001. URL https://cims.nyu.edu/~eve2/reg_pert.pdf.
  • Sebastian J. Vollmer, Konstantinos C. Zygalakis, and Yee Whye Teh. Exploration of the (Non-)Asymptotic Bias and Variance of Stochastic Gradient Langevin Dynamics. Journal of Machine Learning Research, 17(159):1–48, 2016.
  • Fugao Wang and D. P. Landau. Efficient, Multiple-range Random Walk Algorithm to Calculate the Density of States. Physical Review Letters, 86(10):2050–2053, 2001.
  • Max Welling and Yee Whye Teh. Bayesian Learning via Stochastic Gradient Langevin Dynamics. In Proc. of the International Conference on Machine Learning (ICML), pages 681–688, 2011.
  • Pan Xu, Jinghui Chen, Difan Zou, and Quanquan Gu. Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
  • Mao Ye, Tongzheng Ren, and Qiang Liu. Stein Self-Repulsive Dynamics: Benefits From Past Samples. arXiv:2002.09070v1, 2020.
  • Ruqi Zhang, Chunyuan Li, Jianyi Zhang, Changyou Chen, and Andrew Gordon Wilson. Cyclical Stochastic Gradient MCMC for Bayesian Deep Learning. In Proc. of the International Conference on Learning Representations (ICLR), 2020.
  • Yuchen Zhang, Percy Liang, and Moses Charikar. A Hitting Time Analysis of Stochastic Gradient Langevin Dynamics. In Proc. of Conference on Learning Theory (COLT), pages 1980–2022, 2017.
  • Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random Erasing Data Augmentation. arXiv e-prints, 2017.
Stochastic Approximation
Stochastic approximation [Benveniste et al., 1990] provides a standard framework for the development of adaptive algorithms. Given a random field function H(θ, x), the goal of the stochastic approximation algorithm is to find the solution to the mean-field equation h(θ) = 0, i.e., solving

h(θ) = ∫_X H(θ, x) ϖ_θ(dx) = 0.
Each iteration of the algorithm proceeds in two steps: (1) draw x_{k+1} ∼ Π_{θ_k}(x_k, ·), where Π_{θ_k}(x_k, ·) is a transition kernel that admits ϖ_{θ_k}(x) as the invariant distribution; (2) update θ_{k+1} = θ_k + ω_{k+1} H(θ_k, x_{k+1}) + ω²_{k+1} ρ(θ_k, x_{k+1}), where ρ(·, ·) denotes a bias term.
The algorithm differs from the Robbins–Monro algorithm [Robbins and Monro, 1951] in that x is simulated from a transition kernel Π_{θ_k}(·, ·) instead of the exact distribution ϖ_{θ_k}(·). As a result, a Markov state-dependent noise H(θ_k, x_{k+1}) − h(θ_k) is generated, which requires some regularity conditions to control the fluctuation ∑_k Π^k_θ(H(θ, x) − h(θ)). Moreover, it supports a more general form in which a bounded bias term ρ(·, ·) is allowed without affecting the theoretical properties of the algorithm. A toy instance follows.
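As a concrete illustration of this scheme, the sketch below runs a Robbins–Monro-style recursion driven by a Markov kernel rather than exact samples. The kernel, random field, and step sizes are our own toy choices, and for brevity the kernel here does not depend on θ.

```python
# Toy stochastic approximation driven by a Markov kernel (our own example,
# not from the paper): solve h(theta) = E_{x ~ pi}[x - theta] = 0, where x
# comes from a Langevin-type kernel targeting N(3, 1) rather than exact draws.
import numpy as np

rng = np.random.default_rng(2)

def kernel_step(x, eps=0.05):
    """One transition whose invariant law is (approximately) N(3, 1)."""
    return x - eps * (x - 3.0) + np.sqrt(2.0 * eps) * rng.normal()

def H(theta, x):
    """Random field; its mean under the invariant law is h(theta) = 3 - theta."""
    return x - theta

x, theta = 0.0, 0.0
for k in range(1, 20_001):
    x = kernel_step(x)                  # (1) draw x_{k+1} ~ Pi(x_k, .)
    theta += (1.0 / k) * H(theta, x)    # (2) theta_{k+1} = theta_k + omega_{k+1} H(...)

print(theta)   # approaches 3.0, the root of h(theta) = 0
```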
To solve the ODE system with small disturbances, we consider standard techniques in perturbation theory. According to the fundamental theorem of perturbation theory [Vanden-Eijnden, 2001], we can obtain the solution to the mean-field equation h(θ) = 0:

θ^(i) = θ⋆^(i) + ε β_i(θ⋆) + O(ε²), i = 1, 2, ..., m. (25)
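For completeness, here is the standard one-step expansion behind this formula, written in our own notation under the usual assumptions (h smooth, with invertible Jacobian at the unperturbed root):

```latex
% Hedged sketch of the standard first-order perturbation step (our notation):
% solve h(theta) + eps * g(theta) = 0 by expanding around the unperturbed root.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
Write $\theta = \theta_\star + \varepsilon\,\beta(\theta_\star) + O(\varepsilon^2)$,
where $h(\theta_\star) = 0$. A Taylor expansion of the perturbed equation
$h(\theta) + \varepsilon\, g(\theta) = 0$ gives
\[
0 = h(\theta_\star) + \varepsilon\,\nabla h(\theta_\star)\,\beta(\theta_\star)
  + \varepsilon\, g(\theta_\star) + O(\varepsilon^2),
\]
and since $h(\theta_\star) = 0$, matching the first-order terms yields
\[
\beta(\theta_\star) = -\nabla h(\theta_\star)^{-1}\, g(\theta_\star),
\]
whose components $\beta_i(\theta_\star)$ are the corrections appearing in (25).
\end{document}
```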
Author
Wei Deng