A Bayesian Perspective on Training Speed and Model Selection

NeurIPS 2020

Abstract

We take a Bayesian perspective to illustrate a connection between training speed and the marginal likelihood in linear models. This provides two major insights: first, that a measure of a model's training speed can be used to estimate its marginal likelihood. Second, that this measure, under certain conditions, predicts the relative weight assigned to each model by an optimal linear model combination.

Introduction
  • Choosing the right inductive bias for a machine learning model, such as convolutional structure for an image dataset, is critical for good generalization.
  • Leveraging the fact that gradient descent can produce exact posterior samples for linear models [31] and for the infinite-width limit of deep neural networks [7, 26], the authors show that their marginal likelihood estimator can be viewed as the sum of a subset of the model's training losses collected during an iterative optimization procedure (a minimal illustration follows below).
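The connection is easiest to see in exact Bayesian linear regression, where the chain rule gives log p(D) = Σ_i log p(d_i | d_{<i}): the log evidence is a sum of one-step-ahead prediction scores, i.e. a measure of how quickly the model learns the data. The sketch below computes this decomposition directly; it is a minimal illustration of the idea rather than the paper's Algorithm 1 (which lower-bounds each term using posterior samples along the optimization trajectory), and the prior scale `alpha`, noise variance `sigma2`, and function name are assumptions made for the example.

```python
import numpy as np
from scipy.stats import norm

def log_marginal_likelihood_prequential(X, y, alpha=1.0, sigma2=1.0):
    """Chain-rule decomposition log p(y | X) = sum_i log p(y_i | y_{<i})
    for Bayesian linear regression with prior w ~ N(0, alpha * I) and Gaussian
    observation noise of variance sigma2. Each summand is the log predictive
    density of the next point given the points seen so far."""
    n, d = X.shape
    precision = np.eye(d) / alpha      # posterior precision, starts at the prior
    b = np.zeros(d)                    # precision-weighted mean accumulator
    total = 0.0
    for i in range(n):
        x, t = X[i], y[i]
        cov = np.linalg.inv(precision)
        mean = cov @ b
        # predictive distribution for the next observation
        pred_mean = x @ mean
        pred_var = x @ cov @ x + sigma2
        total += norm.logpdf(t, loc=pred_mean, scale=np.sqrt(pred_var))
        # online Bayesian update with the newly observed point
        precision += np.outer(x, x) / sigma2
        b += x * t / sigma2
    return total
```

The total is the same for any ordering of the data, but each individual term measures how well the model predicts the next point given what it has seen so far, which is the "training speed" reading of the evidence.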
Highlights
  • Choosing the right inductive bias for a machine learning model, such as convolutional structure for an image dataset, is critical for good generalization
  • In Section 4.2 (Training Speed, Ensemble Weight, and Generalization in Deep Neural Networks), we address our conjectures from Section 3, which aim to generalize our results for linear models to deep neural networks (DNNs) trained with stochastic gradient descent (SGD)
  • We have proposed a family of estimators of the marginal likelihood which illustrate the connection between training speed and Bayesian model selection
  • Because gradient descent can produce exact posterior samples in linear models, our result shows that Bayesian model selection can be done by training a linear model with gradient descent and tracking how quickly it learns
  • We further highlight a connection between magnitude-based pruning and model selection, showing that models for which our lower bound is high will be assigned more weight by an optimal linear model combination. This raises the question of whether similar mechanisms exist in finitely wide neural networks, which do not behave as linear models
  • We provide preliminary empirical evidence that the connections shown in linear models have predictive power towards explaining generalization and training dynamics in DNNs, suggesting a promising avenue for future work
Results
  • The iterative training procedure described in Algorithm 1 yields a lower bound on the marginal likelihood of the corresponding Gaussian process (the infinite-width limit of the network), using losses sampled from the optimization trajectory of the neural network.
  • This term provides a bound on the rate of convergence of gradient descent, whereas the notion of training speed is more closely related to sample complexity and makes the connection to the marginal likelihood more explicit.
  • Section 3 focused on two key ideas: that training statistics can be used as an estimator for a Bayesian model’s marginal likelihood, and that gradient descent on a linear ensemble implicitly arrives at the same ranking as this estimator in the infinite-sample, infinite-training-time limit.
  • The authors compare the ranking given by the true log marginal likelihood, the estimated L(D), and the weight assigned to each model by the trained linear regressor.
  • The authors find that the rankings given by the true log marginal likelihood, by its lower bound, and by concurrent optimization all agree on the best model in all three of the model selection problems outlined previously, while the prior- and posterior-sampling baselines do not rank models consistently with the log ML.
  • In Section 4.2 (Training Speed, Ensemble Weight, and Generalization in DNNs), the authors address the conjectures from Section 3, which aim to generalize the results for linear models to deep neural networks trained with SGD.
  • Recall that the hypothesis involves translating iterative posterior samples to minibatch training losses over an SGD trajectory, and Bayesian model evidence to generalization error; the authors conjectured that, just as the sum of the log posterior likelihoods is useful for Bayesian model selection, the sum of minibatch training losses will be useful for predicting generalization error.
  • In Section 4.2.1 (Linear Combination of DNN Architectures), the authors first evaluate whether the sum over training losses (SOTL) obtained over an SGD trajectory correlates with a model's generalization error, and whether SOTL predicts the weight assigned to a model by a linear ensemble (a rough sketch of both quantities follows below).
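A rough sketch of the two quantities compared in this subsection is given below. It assumes `minibatch_losses` is an array of losses recorded along one model's SGD trajectory and `model_preds` stacks each candidate model's predictions on the same points; the function names are illustrative, not the authors' code.

```python
import numpy as np

def sum_over_training_losses(minibatch_losses):
    """SOTL: sum of minibatch training losses recorded along an SGD trajectory.
    Lower SOTL corresponds to faster training and, in the linear-model analysis,
    to a larger estimated model evidence."""
    return float(np.sum(minibatch_losses))

def linear_ensemble_weights(model_preds, targets):
    """Least-squares weights of a linear combination of model predictions.
    model_preds has shape (n_models, n_points); returns one weight per model."""
    P = np.asarray(model_preds).T                  # (n_points, n_models)
    weights, *_ = np.linalg.lstsq(P, targets, rcond=None)
    return weights
```

The conjecture tested here is that models with lower SOTL tend both to receive larger ensemble weights and to achieve lower generalization error.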
Conclusion
  • The authors have proposed a family of estimators of the marginal likelihood which illustrate the connection between training speed and Bayesian model selection.
  • The authors provide preliminary empirical evidence that the connections shown in linear models have predictive power towards explaining generalization and training dynamics in DNNs, suggesting a promising avenue for future work.
Funding
  • Lisa Schut was supported by Accenture Labs and the Alan Turing Institute.
References
  • Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.
  • D. Basu. On statistics independent of a complete sufficient statistic. Sankhyā: The Indian Journal of Statistics, 15(4):377–380, 1955. URL http://www.jstor.org/stable/25048259.
  • Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118, 2018.
  • David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
  • Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In International Conference on Machine Learning, pages 1613–1622, 2015.
  • Andreas Damianou and Neil Lawrence. Deep Gaussian processes. In Proceedings of Machine Learning Research, volume 31, pages 207–215, Scottsdale, Arizona, USA, 2013. PMLR. URL http://proceedings.mlr.press/v31/damianou13a.html.
  • Alexander G. de G. Matthews, Jiri Hron, Mark Rowland, Richard E. Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1-nGgWC-.
  • Vincent Dutordoir, Mark van der Wilk, Artem Artemev, and James Hensman. Bayesian image classification with deep convolutional Gaussian processes. In Proceedings of Machine Learning Research, volume 108, pages 1529–1539, 2020. PMLR. URL http://proceedings.mlr.press/v108/dutordoir20a.html.
  • David Duvenaud, Dougal Maclaurin, and Ryan Adams. Early stopping as nonparametric variational inference. In Artificial Intelligence and Statistics, pages 1070–1077, 2016.
  • Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.
  • Gintare Karolina Dziugaite and Daniel M. Roy. Data-dependent PAC-Bayes priors via differential privacy. In Advances in Neural Information Processing Systems 31, pages 8430–8441, 2018.
  • Stanislav Fort, Paweł Krzysztof Nowak, Stanislaw Jastrzebski, and Srini Narayanan. Stiffness: A new perspective on generalization in neural networks. arXiv preprint arXiv:1901.09491, 2019.
  • Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJl-b3RcF7.
  • Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
  • Pascal Germain, Francis Bach, Alexandre Lacoste, and Simon Lacoste-Julien. PAC-Bayesian theory meets Bayesian inference. In Advances in Neural Information Processing Systems, pages 1884–1892, 2016.
  • Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems 24, pages 2348–2356. Curran Associates, Inc., 2011. URL http://papers.nips.cc/paper/4329-practical-variational-inference-for-neural-networks.pdf.
  • Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent, 2015.
  • Bobby He, Balaji Lakshminarayanan, and Yee Whye Teh. Bayesian deep ensembles via the neural tangent kernel. arXiv preprint arXiv:2007.05864, 2020.
  • Geoffrey E. Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pages 5–13, 1993.
  • Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8571–8580, 2018.
  • Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them, 2019.
  • Dimitris Kalimeris, Gal Kaplun, Preetum Nakkiran, Benjamin Edelman, Tristan Yang, Boaz Barak, and Haofeng Zhang. SGD on neural networks learns functions of increasing complexity. In Advances in Neural Information Processing Systems, pages 3491–3501, 2019.
  • Mohammad Emtiyaz Khan, Alexander Immer, Ehsan Abedi, and Maciej Korzepa. Approximate inference turns deep networks into Gaussian processes. In Advances in Neural Information Processing Systems 32, pages 3094–3104. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/8573-approximate-inference-turns-deep-networks-into-gaussian-processes.pdf.
  • S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22(1):79–86, 1951. doi: 10.1214/aoms/1177729694. URL https://doi.org/10.1214/aoms/1177729694.
  • Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413, 2017.
  • Jaehoon Lee, Jascha Sohl-Dickstein, Jeffrey Pennington, Roman Novak, Sam Schoenholz, and Yasaman Bahri. Deep neural networks as Gaussian processes. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1EA-M-0Z.
  • David J. C. MacKay. Bayesian methods for adaptive models. PhD thesis, California Institute of Technology, 1992.
  • David J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
  • Wesley J. Maddox, Pavel Izmailov, Timur Garipov, Dmitry P. Vetrov, and Andrew Gordon Wilson. A simple baseline for Bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, pages 13132–13143, 2019.
  • Stephan Mandt, Matthew D. Hoffman, and David M. Blei. Stochastic gradient descent as approximate Bayesian inference. The Journal of Machine Learning Research, 18(1):4873–4907, 2017.
  • Alexander G. de G. Matthews, Jiri Hron, Richard E. Turner, and Zoubin Ghahramani. Sample-then-optimize posterior sampling for Bayesian linear models. Neural Information Processing Systems, 2017.
  • David A. McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363, 1999.
  • Vaishnavh Nagarajan and J. Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning. In Advances in Neural Information Processing Systems 32, pages 11615–11626. Curran Associates, Inc., 2019.
  • Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. arXiv preprint arXiv:1912.02292, 2019.
  • Jeffrey Negrea, Mahdi Haghifam, Gintare Karolina Dziugaite, Ashish Khisti, and Daniel M. Roy. Information-theoretic generalization bounds for SGLD via data-dependent estimates. In Advances in Neural Information Processing Systems, pages 11015–11025, 2019.
  • Ian Osband, John Aslanides, and Albin Cassirer. Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 8617–8629, 2018.
  • Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.
  • Carl Edward Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71. Springer, 2004.
  • Carl Edward Rasmussen and Zoubin Ghahramani. Occam's razor. In Advances in Neural Information Processing Systems, pages 294–300, 2001.
  • Binxin Ru, Clare Lyle, Lisa Schut, Mark van der Wilk, and Yarin Gal. Revisiting the train loss: an efficient performance estimator for neural architecture search, 2020.
  • Samuel L. Smith and Quoc V. Le. A Bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451, 2017.
  • Samuel L. Smith and Quoc V. Le. A Bayesian perspective on generalization and stochastic gradient descent. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=BJij4yg0Z.
  • Guillermo Valle-Pérez, Chico Q. Camargo, and Ard A. Louis. Deep learning generalizes because the parameter-function map is biased towards simple functions. arXiv preprint arXiv:1805.08522, 2018.
  • M. van der Wilk, M. Bauer, S. John, and J. Hensman. Learning invariances using the marginal likelihood. arXiv preprint arXiv:1808.05563, 2018.
  • Max Welling and Yee Whye Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.
  • Andrew Gordon Wilson and Pavel Izmailov. Bayesian deep learning and a probabilistic perspective of generalization. arXiv preprint arXiv:2002.08791, 2020.
  • Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
  • 2. Sample k informative features: x_{i,j} ~ N(y_i, σ0) for all j = 1, ..., k
  • 3. Sample max(d − k, 0) noise features: x_{i,k+j} ~ N(0, σ1) for all j = 1, ..., d − k
  • 4. Concatenate the features: X_i = [x_{i,1}, ..., x_{i,d}] (a sketch of this recipe follows below)
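The numbered steps above describe the synthetic regression data used in the linear-model experiments. Below is a minimal sketch of that recipe under stated assumptions: step 1 (how the labels y_i are sampled) is missing from the extracted text, so the label distribution is a placeholder, and the function name and default scales are illustrative.

```python
import numpy as np

def make_synthetic_data(n, d, k, sigma0=1.0, sigma1=1.0, seed=None):
    """Sketch of the synthetic data recipe: k informative features centred on
    the label and d - k pure-noise features, concatenated for each point."""
    rng = np.random.default_rng(seed)
    y = rng.normal(0.0, 1.0, size=n)  # assumed label model (step 1 is not shown in the text)
    informative = rng.normal(loc=y[:, None], scale=sigma0, size=(n, k))   # step 2
    noise = rng.normal(loc=0.0, scale=sigma1, size=(n, max(d - k, 0)))    # step 3
    X = np.concatenate([informative, noise], axis=1)                      # step 4: X_i = [x_{i,1}, ..., x_{i,d}]
    return X, y
```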
Author
Lisa Schut
Robin Ru