A Bayesian Perspective on Training Speed and Model Selection
NeurIPS 2020 (2020)
We take a Bayesian perspective to illustrate a connection between training speed and the marginal likelihood in linear models. This provides two major insights: first, that a measure of a model's training speed can be used to estimate its marginal likelihood; second, that this measure, under certain conditions, predicts the relative weight assigned to each model in an optimal linear model combination.
- Choosing the right inductive bias for a machine learning model, such as convolutional structure for an image dataset, is critical for good generalization.
- Leveraging the fact that gradient descent can produce exact posterior samples for linear models and for the infinite-width limit of deep neural networks [7, 26], the authors show that the marginal-likelihood estimator can be viewed as the sum of a subset of the model's training losses in an iterative optimization procedure.
- 4.2 Training Speed, Ensemble Weight, and Generalization in Deep Neural Networks (DNNs): We address our conjectures from Section 3, which aim to generalize our results for linear models to deep neural networks trained with stochastic gradient descent (SGD)
- We have proposed a family of estimators of the marginal likelihood which illustrate the connection between training speed and Bayesian model selection
- Because gradient descent can produce exact posterior samples in linear models, our result shows that Bayesian model selection can be done by training a linear model with gradient descent and tracking how quickly it learns
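For linear models this connection can be checked exactly: the log marginal likelihood decomposes as the sum of posterior predictive log-likelihoods of each data point given those before it, which are precisely the "losses" encountered when fitting the data one point at a time. A minimal NumPy sketch, assuming a Bayesian linear regression with prior w ~ N(0, αI) and Gaussian observation noise (function names and hyperparameters are mine, not the paper's):

```python
import numpy as np

def log_ml(X, y, alpha=1.0, noise=0.1):
    """Exact log marginal likelihood of Bayesian linear regression:
    y | X ~ N(0, alpha * X X^T + noise * I)."""
    n = len(y)
    K = alpha * X @ X.T + noise * np.eye(n)
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (y @ np.linalg.solve(K, y) + logdet + n * np.log(2 * np.pi))

def sum_predictive_log_liks(X, y, alpha=1.0, noise=0.1):
    """Chain-rule decomposition sum_i log p(y_i | y_<i): the per-point
    'training losses' accumulated while fitting the data sequentially."""
    d = X.shape[1]
    total = 0.0
    for i in range(len(y)):
        Xp, yp = X[:i], y[:i]
        S = np.linalg.inv(Xp.T @ Xp / noise + np.eye(d) / alpha)  # posterior cov
        mu = S @ Xp.T @ yp / noise                                # posterior mean
        m, v = X[i] @ mu, X[i] @ S @ X[i] + noise                 # predictive moments
        total += -0.5 * (np.log(2 * np.pi * v) + (y[i] - m) ** 2 / v)
    return total
```

A model that fits each new point quickly (high predictive likelihoods early on) thus accumulates a high marginal likelihood, which is the training-speed connection in its simplest form.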
- We further highlight a connection between magnitude-based pruning and model selection, showing that models for which our lower bound is high will be assigned more weight by an optimal linear model combination. This raises the question of whether similar mechanisms exist in finitely wide neural networks, which do not behave as linear models
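The ensemble-weight claim can be illustrated with an ordinary least-squares stand-in for the gradient-descent-trained linear combination (a deliberate simplification; the setup below is mine, not the paper's experiment): a model whose predictions track the target is assigned a much larger weight than one emitting noise.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=300)
good = y + 0.1 * rng.normal(size=300)   # predictions of a well-matched model
bad = rng.normal(size=300)              # predictions of an uninformative model

# Optimal linear combination of the two models' predictions.
preds = np.stack([good, bad], axis=1)
w, *_ = np.linalg.lstsq(preds, y, rcond=None)
# w[0] lands near 1 and w[1] near 0: the better model dominates the ensemble.
```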
- We provide preliminary empirical evidence that the connections shown in linear models have predictive power towards explaining generalization and training dynamics in DNNs, suggesting a promising avenue for future work
- The iterative training procedure described in Algorithm 1 will yield a lower bound on the marginal likelihood of this GP using sampled losses from the optimization trajectory of the neural network.
- This term provides a bound on the rate of convergence of gradient descent, whereas the notion of training speed is more closely related to sample complexity and makes the connection to the marginal likelihood more explicit.
- Section 3 focused on two key ideas: that training statistics can be used as an estimator for a Bayesian model’s marginal likelihood, and that gradient descent on a linear ensemble implicitly arrives at the same ranking as this estimator in the infinite-sample, infinite-training-time limit.
- The authors compare the ranking given by the true log marginal likelihood, the estimated L(D), and the weight assigned to each model by the trained linear regressor.
- The authors find that the marginal likelihood, its lower bound, and the ranking given by concurrent optimization all agree on the best model in all three of the model selection problems outlined previously, whereas the prior- and posterior-sampling baselines do not rank models consistently with the log ML.
- Recall that the hypothesis translates iterative posterior samples to minibatch training losses over an SGD trajectory, and Bayesian model evidence to generalization error; the authors conjectured that, just as the sum of log posterior likelihoods is useful for Bayesian model selection, the sum of minibatch training losses will be useful for predicting generalization error.
- 4.2.1 Linear Combination of DNN Architectures The authors first evaluate whether the sum over training losses (SOTL) obtained over an SGD trajectory correlates with a model’s generalization error, and whether SOTL predicts the weight assigned to a model by a linear ensemble.
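SOTL is cheap to log during training: it is just the running sum of minibatch losses, recorded before each SGD update. A hedged sketch for a linear model trained on squared error (the paper evaluates DNN architectures; this function is my simplification):

```python
import numpy as np

def sotl_sgd(X, y, lr=0.01, epochs=5, batch=8, seed=0):
    """Train a linear model with minibatch SGD and return the
    sum over training losses (SOTL) along the trajectory."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    sotl = 0.0
    for _ in range(epochs):
        idx = rng.permutation(len(y))
        for s in range(0, len(y), batch):
            b = idx[s:s + batch]
            err = X[b] @ w - y[b]
            sotl += float(np.mean(err ** 2))      # minibatch loss before the step
            w -= lr * 2 * X[b].T @ err / len(b)   # SGD update on squared error
    return sotl
```

A model whose loss falls quickly accumulates a small SOTL; under the paper's conjecture, this should go hand in hand with lower generalization error and higher ensemble weight.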
- Lisa Schut was supported by Accenture Labs and the Alan Turing Institute
- Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.
- D. Basu. On statistics independent of a complete sufficient statistic. Sankhya: The Indian Journal of Statistics (1933-1960), 15(4):377–380, 1955. ISSN 0036-4452. URL http://www.jstor.org/stable/25048259.
- Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118, 2018.
- David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518):859–877, 2017.
- Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In International Conference on Machine Learning, pages 1613–1622, 2015.
- Andreas Damianou and Neil Lawrence. Deep gaussian processes. volume 31 of Proceedings of Machine Learning Research, pages 207–215, Scottsdale, Arizona, USA, 29 Apr–01 May 2013. PMLR. URL http://proceedings.mlr.press/v31/damianou13a.html.
- Alexander G. de G. Matthews, Jiri Hron, Mark Rowland, Richard E. Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1-nGgWC-.
- Vincent Dutordoir, Mark van der Wilk, Artem Artemev, and James Hensman. Bayesian image classification with deep convolutional gaussian processes. volume 108 of Proceedings of Machine Learning Research, pages 1529–1539, Online, 26–28 Aug 2020. PMLR. URL http://proceedings.mlr.press/v108/dutordoir20a.html.
- David Duvenaud, Dougal Maclaurin, and Ryan Adams. Early stopping as nonparametric variational inference. In Artificial Intelligence and Statistics, pages 1070–1077, 2016.
- Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.
- Gintare Karolina Dziugaite and Daniel M Roy. Data-dependent PAC-Bayes priors via differential privacy. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, NeurIPS 31, pages 8430–8441. 2018.
- Stanislav Fort, Paweł Krzysztof Nowak, Stanislaw Jastrzebski, and Srini Narayanan. Stiffness: A new perspective on generalization in neural networks. arXiv preprint arXiv:1901.09491, 2019.
- Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJl-b3RcF7.
- Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059, 2016.
- Pascal Germain, Francis Bach, Alexandre Lacoste, and Simon Lacoste-Julien. PAC-Bayesian theory meets Bayesian inference. In Advances in Neural Information Processing Systems, pages 1884–1892, 2016.
- Alex Graves. Practical variational inference for neural networks. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2348–2356. Curran Associates, Inc., 2011. URL http://papers.nips.cc/paper/4329-practical-variational-inference-for-neural-networks.pdf.
- Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent, 2015.
- Bobby He, Balaji Lakshminarayanan, and Yee Whye Teh. Bayesian deep ensembles via the neural tangent kernel. arXiv preprint arXiv:2007.05864, 2020.
- Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational learning theory, pages 5–13, 1993.
- Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571–8580, 2018.
- Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them, 2019.
- Dimitris Kalimeris, Gal Kaplun, Preetum Nakkiran, Benjamin Edelman, Tristan Yang, Boaz Barak, and Haofeng Zhang. Sgd on neural networks learns functions of increasing complexity. In Advances in Neural Information Processing Systems, pages 3491–3501, 2019.
- Mohammad Emtiyaz E Khan, Alexander Immer, Ehsan Abedi, and Maciej Korzepa. Approximate inference turns deep networks into gaussian processes. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’ Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 3094– 3104. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/8573-approximate-inference-turns-deep-networks-into-gaussian-processes.pdf.
- S. Kullback and R. A. Leibler. On information and sufficiency. Ann. Math. Statist., 22(1): 79–86, 03 1951. doi: 10.1214/aoms/1177729694. URL https://doi.org/10.1214/aoms/1177729694.
- Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in neural information processing systems, pages 6402–6413, 2017.
- Jaehoon Lee, Jascha Sohl-dickstein, Jeffrey Pennington, Roman Novak, Sam Schoenholz, and Yasaman Bahri. Deep neural networks as gaussian processes. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1EA-M-0Z.
- David JC MacKay. Bayesian methods for adaptive models. PhD thesis, California Institute of Technology, 1992.
- David JC MacKay. Information theory, inference and learning algorithms. Cambridge university press, 2003.
- Wesley J Maddox, Pavel Izmailov, Timur Garipov, Dmitry P Vetrov, and Andrew Gordon Wilson. A simple baseline for bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, pages 13132–13143, 2019.
- Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate bayesian inference. The Journal of Machine Learning Research, 18(1):4873–4907, 2017.
- Alexander G de G Matthews, Jiri Hron, Richard E Turner, and Zoubin Ghahramani. Sample-then-optimize posterior sampling for bayesian linear models. Neural Information Processing Systems, 2017.
- David A. McAllester. Some PAC-Bayesian Theorems. Machine Learning, 37(3):355–363, 1999.
- Vaishnavh Nagarajan and J. Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’ Alche-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 11615–11626. Curran Associates, Inc., 2019.
- Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. arXiv preprint arXiv:1912.02292, 2019.
- Jeffrey Negrea, Mahdi Haghifam, Gintare Karolina Dziugaite, Ashish Khisti, and Daniel M Roy. Information-theoretic generalization bounds for sgld via data-dependent estimates. In Advances in Neural Information Processing Systems, pages 11015–11025, 2019.
- Ian Osband, John Aslanides, and Albin Cassirer. Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 8617–8629, 2018.
- Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in neural information processing systems, pages 1177–1184, 2008.
- Carl Edward Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71. Springer, 2004.
- Carl Edward Rasmussen and Zoubin Ghahramani. Occam’s razor. In Advances in neural information processing systems, pages 294–300, 2001.
- Binxin Ru, Clare Lyle, Lisa Schut, Mark van der Wilk, and Yarin Gal. Revisiting the train loss: an efficient performance estimator for neural architecture search, 2020.
- Samuel L Smith and Quoc V Le. A bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451, 2017.
- Samuel L. Smith and Quoc V. Le. A bayesian perspective on generalization and stochastic gradient descent. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=BJij4yg0Z.
- Guillermo Valle-Pérez, Chico Q Camargo, and Ard A Louis. Deep learning generalizes because the parameter-function map is biased towards simple functions. arXiv preprint arXiv:1805.08522, 2018.
- M. van der Wilk, M. Bauer, S. John, and J. Hensman. Learning invariances using the marginal likelihood. arXiv preprint arXiv:1808.05563, 2018.
- Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 681–688, 2011.
- Andrew Gordon Wilson and Pavel Izmailov. Bayesian deep learning and a probabilistic perspective of generalization. arXiv preprint arXiv:2002.08791, 2020.
- Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
- 2. Sample k informative features: x_{i,j} ∼ N(y_i, σ_0) for all j ∈ {1, …, k}
- 3. Sample max(d − k, 0) noise features: x_{i,k+j} ∼ N(0, σ_1) for all j ∈ {1, …, d − k}
- 4. Concatenate the features: X_i = [x_{i,1}, …, x_{i,d}]
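These steps translate directly to code. Step 1 (how the labels y_i are drawn) is omitted from this excerpt, so the binary-label choice below is an assumption, as are the function and parameter names:

```python
import numpy as np

def make_dataset(n, d, k, sigma0=1.0, sigma1=1.0, seed=0):
    """Synthetic dataset per the steps above: k informative columns
    centered on the label, d - k pure-noise columns."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n).astype(float)                 # assumed step 1
    informative = rng.normal(y[:, None], sigma0, size=(n, k))    # step 2
    noise = rng.normal(0.0, sigma1, size=(n, max(d - k, 0)))     # step 3
    X = np.concatenate([informative, noise], axis=1)             # step 4
    return X, y
```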