# Deep Neural Networks as Gaussian Processes

ICLR 2018 (arXiv:1711.00165).

Abstract:

A deep fully-connected neural network with an i.i.d. prior over its parameters is equivalent to a Gaussian process (GP) in the limit of infinite network width. This correspondence enables exact Bayesian inference for neural networks on regression tasks by means of straightforward matrix computations. For single hidden-layer networks, the …

Introduction

- Deep neural networks have emerged in recent years as flexible parametric models which can fit complex patterns in data.
- Gaussian processes have long served as a traditional nonparametric tool for modeling.
- An equivalence between these two approaches was derived by Neal (1994a) for the case of single-hidden-layer networks in the limit of infinite width.
- In the case of single hidden-layer networks, the form of the kernel of this GP is well known (Neal (1994a); Williams (1997))
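As a reference point, the single-hidden-layer kernel cited above can be written as follows. The notation follows the paper's σw², σb² parameterization; the exact form of the base kernel K⁰ (inputs scaled by 1/d_in) is an assumption about the input-layer prior:

```latex
% GP kernel of a one-hidden-layer network with nonlinearity \phi,
% in the infinite-width limit
K^{1}(x, x') = \sigma_b^2 + \sigma_w^2\,
    \mathbb{E}_{z \sim \mathcal{N}(0,\, K^{0})}\!\big[\phi(z(x))\,\phi(z(x'))\big],
\qquad
K^{0}(x, x') = \sigma_b^2 + \sigma_w^2\,\frac{x \cdot x'}{d_{\mathrm{in}}}
```

The expectation is over the bivariate Gaussian with covariance given by K⁰ evaluated at (x, x'); iterating this map layer by layer gives the deep kernel.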

Highlights

- Deep neural networks have emerged in recent years as flexible parametric models which can fit complex patterns in data
- Our experiments reveal that the best Neural Network GP (NNGP) performance is consistently competitive with that of neural networks (NNs) trained with gradient-based techniques, and the best NNGP setting, chosen across hyperparameters, often surpasses that of conventional training (Section 3, Table 1)
- The baseline neural network is a fully-connected network with identical width at each hidden layer
- Future work may involve evaluating the NNGP on a cross entropy loss using the approach in (Williams & Barber, 1998; Rasmussen & Williams, 2006)
- Use of a Gaussian process (GP) prior on functions enables exact Bayesian inference for regression from matrix computations, and we are able to obtain predictions and uncertainty estimates from deep neural networks without stochastic gradient-based training
- Further investigation is needed to determine if SGD does approximately implement Bayesian inference under the conditions typically employed in practice
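The highlighted claim that GP inference reduces to matrix computations can be made concrete with a short sketch. This is generic GP regression posterior algebra (not code from the paper); the function name and the `noise_var` hyperparameter are illustrative assumptions:

```python
import numpy as np

def gp_regression(K_train, K_cross, K_test_diag, y_train, noise_var=1e-2):
    """Exact GP posterior mean and variance for regression.

    K_train:     (n, n) kernel matrix between training inputs
    K_cross:     (m, n) kernel matrix between test and training inputs
    K_test_diag: (m,)   kernel diagonal for the test inputs
    """
    n = K_train.shape[0]
    # Cholesky of the regularized training kernel; noise_var plays the
    # role of observation noise (an assumed hyperparameter).
    L = np.linalg.cholesky(K_train + noise_var * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_cross @ alpha                    # posterior mean
    v = np.linalg.solve(L, K_cross.T)
    var = K_test_diag - np.sum(v**2, axis=0)  # posterior variance
    return mean, var
```

The same routine applies unchanged to the NNGP: all depth and nonlinearity information enters only through the kernel matrices.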

Results

- 3.1 DESCRIPTION

The authors compare NNGPs with SGD-trained neural networks on the permutation-invariant MNIST and CIFAR-10 datasets.
- Training uses the mean squared error (MSE) loss, chosen to allow direct comparison to GP predictions.
- It would be interesting to incorporate dropout into the NNGP covariance matrix using an approach like that of Schoenholz et al. (2017).
- The authors constructed the covariance kernel numerically for ReLU and Tanh nonlinearities following the method described in Section 2.5.
- Uncertainty: Relationship between the target MSE and the GP’s uncertainty estimate for smaller training set size is shown in Figure 8.
- (Figure: GP output variance versus target MSE, CIFAR, 1k training points.)
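The kernel construction described above has a closed form in the ReLU case, via the arc-cosine expression of Cho & Saul (2009), which allows a compact sketch of the layer-by-layer recursion without the paper's numerical integration. The function name and default σw², σb² values are illustrative, not from the paper:

```python
import numpy as np

def nngp_kernel_relu(X, depth, sigma_w2=1.6, sigma_b2=0.1):
    """NNGP kernel matrix for a depth-layer ReLU network.

    Uses the closed-form ReLU expectation (arc-cosine kernel of
    Cho & Saul, 2009) in each layer-to-layer step.
    """
    d_in = X.shape[1]
    # Base case: kernel of the affine input layer.
    K = sigma_b2 + sigma_w2 * (X @ X.T) / d_in
    for _ in range(depth):
        diag = np.sqrt(np.diag(K))
        outer = np.outer(diag, diag)
        # Angle between pre-activations; clip guards against rounding.
        cos_theta = np.clip(K / outer, -1.0, 1.0)
        theta = np.arccos(cos_theta)
        K = sigma_b2 + (sigma_w2 / (2 * np.pi)) * outer * (
            np.sin(theta) + (np.pi - theta) * np.cos(theta))
    return K
```

Feeding the resulting matrix into a standard GP regression routine then yields the NNGP predictions and uncertainty estimates.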

Conclusion

By harnessing the limit of infinite width, the authors have specified a correspondence between priors on deep neural networks and Gaussian processes whose kernel function is constructed in a compositional, but fully deterministic and differentiable, manner.
- Use of a GP prior on functions enables exact Bayesian inference for regression from matrix computations, and the authors are able to obtain predictions and uncertainty estimates from deep neural networks without stochastic gradient-based training.
- (Figure, panel (a): ordered/chaotic phase diagram for Tanh as a function of weight variance σw².)
- The authors observed that the performance of the optimized neural network appears to approach that of the GP computation with increasing width.
- Whether gradient-based stochastic optimization implements an approximate Bayesian computation is an interesting question (Mandt et al., 2017).
- Further investigation is needed to determine if SGD does approximately implement Bayesian inference under the conditions typically employed in practice

Summary

## Objectives:

Note that it is possible to extend GPs to softmax classification with cross entropy loss (Williams & Barber, 1998; Rasmussen & Williams, 2006), which the authors aim to investigate in future work.

- Table 1: The NNGP often outperforms finite-width networks. Test accuracy on the MNIST and CIFAR-10 datasets. The reported NNGP results correspond to the best-performing depth, σw², and σb² values on the validation set. The traditional NN results correspond to the best-performing depth, width, and optimization hyperparameters. Best models for a given training set size are specified by (depth-width-σw²-σb²) for NNs and (depth-σw²-σb²) for GPs. More results are in Appendix Table 2.
- Table 2: Completion of Table 1. The reported NNGP results correspond to the best-performing depth, σw², and σb² values on the validation set. The traditional NN results correspond to the best-performing depth, width, and optimization hyperparameters. Best models for a given training set size are specified by (depth-width-σw²-σb²) for NNs and (depth-σw²-σb²) for GPs.

Related work

- Our work touches on aspects of GPs, Bayesian learning, and compositional kernels. The correspondence between infinite neural networks and GPs was first noted by Neal (1994a;b). Williams (1997) computes analytic GP kernels for single-hidden-layer neural networks with error-function or Gaussian nonlinearities and notes the use of the GP prior for exact Bayesian inference in regression. Duvenaud et al. (2014) discuss several routes to building deep GPs and observe the degenerate form of kernels that are composed infinitely many times, a point we return to in Section 3.2, but they do not derive the form of GP kernels as we do. Hazan & Jaakkola (2015) also discuss constructing kernels equivalent to infinitely wide deep neural networks, but their construction does not go beyond two hidden layers with nonlinearities.

Reference

- Maruan Al-Shedivat, Andrew Gordon Wilson, Yunus Saatchi, Zhiting Hu, and Eric P Xing. Learning scalable deep kernels with recurrent structure. Journal of Machine Learning Research, 18(82): 1–37, 2017.
- Thang Bui, Daniel Hernandez-Lobato, Jose Hernandez-Lobato, Yingzhen Li, and Richard Turner. Deep Gaussian processes for regression using approximate expectation propagation. In International Conference on Machine Learning, pp. 1472–1481, 2016.
- Youngmin Cho and Lawrence K Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pp. 342–350, 2009.
- Andreas Damianou and Neil Lawrence. Deep Gaussian processes. In Artificial Intelligence and Statistics, pp. 207–215, 2013.
- Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems, pp. 2253–2261, 2016.
- Alexander G. de G. Matthews, Jiri Hron, Mark Rowland, Richard E. Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1-nGgWC-.
- David Duvenaud, Oren Rippel, Ryan Adams, and Zoubin Ghahramani. Avoiding pathologies in very deep networks. In Artificial Intelligence and Statistics, pp. 202–210, 2014.
- Yarin Gal. Uncertainty in deep learning. PhD thesis, University of Cambridge, 2016.
- Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059, 2016.
- Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and D Sculley. Google vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1487–1495. ACM, 2017.
- Tamir Hazan and Tommi Jaakkola. Steps toward deep kernel methods from infinite neural networks. arXiv preprint arXiv:1508.05133, 2015.
- James Hensman and Neil D Lawrence. Nested variational compression in deep Gaussian processes. arXiv preprint arXiv:1412.1370, 2014.
- James Hensman, Nicolo Fusi, and Neil D Lawrence. Gaussian processes for big data. arXiv preprint arXiv:1309.6835, 2013.
- Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Karl Krauth, Edwin V Bonilla, Kurt Cutajar, and Maurizio Filippone. AutoGP: Exploring the capabilities and limitations of Gaussian process models. arXiv preprint arXiv:1610.05392, 2016.
- Neil D Lawrence and Andrew J Moore. Hierarchical Gaussian process latent variable models. In Proceedings of the 24th International Conference on Machine Learning, pp. 481–488. ACM, 2007.
- Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate Bayesian inference. arXiv preprint arXiv:1704.04289, 2017.
- Radford M. Neal. Priors for infinite networks (tech. rep. no. crg-tr-94-1). University of Toronto, 1994a.
- Radford M. Neal. Bayesian Learning for Neural Networks. PhD thesis, University of Toronto, Dept. of Computer Science, 1994b.
- Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances In Neural Information Processing Systems, pp. 3360–3368, 2016.
- Joaquin Quinonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.
- Carl Edward Rasmussen and Christopher KI Williams. Gaussian Processes for Machine Learning, volume 1. MIT Press, Cambridge, 2006.
- Ryan Rifkin and Aldebaro Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5(Jan):101–141, 2004.
- Ryan Rifkin, Gene Yeo, Tomaso Poggio, et al. Regularized least-squares classification. Nato Science Series Sub Series III Computer and Systems Sciences, 190:131–154, 2003.
- Samuel S Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation. ICLR, 2017.
- Christopher KI Williams. Computing with infinite networks. In Advances in Neural Information Processing Systems, pp. 295–301, 1997.
- Christopher KI Williams and David Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.
- Andrew G Wilson, Zhiting Hu, Ruslan R Salakhutdinov, and Eric P Xing. Stochastic variational deep kernel learning. In Advances in Neural Information Processing Systems, pp. 2586–2594, 2016a.
- Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pp. 370–378, 2016b.
