# How does Weight Correlation Affect Generalisation Ability of Deep Neural Networks?

NeurIPS 2020

Abstract

This paper studies the novel concept of weight correlation in deep neural networks and discusses its impact on the networks' generalisation ability. For fully-connected layers, the weight correlation is defined as the average cosine similarity between weight vectors of neurons, and for convolutional layers, the weight correlation is defined…
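The fully-connected definition above can be sketched in a few lines. This is an illustrative helper, not the authors' code; it follows the abstract's wording (average cosine similarity between the neurons' weight vectors, taken over distinct pairs):

```python
import numpy as np

def weight_correlation(W):
    """Average cosine similarity between the weight vectors (rows of W)
    of distinct neurons in a fully-connected layer."""
    # Normalise each neuron's weight vector to unit length.
    U = W / np.linalg.norm(W, axis=1, keepdims=True)
    # S[i, j] is the cosine similarity between neurons i and j.
    S = U @ U.T
    n = W.shape[0]
    # Average over the off-diagonal pairs only (the diagonal is always 1).
    return (S.sum() - n) / (n * (n - 1))
```

Two parallel weight vectors give a correlation of 1, orthogonal ones give 0, so the quantity directly measures how redundantly the neurons of a layer are wired.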

Introduction

- Evidence in neuroscience has suggested that correlation between neurons plays a key role in the encoding and computation of information in the brain (Cohen and Kohn, 2011; Kohn and Smith, 2005).
- The authors study weight correlation (WC) and provide evidence that it correlates with the generalisation ability of networks, one of the most important concepts in machine learning: it reflects how accurately a learning algorithm can predict previously unseen data.
- The key concept of generalisation ability can be quantified by the generalisation error (GE).
- The authors' observation that WC correlates positively with GE opens a new avenue for reducing the GE and improving the generalisation ability of networks.

Highlights

- Evidence in neuroscience has suggested that correlation between neurons plays a key role in the encoding and computation of information in the brain (Cohen and Kohn, 2011; Kohn and Smith, 2005)
- We found that the weight correlation descent (WCD) method improves generalisation performance, to varying degrees, compared to otherwise identical networks trained without it.
- We have introduced weight correlation and discussed its importance to the generalisation ability of neural networks
- We have injected WC into the popular PAC-Bayesian framework to derive a closed-form expression of the generalisation gap bound under a mild assumption on the weight distribution, and employed it as an explicit regulariser, weight correlation descent (WCD), to enhance generalisation performance during training.
- Incorporating weight correlation has proven to significantly improve both the complexity measure, which predicts the ranking of networks with respect to their generalisation errors, and the regularisation, which improves the generalisation performance of the trained model.
- Sampling methods for estimating high-dimensional posterior distributions, such as Markov chain Monte Carlo (MCMC) sampling, could be employed here. Another practical issue concerns further reducing the complexity of WCD: while WC is easy to compute in the forward pass, computing WCD gradients in back-propagation is more involved, which matters for deploying WCD in very large neural networks.
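The WCD idea sketched in the highlights above amounts to adding a weight-correlation penalty to the task loss and descending on the sum. The sketch below is a minimal illustration under assumed names: the penalty coefficient `lam` and the finite-difference gradient step are hypothetical, whereas the authors derive the regulariser from a PAC-Bayesian bound and would backpropagate it analytically:

```python
import numpy as np

def weight_correlation(W):
    # Average cosine similarity between distinct rows of W.
    U = W / np.linalg.norm(W, axis=1, keepdims=True)
    n = W.shape[0]
    return ((U @ U.T).sum() - n) / (n * (n - 1))

def regularised_loss(W, task_loss, lam=0.1):
    """Task loss plus a weight-correlation penalty: the WCD idea of
    pushing WC down during training to tighten the generalisation bound."""
    return task_loss(W) + lam * weight_correlation(W)

def sgd_step(W, task_loss, lam=0.1, lr=0.01, eps=1e-6):
    # Finite-difference gradient of the regularised loss, for illustration
    # only; real training would backpropagate through the WC term.
    g = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        Wp = W.copy(); Wp[idx] += eps
        Wm = W.copy(); Wm[idx] -= eps
        g[idx] = (regularised_loss(Wp, task_loss, lam)
                  - regularised_loss(Wm, task_loss, lam)) / (2 * eps)
    return W - lr * g
```

Each step trades off the task objective against decorrelating the neurons, which is exactly the tension the back-propagation cost remark above refers to: the WC term couples every pair of rows, so its gradient is denser than the task gradient.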

Methods

- The authors have conducted experiments to study the effectiveness of the new complexity measure in predicting GE (Section 6.1) and the effectiveness of exploiting WC during training to reduce GE (Section 6.2).

6.1 Complexity Measures

- Following Chatterji et al. (2020), the authors trained several networks on the CIFAR-10 and CIFAR-100 datasets to compare their complexity measure to earlier ones from the literature.
- The CIFAR-10 and CIFAR-100 datasets each consist of 3 × 32 × 32 colour images, with 50,000 training examples and 10,000 test examples.
- The quantity PSN was proposed by Bartlett et al. (2017), and SoSP by Long and Sedghi (2019).
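A complexity measure that "predicts the ranking of networks with respect to their generalisation errors" is typically scored with a rank correlation such as Kendall's τ (Kendall, 1938, in the reference list). The sketch below uses hypothetical complexity scores and GE values, not numbers from the paper:

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall rank correlation: +1 when the two sequences order all
    pairs identically, -1 when every pair is ordered oppositely."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) / 2
    return (concordant - discordant) / n_pairs

# Hypothetical complexity scores and measured generalisation errors
# for four trained networks (illustrative numbers only).
complexity = [0.8, 1.3, 2.1, 2.5]
gen_error = [0.05, 0.07, 0.11, 0.10]
print(kendall_tau(complexity, gen_error))  # → 0.6666666666666666
```

A τ close to 1 means the measure orders the trained networks almost exactly as their measured generalisation errors do; here 5 of the 6 pairs agree.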

Results

- The authors found that WCD improves generalisation performance, to varying degrees, compared to otherwise identical networks trained without it.

Conclusion

- The authors introduced weight correlation and discussed its importance to the generalisation ability of neural networks.
- Sampling methods for estimating high-dimensional posterior distributions, such as Markov chain Monte Carlo (MCMC) sampling, could be employed here.
- Another practical issue concerns further reducing the complexity of WCD: while WC is easy to compute in the forward pass, computing WCD gradients in back-propagation is more involved, which matters for deploying WCD in very large neural networks.

Tables

- Table1: Complexity Measures (Measured Quantities)
- Table2: Complexity measures for CIFAR-10
- Table3: Comparison of different models with and without WCD
- Table4: Complexity measures for CIFAR-100
- Table5: The architectures of FCN3, VGG11*, VGG16*, VGG19*

Related work

- Evaluating the generalisation performance of neural networks has been a research focus since Baum and Haussler (1989), and many generalisation bounds and complexity measures have been proposed. Bartlett (1998) highlighted the significance of the norm of the weights in predicting the generalisation error. Since then, various analysis techniques have been proposed, based either on covering numbers and Rademacher complexity (Bartlett et al., 2017; Neyshabur et al., 2018, 2015) or on PAC-Bayes-style approaches (Neyshabur et al., 2017; Arora et al., 2018; Nagarajan and Kolter, 2019a; Zhou et al., 2018). A number of recent theoretical works have shown that, for a large network with standard random initialisation, accurate models can be found by travelling a short distance in parameter space (Du et al., 2019; Allen-Zhu et al., 2019). Thus, the required distance from the initialisation may be expected to be significantly smaller than the magnitude of the weights; furthermore, there is theoretical reason to expect that, as the number of parameters increases, the distance from the initialisation falls. This has motivated works that focus on the role of the distance to initialisation, rather than the norm of the weights, in generalisation (Dziugaite and Roy, 2017; Nagarajan and Kolter, 2019b; Long and Sedghi, 2020). Recently, Chatterji et al. (2020) introduced module criticality and analysed how different modules in the network interact and influence the generalisation performance as a whole.

Funding

- GJ is supported by a University of Liverpool PhD scholarship
- SS is supported by the UK EPSRC project [EP/P020909/1], and XH is supported by the UK EPSRC projects [EP/R026173/1,EP/T026995/1]
- Both XH and SS are supported by the UK Dstl project [TCMv2]. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 956123
- LijunZ is supported by the Guangdong Science and Technology Department [Grant no. 2018B010107004], and NSFC [Grant Nos. 61761136011,61532019]

References

- Allen-Zhu, Z., Li, Y., and Song, Z. (2019). A convergence theory for deep learning via overparameterization. International Conference on Machine Learning (ICML).
- Arora, S., Ge, R., Neyshabur, B., and Zhang, Y. (2018). Stronger generalization bounds for deep nets via a compression approach. International Conference on Machine Learning (ICML).
- Bartlett, P. L. (1998). The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525–536.
- Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. (2017). Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249.
- Baum, E. B. and Haussler, D. (1989). What size net gives valid generalization? In Advances in neural information processing systems, pages 81–90.
- Chatterji, N. S., Neyshabur, B., and Sedghi, H. (2020). The intriguing role of module criticality in the generalization of deep networks. International Conference on Learning Representations (ICLR).
- Cohen, M. R. and Kohn, A. (2011). Measuring and interpreting neuronal correlations. Nature Neuroscience, 14(7):811–819.
- Du, S. S., Zhai, X., Poczos, B., and Singh, A. (2019). Gradient descent provably optimizes overparameterized neural networks. International Conference on Learning Representations (ICLR).
- Dziugaite, G. K. and Roy, D. M. (2017). Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. Conference on Uncertainty in Artificial Intelligence (UAI).
- Goldfeld, Z., Berg, E. v. d., Greenewald, K., Melnyk, I., Nguyen, N., Kingsbury, B., and Polyanskiy, Y. (2019). Estimating information flow in deep neural networks. International Conference on Machine Learning (ICML).
- Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep learning. MIT press.
- He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
- Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. (2017a). Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708.
- Huang, X., Kwiatkowska, M., Wang, S., and Wu, M. (2017b). Safety verification of deep neural networks. In Majumdar, R. and Kuncak, V., editors, Computer Aided Verification, pages 3–29, Cham. Springer International Publishing.
- Jiang, Y., Neyshabur, B., Mobahi, H., Krishnan, D., and Bengio, S. (2020). Fantastic generalization measures and where to find them. International Conference on Learning Representations (ICLR).
- Kendall, M. G. (1938). A new measure of rank correlation. Biometrika, 30(1/2):81–93.
- Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. (2016). On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836.
- Kohn, A. and Smith, M. A. (2005). Stimulus dependence of neuronal correlation in primary visual cortex of the macaque. Journal of Neuroscience, 25(14):3661–3673.
- Kolchinsky, A. and Tracey, B. (2017). Estimating mixture entropy with pairwise distances. Entropy, 19(7):361.
- Kullback, S. and Leibler, R. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22(1):79–86.
- Lombardi, D. and Pant, S. (2016). Nonparametric k-nearest-neighbor entropy estimator. Physical Review E, 93(1):013310.
- Long, P. and Sedghi, H. (2020). Generalization bounds for deep convolutional neural networks. International Conference on Learning Representations (ICLR).
- Long, P. M. and Sedghi, H. (2019). Size-free generalization bounds for convolutional neural networks. arXiv preprint arXiv:1905.12600.
- McAllester, D. A. (1999). PAC-bayesian model averaging. In Proceedings of the twelfth annual conference on Computational learning theory, pages 164–170.
- Nagarajan, V. and Kolter, J. Z. (2019a). Deterministic PAC-bayesian generalization bounds for deep networks via generalizing noise-resilience. International Conference on Learning Representations (ICLR).
- Nagarajan, V. and Kolter, J. Z. (2019b). Generalization in deep networks: The role of distance from initialization. arXiv preprint arXiv:1901.01672.
- Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. (2017). Exploring generalization in deep learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pages 5949–5958, USA. Curran Associates Inc.
- Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y., and Srebro, N. (2018). Towards understanding the role of over-parametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076.
- Neyshabur, B., Tomioka, R., and Srebro, N. (2015). Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401.
- Simonyan, K. and Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations (ICLR).
- Singh, S. and Póczos, B. (2017). Nonparanormal information estimation. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3210–3219. JMLR. org.
- Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958.
- Yao, Y., Rosasco, L., and Caponnetto, A. (2007). On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315.
- Yi, X. and Au, E. K. (2011). User scheduling for heterogeneous multiuser MIMO systems: a subspace viewpoint. IEEE Transactions on Vehicular Technology, 60(8):4004–4013.
- Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. International Conference on Learning Representations (ICLR).
- Zhou, W., Veitch, V., Austern, M., Adams, R. P., and Orbanz, P. (2018). Non-vacuous generalization bounds at the imagenet scale: a PAC-bayesian compression approach. International Conference on Learning Representations (ICLR).
