How does Weight Correlation Affect Generalisation Ability of Deep Neural Networks?

NeurIPS 2020

Abstract

This paper studies the novel concept of weight correlation in deep neural networks and discusses its impact on the networks' generalisation ability. For fully-connected layers, the weight correlation is defined as the average cosine similarity between weight vectors of neurons, and for convolutional layers, the weight correlation is defined…
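
To make the definition concrete, below is a minimal NumPy sketch of WC for a fully-connected layer, assuming the layer's weights form a matrix whose rows are the neurons' weight vectors. The function name weight_correlation_fc and the toy dimensions are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def weight_correlation_fc(W: np.ndarray) -> float:
    """Sketch of weight correlation (WC) for a fully-connected layer: the
    average cosine similarity between the rows (neuron weight vectors) of W."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    U = W / np.maximum(norms, 1e-12)   # unit-length weight vectors
    S = U @ U.T                        # pairwise cosine similarities
    n = W.shape[0]
    # Average over the n * (n - 1) off-diagonal pairs.
    return float((S.sum() - np.trace(S)) / (n * (n - 1)))

# Toy example: a layer with 8 neurons, each taking 16 inputs.
rng = np.random.default_rng(0)
print(weight_correlation_fc(rng.standard_normal((8, 16))))
```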

Introduction
  • Evidence in neuroscience has suggested that correlation between neurons plays a key role in the encoding and computation of information in the brain (Cohen and Kohn, 2011; Kohn and Smith, 2005).
  • The authors provide evidence that weight correlation (WC) correlates with the generalisation ability of networks, one of the most important concepts in machine learning, which reflects how accurately a learning algorithm can predict on previously unseen data.
  • The key concept of generalisation ability can be quantified by the generalisation error (GE).
  • The authors' observation that WC correlates positively with GE opens an exciting new avenue to reduce GE and thereby improve the generalisation ability of networks
Highlights
  • Evidence in neuroscience has suggested that correlation between neurons plays a key role in the encoding and computation of information in the brain (Cohen and Kohn, 2011; Kohn and Smith, 2005)
  • We found that the Weight Correlation Descent (WCD) method improves generalisation performance to some degree compared to neural networks trained without it
  • We have introduced weight correlation and discussed its importance to the generalisation ability of neural networks
  • We have injected WC into the popular PAC-Bayesian framework to derive a closed-form expression of the generalisation gap bound under a mild assumption on the weight distribution, and employed it as an explicit regulariser, weight correlation descent (WCD), to enhance generalisation performance during training (a hedged training-loop sketch follows this list)
  • Considering weight correlation has proven to significantly enhance both the complexity measure (which predicts the ranking of networks with respect to their generalisation errors) and the regularisation (which improves the generalisation performance of the trained model)
  • Sampling methods for the estimation of high-dimensional posterior distributions, such as Markov chain Monte Carlo (MCMC) sampling, could be employed here. Another practical issue concerns further complexity reduction of WCD: while WC is easy to compute in the forward propagation, computing WCD gradients in back-propagation is more involved, which is crucial for the potential deployment of WCD in very large neural networks
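
Since WCD is used as an explicit regulariser during training, the following PyTorch sketch adds a weight-correlation penalty to an ordinary task loss. The model, the coefficient lam, and the helper fc_weight_correlation are assumptions made for illustration; this is not the authors' released implementation of WCD.

```python
import torch
import torch.nn as nn

def fc_weight_correlation(W: torch.Tensor) -> torch.Tensor:
    """Average pairwise cosine similarity of the rows of a weight matrix
    (a sketch of WC for a fully-connected layer)."""
    U = W / W.norm(dim=1, keepdim=True).clamp_min(1e-12)
    S = U @ U.t()
    n = W.shape[0]
    return (S.sum() - S.diagonal().sum()) / (n * (n - 1))

# A tiny fully-connected model and dummy data, purely for illustration.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
lam = 0.1  # assumed regularisation strength, not a value taken from the paper

x = torch.randn(16, 32)          # dummy input batch
y = torch.randint(0, 10, (16,))  # dummy labels

opt.zero_grad()
# Task loss plus a weight-correlation penalty over the Linear layers,
# mimicking the idea of weight correlation descent as an explicit regulariser.
loss = criterion(model(x), y)
for layer in model:
    if isinstance(layer, nn.Linear):
        loss = loss + lam * fc_weight_correlation(layer.weight)
loss.backward()
opt.step()
```

A fuller version would also penalise convolutional layers, using the paper's corresponding definition of WC for those layers.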
Methods
  • The authors have conducted experiments to study the effectiveness of the new complexity measure in predicting GE (Section 6.1) and the effectiveness of exploiting WC during training to reduce GE (Section 6.2); a rank-correlation sketch follows this list.

    6.1 Complexity Measures

    Following Chatterji et al. (2020), the authors have trained several networks on the CIFAR-10 and CIFAR-100 datasets to compare the new complexity measure to earlier ones from the literature.
  • The CIFAR-10 and CIFAR-100 datasets both consist of 3 × 32 × 32 colour images, with 50,000 training examples and 10,000 test examples.
  • The quantity PSN was proposed by Bartlett et al. (2017), and SoSP by Long and Sedghi (2019)
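
The experiments above score how well each complexity measure predicts the ranking of networks by their generalisation errors; a standard statistic for such rank agreement is Kendall's tau (Kendall, 1938). The SciPy sketch below illustrates the computation on made-up numbers that do not come from the paper's tables.

```python
from scipy.stats import kendalltau

# Hypothetical complexity-measure values and generalisation errors for five
# trained networks (illustrative numbers only, not taken from the paper's tables).
complexity = [0.82, 0.55, 0.91, 0.40, 0.67]
gen_error  = [0.21, 0.15, 0.26, 0.11, 0.18]

# Kendall's tau lies in [-1, 1]; +1 means the complexity measure ranks the
# networks exactly as their generalisation errors do.
tau, p_value = kendalltau(complexity, gen_error)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")
```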
Results
  • The authors found that WCD improves generalisation performance to some degree compared to neural networks trained without it.
Conclusion
  • Conclusion and Future Work

    The authors have introduced weight correlation and discussed its importance to the generalisation ability of neural networks.
  • Sampling methods for the estimation of high-dimensional posterior distributions, such as Markov chain Monte Carlo (MCMC) sampling, could be employed here.
  • Another practical issue concerns further complexity reduction of WCD: while WC is easy to compute in the forward propagation, computing WCD gradients in back-propagation is more involved, which is crucial for the potential deployment of WCD in very large neural networks
Tables
  • Table 1: Complexity Measures (Measured Quantities)
  • Table 2: Complexity measures for CIFAR-10
  • Table 3: Comparison of different models with and without WCD
  • Table 4: Complexity measures for CIFAR-100
  • Table 5: The architectures of FCN3, VGG11*, VGG16*, VGG19*
Related work
  • Evaluating the generalisation performance of neural networks has been a research focus since Baum and Haussler (1989), and many generalisation bounds and complexity measures have been proposed since then. Bartlett (1998) highlighted the significance of the norm of the weights in predicting the generalisation error. Since then, various analysis techniques have been proposed, based either on covering numbers and Rademacher complexity (Bartlett et al., 2017; Neyshabur et al., 2018, 2015) or on PAC-Bayes-style approaches (Neyshabur et al., 2017; Arora et al., 2018; Nagarajan and Kolter, 2019a; Zhou et al., 2018). A number of recent theoretical works have shown that, for a large randomly initialised network, accurate models can be found by travelling a short distance in parameter space (Du et al., 2019; Allen-Zhu et al., 2019). Thus, the required distance from the initialisation may be expected to be significantly smaller than the magnitude of the weights. Furthermore, there is theoretical reason to expect that, as the number of parameters increases, the distance from the initialisation falls. This has motivated works that focus on the role of the distance to initialisation, rather than on the norm of the weights, in generalisation (Dziugaite and Roy, 2017; Nagarajan and Kolter, 2019b; Long and Sedghi, 2020). Recently, Chatterji et al. (2020) introduced module criticality and analysed how different modules in the network interact with each other and influence the generalisation performance as a whole.
Funding
  • GJ is supported by a University of Liverpool PhD scholarship
  • SS is supported by the UK EPSRC project [EP/P020909/1], and XH is supported by the UK EPSRC projects [EP/R026173/1,EP/T026995/1]
  • Both XH and SS are supported by the UK Dstl project [TCMv2]. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 956123
  • Lijun Zhang is supported by the Guangdong Science and Technology Department [Grant No. 2018B010107004] and NSFC [Grant Nos. 61761136011, 61532019]
Reference
  • Allen-Zhu, Z., Li, Y., and Song, Z. (2019). A convergence theory for deep learning via overparameterization. International Conference on Machine Learning (ICML).
  • Arora, S., Ge, R., Neyshabur, B., and Zhang, Y. (2018). Stronger generalization bounds for deep nets via a compression approach. International Conference on Machine Learning (ICML).
  • Bartlett, P. L. (1998). The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525–536.
  • Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. (2017). Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249.
  • Baum, E. B. and Haussler, D. (1989). What size net gives valid generalization? In Advances in neural information processing systems, pages 81–90.
  • Chatterji, N. S., Neyshabur, B., and Sedghi, H. (2020). The intriguing role of module criticality in the generalization of deep networks. International Conference on Learning Representations (ICLR).
  • Cohen, M. R. and Kohn, A. (2011). Measuring and interpreting neuronal correlations. Nature Neuroscience, 14(7):811–819.
  • Du, S. S., Zhai, X., Poczos, B., and Singh, A. (2019). Gradient descent provably optimizes overparameterized neural networks. International Conference on Learning Representations (ICLR).
  • Dziugaite, G. K. and Roy, D. M. (2017). Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. Conference on Uncertainty in Artificial Intelligence (UAI).
  • Goldfeld, Z., Berg, E. v. d., Greenewald, K., Melnyk, I., Nguyen, N., Kingsbury, B., and Polyanskiy, Y. (2019). Estimating information flow in deep neural networks. International Conference on Machine Learning (ICML).
  • Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep learning. MIT press.
  • He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
  • Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. (2017a). Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708.
  • Huang, X., Kwiatkowska, M., Wang, S., and Wu, M. (2017b). Safety verification of deep neural networks. In Majumdar, R. and Kuncak, V., editors, Computer Aided Verification, pages 3–29, Cham. Springer International Publishing.
  • Jiang, Y., Neyshabur, B., Mobahi, H., Krishnan, D., and Bengio, S. (2020). Fantastic generalization measures and where to find them. International Conference on Learning Representations (ICLR).
  • Kendall, M. G. (1938). A new measure of rank correlation. Biometrika, 30(1/2):81–93.
  • Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. (2016). On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836.
  • Kohn, A. and Smith, M. A. (2005). Stimulus dependence of neuronal correlation in primary visual cortex of the macaque. Journal of Neuroscience, 25(14):3661–3673.
  • Kolchinsky, A. and Tracey, B. (2017). Estimating mixture entropy with pairwise distances. Entropy, 19(7):361.
  • Kullback, S. and Leibler, R. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22(1):79–86.
  • Lombardi, D. and Pant, S. (2016). Nonparametric k-nearest-neighbor entropy estimator. Physical Review E, 93(1):013310.
  • Long, P. and Sedghi, H. (2020). Generalization bounds for deep convolutional neural networks. International Conference on Learning Representations (ICLR).
  • Long, P. M. and Sedghi, H. (2019). Size-free generalization bounds for convolutional neural networks. arXiv preprint arXiv:1905.12600.
  • McAllester, D. A. (1999). PAC-Bayesian model averaging. In Proceedings of the twelfth annual conference on Computational learning theory, pages 164–170.
  • Nagarajan, V. and Kolter, J. Z. (2019a). Deterministic PAC-Bayesian generalization bounds for deep networks via generalizing noise-resilience. International Conference on Learning Representations (ICLR).
  • Nagarajan, V. and Kolter, J. Z. (2019b). Generalization in deep networks: The role of distance from initialization. arXiv preprint arXiv:1901.01672.
  • Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. (2017). Exploring generalization in deep learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pages 5949–5958, USA. Curran Associates Inc.
  • Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y., and Srebro, N. (2018). Towards understanding the role of over-parametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076.
  • Neyshabur, B., Tomioka, R., and Srebro, N. (2015). Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401.
  • Simonyan, K. and Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations (ICLR).
  • Singh, S. and Póczos, B. (2017). Nonparanormal information estimation. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3210–3219. JMLR.org.
  • Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958.
  • Yao, Y., Rosasco, L., and Caponnetto, A. (2007). On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315.
  • Yi, X. and Au, E. K. (2011). User scheduling for heterogeneous multiuser MIMO systems: a subspace viewpoint. IEEE Transactions on Vehicular Technology, 60(8):4004–4013.
  • Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. International Conference on Learning Representations (ICLR).
  • Zhou, W., Veitch, V., Austern, M., Adams, R. P., and Orbanz, P. (2018). Non-vacuous generalization bounds at the ImageNet scale: a PAC-Bayesian compression approach. International Conference on Learning Representations (ICLR).
Author
Gaojie Jin
Xinping Yi
Liang Zhang