Neural Path Features and Neural Path Kernel: Understanding the role of gates in deep learning

NeurIPS 2020

Abstract

Rectified linear unit (ReLU) activations can also be thought of as gates, which either pass or stop their pre-activation input: a gate is on when its pre-activation input is positive and off when it is negative. A deep neural network (DNN) with ReLU activations has many gates…
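In symbols (our notation, not taken verbatim from the paper), the gating view of a single ReLU unit is:

```latex
\[
  \mathrm{ReLU}(q) \;=\; G(q)\,q,
  \qquad
  G(q) \;=\; \mathbb{1}\{q > 0\} \;=\;
  \begin{cases}
    1, & q > 0 \quad \text{(gate on: the pre-activation passes through)},\\
    0, & q \le 0 \quad \text{(gate off: the pre-activation is stopped)}.
  \end{cases}
\]
```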

Introduction
  • The authors consider deep neural networks (DNNs) with rectified linear unit (ReLU) activations.
  • For each input, there is a corresponding active sub-network consisting of the gates which are on (i.e., equal to 1) and the weights which pass through those gates.
  • This active sub-network can be said to hold the memory for a given input, i.e., only the weights that pass through active gates contribute to the output (see the sketch after this list).
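A minimal NumPy sketch of this gating view (our own illustration, not the authors' code; the weight names W1, W2, w3 are ours). It checks two things: ReLU(q) equals the gate times the pre-activation, and masking the weights with the gates, i.e. keeping only the active sub-network, reproduces the network output for that input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny ReLU network 4 -> 5 -> 5 -> 1, in the notation of Table 1:
# q's are pre-activations, z's are hidden-layer outputs, G's are gating values.
W1, W2, w3 = rng.normal(size=(4, 5)), rng.normal(size=(5, 5)), rng.normal(size=5)
x = rng.normal(size=4)

q1 = x @ W1
G1 = (q1 > 0).astype(float)        # layer-1 gates: 1 iff the pre-activation is positive
z1 = G1 * q1                       # identical to np.maximum(q1, 0)

q2 = z1 @ W2
G2 = (q2 > 0).astype(float)        # layer-2 gates
z2 = G2 * q2

y = z2 @ w3                        # ReLU network output

# Active sub-network for this x: zero out the incoming weights of every unit
# whose gate is off. The resulting *linear* network gives the same output on x.
y_active = ((x @ (W1 * G1)) @ (W2 * G2)) @ w3
assert np.isclose(y, y_active)
```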
Highlights
  • We consider deep neural networks (DNNs) with rectified linear unit (ReLU) activations
  • We show that in finite width DNNs with ReLU activations, neural path features (NPFs) are learnt continuously during training, and such learning is key for generalisation (the NPF/NPV decomposition is sketched after this list)
  • We describe the gradient descent dynamics, taking into account the dynamics of both the neural path value (NPV) and the NPFs
  • We studied the role of active sub-networks in deep learning by encoding the gates in the neural path features
  • We showed that the neural path features are learnt during training and such learning is key for generalisation
  • We observed that almost all information of a trained DNN is stored in the neural path features
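To make the NPF/NPV terminology concrete, here is a self-contained toy sketch (ours, not the authors' code) for a 2-3-3-1 ReLU network. Each input-to-output path has a value (the product of the weights along it, collected in the NPV) and a feature (the input coordinate where the path starts times the product of the gates it crosses, collected in the NPF), and the network output is exactly the inner product of the two.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# Toy ReLU network 2 -> 3 -> 3 -> 1.
W1, W2, w3 = rng.normal(size=(2, 3)), rng.normal(size=(3, 3)), rng.normal(size=3)

def forward(x):
    q1 = x @ W1; z1 = np.maximum(q1, 0.0)
    q2 = z1 @ W2; z2 = np.maximum(q2, 0.0)
    G1, G2 = (q1 > 0).astype(float), (q2 > 0).astype(float)   # gating values
    return z2 @ w3, (G1, G2)

x = rng.normal(size=2)
y, (G1, G2) = forward(x)

# Enumerate every path p = (input i, hidden unit j, hidden unit k).
paths = list(product(range(2), range(3), range(3)))
npv = np.array([W1[i, j] * W2[j, k] * w3[k] for i, j, k in paths])  # neural path value
npf = np.array([x[i] * G1[j] * G2[k] for i, j, k in paths])         # neural path feature

# The ReLU network output is exactly the NPF/NPV inner product.
assert np.isclose(y, npf @ npv)
```

The neural path kernel is then (up to the paper's exact normalisation) the Gram matrix of these NPF vectors, which is why learning the gates amounts to learning the kernel.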
Methods
  • 1. Finite vs infinite width alone is not enough to explain the performance gain of CNNs: both FRNPF (fixed random NPFs) and ReLU are finite width networks.
  • 2. NPF learning vs no NPF learning is key to explaining the performance gain of CNNs: FLNPF (fixed NPFs copied from a fully trained ReLU network) reaches 79.68%, which is almost as good as the ReLU network's 80.43% (the fixed-NPF setup is sketched below).
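A minimal sketch of the fixed-NPF settings, with our own function names (gates, dgn_forward): in a deep gated network the gates come from a separate, frozen feature network, while only the value network weights are trained. Random frozen feature weights correspond to FRNPF; copying the weights of a fully trained ReLU network into the feature network instead gives FLNPF.

```python
import numpy as np

rng = np.random.default_rng(1)

def gates(x, feature_Ws):
    """Gating values produced by the (frozen) feature network."""
    z, Gs = x, []
    for W in feature_Ws:
        q = z @ W
        G = (q > 0).astype(float)   # gate on iff the pre-activation is positive
        Gs.append(G)
        z = G * q                   # ReLU via the gating identity relu(q) = G * q
    return Gs

def dgn_forward(x, Gs, value_Ws, v_out):
    """Value-network forward pass with externally supplied gates."""
    z = x
    for G, V in zip(Gs, value_Ws):
        z = G * (z @ V)             # the gates decide which units fire
    return z @ v_out

x = rng.normal(size=4)
feature_Ws = [rng.normal(size=(4, 8)), rng.normal(size=(8, 8))]  # FRNPF: random and frozen
value_Ws   = [rng.normal(size=(4, 8)), rng.normal(size=(8, 8))]  # trained by gradient descent
v_out      = rng.normal(size=8)

y = dgn_forward(x, gates(x, feature_Ws), value_Ws, v_out)
```

Only value_Ws and v_out would receive gradient updates in this setting; the feature weights, and hence the NPFs, stay fixed throughout training.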
Conclusion
  • The authors studied the role of active sub-networks in deep learning by encoding the gates in the neural path features.
  • The authors showed that the neural path features are learnt during training and such learning is key for generalisation.
  • The authors observed that almost all information of a trained DNN is stored in the neural path features.
  • The authors conclude by saying that understanding deep learning requires understanding neural path feature learning.
Tables
  • Table 1: DNN with ReLU activations. Here, x ∈ R^{d_in} is the input to the DNN, y_Θ(x) is the output, the q's are pre-activation inputs, the z's are the outputs of the hidden layers, and the G's are the gating values.
  • Table 2: Generalisation performance of the different NPF learning settings. The values in the table are averaged over 5 runs. Here, FC is a fully connected network with w = 100 and d = 5; VCONV and GCONV denote the vanilla CNN and the CNN with GAP, respectively. Please check …
  • Table 3: Deep gated network with padding. Here the gating values are padded, i.e., G_{x,t}(l, kw + i) = …
  • Table 4: Memorisation network. The input is fixed and equal to 1. All the internal variables depend on the index s and the parameter Θ_t. The gating values G_s are external and independent variables.
Related work
  • Jacot et al. [2018] showed the NTK to be the central quantity in the study of the generalisation properties of infinite width DNNs. Jacot et al. [2019] identify two regimes that occur at initialisation in fully connected DNNs as the width increases to infinity, namely (i) freeze, where the (scaled) NTK converges to a constant and hence leads to slow training, and (ii) chaos, where the NTK converges to the Kronecker delta and hence hurts generalisation. Jacot et al. [2019] also suggest that for good generalisation it is important to operate DNNs at the edge of the freeze and chaos regimes. Arora et al. [2019] proposed a pure kernel method based on the infinite width CNTK (the NTK of a convolutional neural network) and showed that it outperformed state-of-the-art kernel methods by 10%. Arora et al. [2019] also noted a performance gain (about 5-6%) of CNNs over the CNTK. However, it was also noted by Arora et al. [2019] and Lee et al. [2019] that random NTFs obtained from finite width neural networks do not perform as well as their limiting infinite width counterparts. Arora et al. [2019] and Cao and Gu [2019] provided generalisation bounds in terms of the NTK norm. Du et al. [2018] use the NTK to show that over-parameterised DNNs trained by gradient descent achieve zero training error. Du and Hu [2019], Shamir [2019], and Saxe et al. [2013] studied deep linear networks. Since deep linear networks are special cases of deep gated networks, Theorem 5.1 of our paper also provides an expression for the NTK at initialisation of deep linear networks: in a deep linear network all the gates are always 1 for all input examples, and Λ_Θ is then a matrix whose entries are all w^(d-1).
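For the deep-linear-network remark at the end of this paragraph, the value w^(d-1) is simply a path count. A short derivation in our notation (a network with d layers has d-1 hidden layers of width w, and a path fixes one unit in each hidden layer):

```latex
\[
  \#\{\text{paths through a fixed input node}\}
  \;=\; \underbrace{w \cdot w \cdots w}_{d-1 \text{ hidden layers}}
  \;=\; w^{\,d-1}.
\]
```

Since every gate in a deep linear network is always 1, all of these paths are active for every input, so each entry of Λ_Θ equals w^(d-1).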
References
  • Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pages 8139–8148, 2019.
  • Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.
  • Randall Balestriero et al. A spline theory of deep learning. In International Conference on Machine Learning, pages 374–383, 2018.
  • Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Advances in Neural Information Processing Systems, pages 10835–10845, 2019.
  • Simon S Du and Wei Hu. Width provably matters in optimization for deep linear neural networks. arXiv preprint arXiv:1901.08572, 2019.
  • Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018.
  • Jonathan Fiat, Eran Malach, and Shai Shalev-Shwartz. Decoupling gating from linearity. CoRR, abs/1906.05032, 2019. URL http://arxiv.org/abs/1906.05032.
  • Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018.
  • Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571–8580, 2018.
  • Arthur Jacot, Franck Gabriel, and Clément Hongler. Freeze and chaos for dnns: an ntk view of batch normalization, checkerboard and boundary effects. arXiv preprint arXiv:1907.05715, 2019.
  • Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems, pages 8570–8581, 2019.
  • Behnam Neyshabur, Russ R Salakhutdinov, and Nati Srebro. Path-sgd: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, pages 2422–2430, 2015.
  • Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. What’s hidden in a randomly weighted neural network? arXiv preprint arXiv:1911.13299, 2019.
  • Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
  • Ohad Shamir. Exponential convergence time of gradient descent for one-dimensional deep linear neural networks. In Conference on Learning Theory, pages 2691–2713, 2019.
  • Rupesh Kumar Srivastava, Jonathan Masci, Faustino Gomez, and Jürgen Schmidhuber. Understanding locally competitive networks. arXiv preprint arXiv:1410.1165, 2014.
Experimental setup
  • To train on CIFAR-10, we used a vanilla CNN architecture denoted VCONV and a CNN architecture with global average pooling denoted GCONV. VCONV is an architecture without pooling, residual connections, dropout or batch normalisation, given by: an input layer of shape (32, 32, 3), followed by convolution layers with a stride of (3, 3) and channels 64, 64, 128, 128, followed by flattening into a layer with 256 hidden units, followed by a fully connected layer with 256 units, and finally a width-10 soft-max layer that produces the final predictions. GCONV is the same as VCONV with a global average pooling (GAP) layer at the boundary between the convolutional and fully connected layers (sketched below).
  • In the DNPFL case, we let χ^F = χ^r and G_{x,t}(l) = γ_sr(q^F_{x,t}(l)).
  • We use the vanilla SGD optimiser.
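A hedged PyTorch sketch of the VCONV/GCONV pair described above. The description specifies only the stride (3, 3), the channel widths and the dense-layer widths, so kernel_size=3, padding=1, the ReLU activations between dense layers, and the name make_cnn are our assumptions; the flatten width is inferred lazily rather than computed by hand.

```python
import torch
import torch.nn as nn

def make_cnn(gap: bool) -> nn.Sequential:
    """Sketch of VCONV (gap=False) and GCONV (gap=True)."""
    layers, in_ch = [], 3
    for out_ch in (64, 64, 128, 128):
        # Stride (3, 3) as in the description; kernel size and padding are assumed.
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=3, padding=1), nn.ReLU()]
        in_ch = out_ch
    if gap:
        layers.append(nn.AdaptiveAvgPool2d(1))   # GCONV: GAP at the conv/dense boundary
    layers += [
        nn.Flatten(),
        nn.LazyLinear(256), nn.ReLU(),           # flatten width inferred at first forward pass
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, 10),                      # 10-way logits; soft-max lives in the loss
    ]
    return nn.Sequential(*layers)

vconv, gconv = make_cnn(gap=False), make_cnn(gap=True)
x = torch.randn(2, 3, 32, 32)                    # a CIFAR-10-shaped batch
print(vconv(x).shape, gconv(x).shape)            # both: torch.Size([2, 10])
```

Adding the GAP layer is the only difference between GCONV and VCONV in the description above.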