# Stochastic Training of Graph Convolutional Networks

In *International Conference on Machine Learning (ICML)*, 2018. arXiv: abs/1710.10568.

Abstract:

Graph convolutional networks (GCNs) are powerful deep neural networks for graph-structured data. However, GCN computes nodes' representations recursively from their neighbors, making the receptive field size grow exponentially with the number of layers. Previous attempts on reducing the receptive field size by subsampling neighbors do ...

Introduction

- Graph convolution networks (GCNs) (Kipf & Welling, 2017) generalize convolutional neural networks (CNNs) (LeCun et al., 1995) to graph-structured data.
- By stacking multiple graph convolution layers, GCNs can learn nodes’ representations by utilizing information from distant neighbors.
- The recursive graph convolution operation makes it difficult to train GCNs efficiently.
- Because of the large receptive field size, Kipf & Welling (2017) proposed training GCN with a batch algorithm, which computes the representations of all the nodes at once.
- Batch algorithms cannot handle large-scale datasets because of their slow convergence and the requirement to fit the entire dataset in GPU memory.
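The receptive-field problem above can be illustrated with a short sketch (our own toy example, not the authors' code): computing one node's layer-L activation requires its entire L-hop neighborhood, so the number of nodes touched grows roughly as degree^L.

```python
import numpy as np

# Toy illustration: on a random graph, count how many nodes an L-layer
# GCN must touch to compute a single node's representation. Each layer
# pulls in all neighbors, so the receptive field is the full L-hop
# neighborhood of the node.
rng = np.random.default_rng(0)
n, avg_deg = 1000, 5
# random adjacency lists: avg_deg distinct neighbors per node
neighbors = [rng.choice(n, size=avg_deg, replace=False) for _ in range(n)]

def receptive_field(v, num_layers):
    """Set of nodes needed to compute node v's layer-`num_layers` activation."""
    field = {v}
    frontier = {v}
    for _ in range(num_layers):
        frontier = {u for w in frontier for u in neighbors[w]}
        field |= frontier
    return field

for L in (1, 2, 3):
    print(L, len(receptive_field(0, L)))  # grows roughly as avg_deg**L
```

This is exactly why the batch algorithm must keep essentially the whole graph in memory once a few layers are stacked.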

Highlights

- Graph convolution networks (GCNs) (Kipf & Welling, 2017) generalize convolutional neural networks (CNNs) (LeCun et al., 1995) to graph-structured data
- GCNs have been applied to semi-supervised node classification (Kipf & Welling, 2017), inductive node embedding (Hamilton et al., 2017a), link prediction (Kipf & Welling, 2016; Berg et al., 2017) and knowledge graphs (Schlichtkrull et al., 2017), outperforming multi-layer perceptron (MLP) models that do not use the graph structure and graph embedding approaches (Perozzi et al., 2014; Tang et al., 2015; Grover & Leskovec, 2016) that do not use node features
- We develop novel stochastic training algorithms for GCNs in which the neighbor sampling size D(l) can be as low as two, so that the time complexity of training GCNs is comparable with training MLPs
- We present a preprocessing strategy and two control variate based algorithms to reduce the receptive field size
- Our algorithms achieve convergence speed comparable to the exact algorithm even when the neighbor sampling size D(l) = 2, so that the per-epoch cost of training GCNs is comparable with training MLPs
- We present strong theoretical guarantees, including exact prediction and convergence to GCN’s local optimum, for our control variate based algorithm
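The neighbor-sampling idea referenced above can be sketched as follows (an illustrative toy, not the authors' implementation): instead of the dense propagation P @ H, each node averages over D randomly sampled neighbors, rescaled so the estimate stays unbiased.

```python
import numpy as np

# Illustrative sketch of neighbor sampling: approximate the dense
# propagation P @ H by a Monte Carlo average over D sampled neighbors
# per node, rescaled by |n(v)| / D to keep the estimator unbiased.
rng = np.random.default_rng(0)
n, d, D = 100, 8, 2

A = (rng.random((n, n)) < 0.1).astype(float)
np.fill_diagonal(A, 1.0)                      # add self-loops
P = A / A.sum(1, keepdims=True)               # row-normalized propagation matrix
H = rng.standard_normal((n, d))               # node activations

def ns_propagate(P, H, D, rng):
    """Unbiased Monte Carlo estimate of P @ H using D neighbors per node."""
    out = np.zeros_like(H)
    for v in range(P.shape[0]):
        nbrs = np.flatnonzero(P[v])
        sampled = rng.choice(nbrs, size=D)    # D samples, with replacement
        out[v] = (len(nbrs) / D) * P[v, sampled] @ H[sampled]
    return out

exact = P @ H
# averaging many independent estimates recovers P @ H (unbiasedness)
approx = np.mean([ns_propagate(P, H, D, rng) for _ in range(200)], axis=0)
```

A single estimate is cheap but noisy; the control variate algorithms in the paper are designed to cut this variance without raising D.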

Methods

- The authors examine the variance and convergence of the algorithms empirically on six datasets: Citeseer, Cora, PubMed and NELL from Kipf & Welling (2017), and Reddit and PPI from Hamilton et al. (2017a), as summarized in Table 1.
- The authors repeat the convergence experiments 10 times on Citeseer, Cora, PubMed and NELL, and 5 times on Reddit and PPI.
- The authors use M1+PP with D(l) = 20 as the exact baseline in the following convergence experiments because it is the fastest of the three settings

Conclusion

- The large receptive field size of GCN hinders its fast stochastic training.
- The authors present a preprocessing strategy and two control variate based algorithms to reduce the receptive field size.
- The authors' algorithms achieve convergence speed comparable to the exact algorithm even when the neighbor sampling size D(l) = 2, so that the per-epoch cost of training GCN is comparable with training MLPs.
- The authors present strong theoretical guarantees, including exact prediction and convergence to GCN’s local optimum, for the control variate based algorithm.
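The control variate idea can be sketched in a few lines (a minimal sketch under our own assumptions, not the paper's code): keep a stale copy H_bar of the activations, push only the small residual H − H_bar through the noisy sampled propagation, and propagate H_bar deterministically. As H_bar approaches H the stochastic term vanishes, so the variance goes to zero.

```python
import numpy as np

# Minimal control variate (CV) sketch. For clarity, P @ H_bar is computed
# exactly here; in the paper the historical term is maintained cheaply
# and incrementally rather than recomputed in full.
rng = np.random.default_rng(1)
n, d, D = 100, 8, 2

A = (rng.random((n, n)) < 0.1).astype(float)
np.fill_diagonal(A, 1.0)
P = A / A.sum(1, keepdims=True)
H = rng.standard_normal((n, d))
H_bar = H + 0.05 * rng.standard_normal((n, d))  # slightly stale history

def sampled_propagate(P, X, D, rng):
    """Unbiased D-neighbor Monte Carlo estimate of P @ X."""
    out = np.zeros_like(X)
    for v in range(P.shape[0]):
        nbrs = np.flatnonzero(P[v])
        s = rng.choice(nbrs, size=D)
        out[v] = (len(nbrs) / D) * P[v, s] @ X[s]
    return out

exact = P @ H
ns = sampled_propagate(P, H, D, rng)                      # plain neighbor sampling
cv = sampled_propagate(P, H - H_bar, D, rng) + P @ H_bar  # control variate
# cv's error is much smaller: only the small residual is sampled
```

Both estimators are unbiased, but only CV's variance shrinks as training stabilizes, which is what enables D(l) = 2.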


- Table 1: Number of vertices, edges, and the average number of 1- and 2-hop neighbors per node for each dataset. Undirected edges are counted twice and self-loops are counted once. Reddit is already subsampled to have a maximum degree of 128, following Hamilton et al. (2017a)
- Table 2: Variance of different algorithms in the independent Gaussian case
- Table 3: Testing accuracy of different algorithms and models after a fixed number of epochs. Our implementation does not support M0 with D(l) = ∞ on NELL, so that result is not reported
- Table 4: Time complexity comparison of different algorithms on the Reddit dataset
- Table 5: Time to reach 0.95 testing accuracy

References

- Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Rianne van den Berg, Thomas N Kipf, and Max Welling. Graph convolutional matrix completion. arXiv preprint arXiv:1706.02263, 2017.
- Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
- Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864. ACM, 2016.
- William L Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. arXiv preprint arXiv:1706.02216, 2017a.
- William L Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584, 2017b.
- Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Thomas N Kipf and Max Welling. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.
- Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
- Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995.
- Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710. ACM, 2014.
- Brian D Ripley. Stochastic simulation, volume 316. John Wiley & Sons, 2009.
- Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. arXiv preprint arXiv:1703.06103, 2017.
- Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077. International World Wide Web Conferences Steering Committee, 2015.
- Sida Wang and Christopher Manning. Fast dropout training. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 118–126, 2013.
- 1. We prove by induction. After the first epoch the activation $h^{(0)}_{i,v}$ has been computed at least once for each node $v$, so $H^{(0)}_{CV,i} = \bar{H}^{(0)}_{CV,i} = H^{(0)}$ for all $i > I$. Assume that $H^{(l)}_{CV,i} = \bar{H}^{(l)}_{CV,i} = H^{(l)}$ for all $i > (l+1)I$. Then for all $i > (l+1)I$
- 2. We omit the time subscript $i$ and denote $f_{CV,v} := f(y_v, z^{(L)}_{CV,v})$. By back-propagation, the gradients approximated by CV can be computed as follows
- 2. Lemma 2: For a sequence of weights $W^{(1)}, \ldots, W^{(N)}$ that are close to each other, CV’s gradients are close to unbiased.
- 3. Theorem 2: The SGD algorithm generates weights that change slowly enough for the gradient bias to go to zero, so the algorithm converges.
- 1. The activation $\sigma(\cdot)$ is $\rho$-Lipschitz; 2. $\|X_{CV,i} - X_{CV,j}\|_\infty < \epsilon$ and $\|X_{CV,i} - X_i\|_\infty < \epsilon$ for all $i, j \le T$ and $\epsilon > 0$. Then there exists some $K > 0$, s.t., $\|H_{CV,i} - H_{CV,j}\|_\infty < K\epsilon$ and $\|H_{CV,i} - H_i\|_\infty < K\epsilon$ for all $I < i, j \le T$, where $I$ is the number of iterations per epoch.
- 1. $\|W_i - W_j\|_\infty < \epsilon$, $\forall i, j$; 2. all the activations are $\rho$-Lipschitz; 3. the gradient of the cost function $\nabla_z f(y, z)$ is $\rho$-Lipschitz and bounded; then there exists $K > 0$, s.t., $\|\mathbb{E}_{P,VB}\, g_{CV}(W_i) - g(W_i)\|_\infty < K\epsilon$. Proof: this proof is a modification of Ghadimi & Lan (2013), but using biased stochastic gradients instead. We assume the algorithm is already warmed up for $LI$ steps with the initial weights $W_0$, so that Lemma 2 holds for step $i > 0$. Denote $\delta_i = g_{CV}(W_i) - \nabla\mathcal{L}(W_i)$. By smoothness we have
- In this section we describe the details of our model architectures. We use the Adam optimizer (Kingma & Ba, 2014) with learning rate 0.01.
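The Adam update used for training can be sketched as below on a toy quadratic. The learning rate 0.01 is from the text; beta1, beta2, and eps are Adam's standard defaults, not values stated in the paper.

```python
import numpy as np

# Sketch of the Adam update rule (Kingma & Ba, 2014) with lr = 0.01,
# applied to minimizing the toy objective ||w||^2.
def adam_step(w, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad            # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad**2         # second-moment estimate
    m_hat = m / (1 - b1**t)                 # bias correction
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([5.0, -3.0])
m = v = np.zeros_like(w)
for t in range(1, 2001):
    grad = 2 * w                            # gradient of ||w||^2
    w, m, v = adam_step(w, grad, m, v, t)
# w moves toward the minimizer [0, 0]
```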
