Training Overparametrized Neural Networks in Sublinear Time
arXiv (2022)
Abstract
The success of deep learning comes at a tremendous computational and energy
cost, and the scalability of training massively overparametrized neural
networks is becoming a real barrier to the progress of artificial intelligence
(AI). Despite the popularity and low cost-per-iteration of traditional
backpropagation via gradient descent, stochastic gradient descent (SGD) has
a prohibitive convergence rate in non-convex settings, both in theory and
in practice.
To mitigate this cost, recent works have proposed to employ alternative
(Newton-type) training methods with much faster convergence rate, albeit with
higher cost-per-iteration. For a typical neural network with
m=poly(n) parameters and input batch of n datapoints in
ℝ^d, the previous work of [Brand, Peng, Song, and Weinstein,
ITCS'2021] requires ∼ mnd + n^3 time per iteration. In this paper, we
present a novel training method that requires only m^{1-α} nd + n^3
amortized time in the same overparametrized regime, where α ∈ (0.01, 1)
is some fixed constant. This method relies on a new and alternative view of
neural networks, as a set of binary search trees, where each iteration
corresponds to modifying a small subset of the nodes in the tree. We believe
this view would have further applications in the design and analysis of deep
neural networks (DNNs).
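The tree-based view sketched above can be illustrated with a toy example. The snippet below is a hypothetical sketch, not the paper's actual data structure: it maintains a binary tree over a layer's m neuron summaries (here, a running sum), so that updating a single neuron recomputes only the O(log m) nodes on its root-to-leaf path rather than rescanning all m entries. This mirrors the idea that each training iteration modifies a small subset of tree nodes.

```python
class NeuronTree:
    """Toy binary-tree view of a layer (illustrative only): leaves hold
    per-neuron summary values, internal nodes hold the sum of their two
    children. A point update touches O(log m) nodes, not O(m)."""

    def __init__(self, values):
        # Assumes len(values) is a power of two for simplicity.
        self.m = len(values)
        self.tree = [0.0] * (2 * self.m)
        for i, v in enumerate(values):
            self.tree[self.m + i] = v
        # Build internal nodes bottom-up.
        for i in range(self.m - 1, 0, -1):
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def update(self, i, v):
        """Change neuron i's value; recompute only its ancestors.
        Returns the number of internal nodes touched (O(log m))."""
        i += self.m
        self.tree[i] = v
        touched = 0
        while i > 1:
            i //= 2
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
            touched += 1
        return touched

    def total(self):
        return self.tree[1]


# With m = 1024 neurons, one update touches only log2(1024) = 10 nodes.
t = NeuronTree([1.0] * 1024)
print(t.total())        # 1024.0
print(t.update(5, 2.0)) # 10
print(t.total())        # 1025.0
```

The same skeleton extends to richer per-node aggregates (e.g. partial inner products), which is the kind of bookkeeping that lets per-iteration work scale with the number of modified nodes instead of with m.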
Keywords
neural networks,training,time