Compressing the Gram Matrix for Learning Neural Networks in Polynomial Time

Neural Information Processing Systems (2017)

Abstract
We consider the problem of learning function classes computed by neural networks with various activations (e.g., ReLU or sigmoid), a task believed to be intractable in the worst case. A major open problem is to understand the minimal assumptions under which these classes admit efficient algorithms. In this work we show that a natural distributional assumption on the eigenvalue decay of the Gram matrix yields polynomial-time algorithms in the non-realizable setting for expressive classes of networks (e.g., feed-forward networks of ReLUs). We make no other assumptions on the network architecture or the labels. Given sufficiently strong polynomial eigenvalue decay, we obtain fully polynomial-time algorithms in all the parameters with respect to square loss. Milder decay also leads to improved algorithms. We are not aware of any prior work where an assumption on the marginal distribution alone leads to polynomial-time algorithms for networks of ReLUs, even with one hidden layer. Unlike prior assumptions (e.g., that the marginal distribution is Gaussian), eigenvalue decay has been observed in practice on common data sets. Our algorithm applies to any function class that can be embedded in a suitable RKHS. The main technical contribution is a new approach to proving generalization bounds for kernelized regression using compression schemes rather than Rademacher bounds. In general, it is known that sample-complexity bounds for kernel methods must depend on the norm of the corresponding RKHS, which can quickly become large depending on the kernel function employed. We sidestep these worst-case bounds by sparsifying the Gram matrix using recent work on recursive Nyström sampling due to Musco and Musco. We prove that our approximate, sparse hypothesis admits a compression scheme whose true error depends on the rate of eigenvalue decay.
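
As a rough illustration of the pipeline the abstract describes, the sketch below compresses the Gram matrix with a Nyström approximation and then solves the resulting small kernel ridge regression problem. This is a minimal sketch only, not the authors' algorithm: uniform landmark sampling stands in for the recursive ridge-leverage-score sampling of Musco and Musco, and the RBF kernel, bandwidth gamma, landmark count m, and regularization lam are illustrative choices.

```python
# Sketch (assumed, not from the paper): Nystrom-compressed kernel ridge
# regression. Uniform landmark sampling is used here in place of the
# recursive ridge-leverage-score sampling of Musco & Musco.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # K[i, j] = exp(-gamma * ||A_i - B_j||^2)
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom_krr_fit(X, y, m=50, lam=1e-2, gamma=1.0, rng=None):
    """Fit kernel ridge regression on an m-landmark Nystrom approximation
    of the n x n Gram matrix, so the cost scales with m rather than n."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(X), size=min(m, len(X)), replace=False)
    landmarks = X[idx]
    C = rbf_kernel(X, landmarks, gamma)          # n x m cross-kernel block
    W = rbf_kernel(landmarks, landmarks, gamma)  # m x m landmark Gram block
    # alpha minimizes ||C alpha - y||^2 + lam * alpha^T W alpha,
    # an m-dimensional system instead of the full n x n one.
    alpha = np.linalg.solve(C.T @ C + lam * W + 1e-10 * np.eye(len(idx)),
                            C.T @ y)
    return landmarks, alpha

def nystrom_krr_predict(X_new, landmarks, alpha, gamma=1.0):
    return rbf_kernel(X_new, landmarks, gamma) @ alpha

# Usage: regress noisy labels generated by a one-hidden-unit ReLU target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = np.maximum(X @ rng.normal(size=10), 0.0) + 0.1 * rng.normal(size=500)
landmarks, alpha = nystrom_krr_fit(X, y, m=50, rng=0)
print(nystrom_krr_predict(X[:5], landmarks, alpha))
```

In the paper's analysis, the point of such a compressed hypothesis is that it depends on only a small set of sampled columns, which is what makes the compression-scheme generalization argument go through when the Gram matrix eigenvalues decay quickly.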