# Near Input Sparsity Time Kernel Embeddings via Adaptive Sampling

ICML 2020.

Abstract:

To accelerate kernel methods, we propose a near input sparsity time algorithm for sampling the high-dimensional feature space implicitly defined by a kernel transformation. Our main contribution is an importance sampling method for subsampling the feature space of a degree q tensoring of data points in almost input sparsity time, improvin...

Introduction

- Kernel methods provide a simple, yet powerful framework for applying non-parametric modeling techniques to a number of important problems in statistics and machine learning, such as kernel ridge regression, SVM, PCA, CCA, etc.
- (2017) prove that for any kernel $K$ with statistical dimension $s_\lambda$, there exists an algorithm that outputs a matrix $Z \in \mathbb{R}^{s \times n}$ with $s = O(s_\lambda \log n)$ which satisfies the spectral approximation guarantee of (1) with high probability, using
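
Equation (1) and the definition of the statistical dimension are not reproduced in this summary. For orientation, the block below writes out the form such guarantees usually take in the kernel sketching literature. It is a sketch under assumptions: the symbols $\epsilon$ and $\lambda_i(K)$ are introduced here for illustration, and the exact constants in the paper's equation (1) may differ.

```latex
% Assumed standard form of a spectral approximation guarantee for
% K \in \mathbb{R}^{n \times n}, ridge parameter \lambda > 0, and
% accuracy \epsilon \in (0, 1); constants in the paper's (1) may differ.
(1 - \epsilon)\,(K + \lambda I) \;\preceq\; Z^\top Z + \lambda I \;\preceq\; (1 + \epsilon)\,(K + \lambda I)

% Statistical dimension of K at ridge level \lambda, where
% \lambda_1(K), \dots, \lambda_n(K) are the eigenvalues of K.
s_\lambda \;=\; \operatorname{tr}\!\left( K (K + \lambda I)^{-1} \right)
          \;=\; \sum_{i=1}^{n} \frac{\lambda_i(K)}{\lambda_i(K) + \lambda}
```

Since each term of the sum is below 1, $s_\lambda \le n$, and it can be much smaller when the spectrum of $K$ decays quickly, which is what makes a target dimension of $O(s_\lambda \log n)$ attractive.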

Highlights

- A classical solution for scaling up kernel methods is via kernel low-rank approximation, where one seeks to find a low-rank matrix $Z \in \mathbb{R}^{s \times n}$ such that $Z^\top Z$ can serve as a proxy to the kernel matrix $K$.
- Our main result for the polynomial kernel is given in the following theorem.
- In the experiments section we evaluate our approximate kernel ridge regression method on various standard large-scale regression datasets and empirically show that our method competes favorably with the state of the art, including the Nyström method (Musco & Musco, 2017) and Fourier features methods (Rahimi & Recht, 2008), as well as the oblivious sketching of Ahle et al. (2020).
- Additional downstream learning applications: While we focus on kernel ridge regression here, we remark that spectral approximation bounds form the basis of analyzing sketching methods for tasks including kernel low-rank approximation, PCA, CCA, k-means and many more.
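
To make the role of the proxy $Z^\top Z$ concrete, here is a minimal sketch of approximate kernel ridge regression given a feature matrix `Z` of shape s × n. It is not the paper's implementation: the helper names (`solve`, `approx_krr_fit`, `approx_krr_fitted`) and the toy data are hypothetical, and plain Gaussian elimination stands in for whatever solver one would use in practice. The point it illustrates is that only an s × s system has to be solved instead of an n × n one.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Work on an augmented copy so the inputs are not modified.
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def approx_krr_fit(Z, y, lam):
    """Return w = (Z Z^T + lam I)^{-1} Z y for Z of shape s x n."""
    s, n = len(Z), len(Z[0])
    G = [[sum(Z[a][k] * Z[b][k] for k in range(n)) + (lam if a == b else 0.0)
          for b in range(s)] for a in range(s)]
    Zy = [sum(Z[a][k] * y[k] for k in range(n)) for a in range(s)]
    return solve(G, Zy)


def approx_krr_fitted(Z, w):
    """Fitted values Z^T w on the n training points."""
    s, n = len(Z), len(Z[0])
    return [sum(Z[a][k] * w[a] for a in range(s)) for k in range(n)]


# Toy example (hypothetical data): s = 1 feature, n = 2 points.
Z = [[1.0, 2.0]]
y = [1.0, 1.0]
w = approx_krr_fit(Z, y, lam=1.0)
fitted = approx_krr_fitted(Z, w)
```

By the push-through identity, $Z^\top (Z Z^\top + \lambda I)^{-1} Z y = Z^\top Z (Z^\top Z + \lambda I)^{-1} y$, so these fitted values coincide with exact kernel ridge regression on the proxy kernel $Z^\top Z$.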

Results

- The authors start by presenting a recursive importance sampling algorithm that efficiently computes a matrix $Z$ which satisfies the spectral approximation guarantee of (1) for the kernel $K = \Phi^\top \Phi$.
- An important technical contribution of this work is an efficient algorithm that can perform row norm sampling on a matrix of the form $X^{\otimes q}(B^\top B + \lambda I)^{-1/2}$ using nearly $\mathrm{nnz}(X)$ runtime, where $X \in \mathbb{R}^{d \times n}$ and $B \in \mathbb{R}^{m \times n}$.
- Overview of Algorithm 2: The goal is to generate a sample $(i_1, i_2, \ldots, i_q) \in [d]^q$ with probability proportional to the squared norm of the row $(i_1, \ldots, i_q)$ of the matrix $X^{\otimes q}(B^\top B + \lambda I)^{-1/2}$.
- Note that the actual procedure requires more work, because the authors need to generate $s$ i.i.d. samples from the row norm distribution. To ensure that the runtime does not lose a multiplicative factor of $s$, which would result in $s \cdot \mathrm{nnz}(X)$ total time, the authors additionally sketch and randomly partition the rows of the matrix $X$ into $\Theta(q^{3/2} s)$ buckets.
- For any matrices $X \in \mathbb{R}^{d \times n}$ and $B \in \mathbb{R}^{m \times n}$, any $\lambda > 0$, and any positive integers $q, s$, with high probability Algorithm 2 outputs a rank-$s$ row norm sampler for $X^{\otimes q}(B^\top B + \lambda I)^{-1/2}$ (Definition 3.1) in time $O(m^2 n + q^{15/2} s^2 n \log^3 n + q^{5/2} \log^3 n \cdot \mathrm{nnz}(X))$.
- By union bounding over $s$ events, with high probability the rows of $X$ hashed to bucket $t$ satisfy $\mathrm{nnz}(X_{h^{-1}(t)}) = O((\mathrm{nnz}(X)/s + n) \log n)$ simultaneously for all buckets $t$. This implies that the distribution over $i \in h^{-1}(t)$ in line 17 of the algorithm can be computed in time $O(q^2 n \log^2 n + q^2 \log^2 n \cdot \mathrm{nnz}(X)/s)$ for a fixed $a \in [q]$ and a fixed $l \in [s]$.
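
The target distribution of the row norm sampler can be illustrated with a brute-force baseline. The sketch below enumerates all $d^q$ index tuples, so it takes time exponential in $q$ and is not Algorithm 2, which avoids this enumeration. The function names are hypothetical, and `M` is an arbitrary n-column matrix passed in by the caller, standing in for $(B^\top B + \lambda I)^{-1/2}$, which is not computed here.

```python
import itertools
import random


def tensor_row(X, idx):
    """Row (i_1, ..., i_q) of X^{(tensor) q}: entry j is prod_a X[i_a][j]."""
    n = len(X[0])
    row = [1.0] * n
    for i in idx:
        row = [row[j] * X[i][j] for j in range(n)]
    return row


def row_norm_distribution(X, M, q):
    """Probability of each tuple, proportional to the squared norm of its row in X^{(tensor) q} M."""
    d, n = len(X), len(X[0])
    weights = {}
    for idx in itertools.product(range(d), repeat=q):
        r = tensor_row(X, idx)
        rM = [sum(r[j] * M[j][k] for j in range(n)) for k in range(len(M[0]))]
        weights[idx] = sum(v * v for v in rM)
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("all rows have zero norm")
    return {idx: wgt / total for idx, wgt in weights.items()}


def sample_rows(dist, s, rng):
    """Draw s i.i.d. index tuples from the row norm distribution."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys], k=s)


# Toy example (hypothetical data): d = 2, n = 2, q = 2, M = identity.
X = [[1.0, 0.0], [0.0, 2.0]]
M = [[1.0, 0.0], [0.0, 1.0]]
dist = row_norm_distribution(X, M, q=2)
samples = sample_rows(dist, s=5, rng=random.Random(0))
```

In this toy case only the rows (0, 0) and (1, 1) are nonzero, with squared norms 1 and 16, so they receive probabilities 1/17 and 16/17.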

Conclusion

- Since matrix $A$ is a concatenation of tensor products $X^{\otimes j}$ for $j = 0, 1, \ldots, q$, using the iterative leverage score sampling procedure for the polynomial kernel the authors can spectrally approximate $A^\top A$ in nearly $\mathrm{nnz}(X)$ time.
- The authors can obtain a subspace embedding for the inverse polynomial kernel in nearly nnz(X) time by applying the sampling method from
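
The concatenation structure can be checked numerically on a small case. The sketch below assumes the plain, unweighted concatenation of $x^{\otimes j}$ for $j = 0, \ldots, q$ (the paper's matrix $A$ may carry per-degree coefficients, which this summary does not state), and the function names are hypothetical. Under that assumption, the inner product of two concatenated feature vectors equals $\sum_{j=0}^{q} \langle x, y \rangle^j$.

```python
import itertools


def tensor_power(x, j):
    """Flattened j-fold tensor power of x; j = 0 gives [1.0]."""
    out = []
    for idx in itertools.product(range(len(x)), repeat=j):
        p = 1.0
        for i in idx:
            p *= x[i]
        out.append(p)
    return out


def concat_features(x, q):
    """Concatenate the tensor powers of x for j = 0, 1, ..., q (no coefficients assumed)."""
    feats = []
    for j in range(q + 1):
        feats.extend(tensor_power(x, j))
    return feats


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Toy example (hypothetical data).
x = [1.0, 2.0]
y = [0.5, 1.0]
q = 2
lhs = dot(concat_features(x, q), concat_features(y, q))
rhs = sum(dot(x, y) ** j for j in range(q + 1))
```

The explicit feature vector has $\sum_{j=0}^{q} d^j$ entries, which is why sampling its coordinates implicitly, rather than materializing it, matters for large $d$ and $q$.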

Summary

- Table 1: The RMSE on the test set, along with the total training time, of approximate KRR via various approximation methods.

Funding

- Woodruff was supported in part by Office of Naval Research (ONR) grant N00014-18-1-2562

Reference

- Ahle, T. D., Kapralov, M., Knudsen, J. B., Pagh, R., Velingker, A., Woodruff, D. P., and Zandieh, A. Oblivious sketching of high-degree polynomial kernels. In Proceedings of the Thirty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 141–160. SIAM, 2020.
- Alaoui, A. and Mahoney, M. W. Fast randomized kernel ridge regression with statistical guarantees. In Advances in Neural Information Processing Systems, pp. 775–783, 2015.
- Avron, H., Nguyen, H., and Woodruff, D. Subspace embeddings for the polynomial kernel. In Advances in neural information processing systems, pp. 2258–2266, 2014.
- Avron, H., Clarkson, K. L., and Woodruff, D. P. Faster kernel ridge regression using sketching and preconditioning. SIAM Journal on Matrix Analysis and Applications, 38 (4):1116–1138, 2017a.
- Avron, H., Kapralov, M., Musco, C., Musco, C., Velingker, A., and Zandieh, A. Random Fourier features for kernel ridge regression: Approximation bounds and statistical guarantees. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 253–262. JMLR.org, 2017b.
- Cohen, M. B., Musco, C., and Pachocki, J. Online row sampling. arXiv preprint arXiv:1604.05448, 2016.
- Cohen, M. B., Musco, C., and Musco, C. Input sparsity time low-rank approximation via ridge leverage score sampling. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1758–1777. SIAM, 2017.
- Dasgupta, S. and Gupta, A. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 22(1):60–65, 2003.
- Kane, D. M. and Nelson, J. Sparser Johnson-Lindenstrauss transforms. Journal of the ACM (JACM), 61(1):4, 2014.
- Kapralov, M., Lee, Y. T., Musco, C., Musco, C., and Sidford, A. Single pass spectral sparsification in dynamic streams. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pp. 561–570. IEEE, 2014.
- Le, Q., Sarlos, T., and Smola, A. Fastfood - approximating kernel expansions in loglinear time. In Proceedings of the International Conference on Machine Learning, volume 85, 2013.
- Musco, C. and Musco, C. Recursive sampling for the Nyström method. In Advances in Neural Information Processing Systems, pp. 3833–3845, 2017.
- Pham, N. and Pagh, R. Fast and scalable polynomial kernels via explicit feature maps. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 239–247, 2013.
- Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Advances in neural information processing systems, pp. 1177–1184, 2008.
- Schoenberg, I. Positive definite functions on spheres. Duke Math. J, 1:172, 1988.
- Zandieh, A., Nouri, N., Velingker, A., Kapralov, M., and Razenshteyn, I. Scaling up kernel ridge regression via locality sensitive hashing. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp. 4088–4097, Online, 26–28 Aug 2020. PMLR.
