# Approximate Multiplication of Sparse Matrices with Limited Space

Abstract

Approximate matrix multiplication (AMM) with limited space has received ever-increasing attention due to the emergence of large-scale applications. Recently, building on a popular matrix sketching algorithm, frequent directions, previous work introduced co-occuring directions (COD) to reduce the approximation error for this problem.

Introduction

- Matrix multiplication refers to computing the product XY^T of two matrices X ∈ R^{m_x×n} and Y ∈ R^{m_y×n}, which is a fundamental task in many machine learning applications such as regression (Naseem et al., 2010; Cohen et al., 2015), online learning (Hazan et al., 2007; Chu et al., 2011), information retrieval (Eriksson-Bique et al., 2011), and canonical correlation analysis (Hotelling, 1936; Chen et al., 2015).
- Given two large matrices X ∈ R^{m_x×n} and Y ∈ R^{m_y×n}, the goal of AMM with limited space is to find two small sketches B_X ∈ R^{m_x×l} and B_Y ∈ R^{m_y×l} such that B_X B_Y^T approximates XY^T well, where l ≪ min(m_x, m_y, n) is the sketch size.
- Randomized techniques such as column selection (Drineas et al., 2006) and random projection (Sarlos, 2006; Magen and Zouzias, 2011; Cohen et al., 2015) have been utilized to develop lightweight algorithms for AMM with O(n(m_x + m_y)l) time complexity and O((m_x + m_y)l) space complexity, and have yielded theoretical guarantees for the approximation error.
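The random-projection approach can be sketched in a few lines: both matrices are compressed with a shared random map, and the product of the sketches estimates the full product. A minimal illustration (function and variable names are ours, not from the paper; a scaled Gaussian map is one standard choice):

```python
import numpy as np

def rp_amm(X, Y, l, seed=0):
    """Sketch X (m_x x n) and Y (m_y x n) with a shared Gaussian projection
    Pi (l x n). Since E[Pi.T @ Pi] = I_n, B_X @ B_Y.T estimates X @ Y.T."""
    n = X.shape[1]
    rng = np.random.default_rng(seed)
    Pi = rng.standard_normal((l, n)) / np.sqrt(l)
    return X @ Pi.T, Y @ Pi.T  # sketches B_X (m_x x l), B_Y (m_y x l)

# Tiny demo: the relative error decreases as the sketch size l grows.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 400))
Y = rng.standard_normal((60, 400))
B_X, B_Y = rp_amm(X, Y, l=200)
rel_err = (np.linalg.norm(X @ Y.T - B_X @ B_Y.T, 'fro')
           / (np.linalg.norm(X, 'fro') * np.linalg.norm(Y, 'fro')))
```

Note that each sketch needs only O(m l) space, matching the limited-space setting above.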
- Early studies (Drineas et al, 2006; Sarlos, 2006) focused on the Frobenius error, and achieved the following bound
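The bound referred to above has the following form (stated in expectation with an unspecified constant factor c, following the style of Drineas et al., 2006):

$$\mathbb{E}\big[\lVert XY^\top - B_X B_Y^\top\rVert_F\big] \;\le\; \frac{c}{\sqrt{l}}\,\lVert X\rVert_F\,\lVert Y\rVert_F$$

so the Frobenius error decays at rate 1/√l in the sketch size l.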

Highlights

- We report the runtime of each algorithm to verify the efficiency of our sparse co-occuring directions (SCOD).
- From the comparison of runtime, we find that our SCOD is significantly faster than co-occuring directions (COD), FD-AMM (frequent directions for approximate matrix multiplication), and random projection (RP) across different l.
- We again find that our SCOD is faster than COD and FD-AMM, and achieves better performance across different l.
- We propose SCOD to reduce the time complexity of COD for approximate multiplication of sparse matrices, with O((m_x + m_y + l)l) space complexity.
- In terms of approximation error and projection error, our SCOD matches or improves upon the performance of COD across different l.
- The theoretical guarantee of our algorithm is almost the same as that of COD, up to a constant factor. Experiments on both synthetic and real datasets demonstrate the advantage of our SCOD for handling sparse matrices.

Methods

- The authors conduct experiments on two synthetic datasets and two real datasets: NIPS conference papers (Perrone et al., 2017) and MovieLens 10M.
- A noisy low-rank dataset is generated by adding sparse noise to the low-rank matrices described above.
- The NIPS conference papers dataset is originally an 11463×5811 word-by-document matrix M, which contains the distribution of words in 5811 papers published between 1987 and 2015.
- The authors let X^T be the first 5338 columns of M and Y^T be the others.
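The column split above is a simple slice; a sketch using scipy (shapes are scaled down for the demo, and CSC format is assumed so column slicing stays cheap):

```python
import numpy as np
from scipy import sparse

# Stand-in for the word-by-document matrix M (the real M is 11463 x 5811,
# with split index 5338; here everything is scaled down by 10x).
M = sparse.random(1146, 581, density=0.02, format='csc', random_state=0)
split = 533

XT = M[:, :split]   # X^T: the first columns of M
YT = M[:, split:]   # Y^T: the remaining columns
X = XT.T.tocsr()    # rows of X are documents; shared dimension n = 1146
Y = YT.T.tocsr()
```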

Results

- The authors first introduce a boosted version of simultaneous iteration, which is necessary for controlling the failure probability of the algorithm.
- From the comparison of runtime, the authors find that SCOD is significantly faster than COD, FD-AMM, and RP across different l.
- Figs. 3 and 4 show the results of SCOD, COD, and FD-AMM across different l on the real datasets.
- The results of column selection (CS) and RP are omitted, because they are much worse than those of SCOD, COD, and FD-AMM.
- The authors again find that SCOD is faster than COD and FD-AMM, and achieves better performance across different l.

Conclusion

- The authors propose SCOD to reduce the time complexity of COD for approximate multiplication of sparse matrices, with O((m_x + m_y + l)l) space complexity.
- The time complexity of SCOD is O((nnz(X) + nnz(Y))l + nl), which is much tighter than the O(n(m_x + m_y + l)l) of COD for sparse matrices.
- The theoretical guarantee of the algorithm is almost the same as that of COD, up to a constant factor.
- Experiments on both synthetic and real datasets demonstrate the advantage of SCOD for handling sparse matrices.

Funding

- In information retrieval, the word-by-document matrix could contain less than 5% non-zero entries (Dhillon and Modha, 2001)
- In recommender systems, the user-item rating matrix could contain less than 7% non-zero entries (Zhang et al, 2017)
- [Figure: (a) approximation error, (b) projection error, (c) runtime] X and Y contain less than 2% non-zero entries
- In terms of approximation error and projection error, our SCOD matches or improves the performance of COD among different l
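The two evaluation metrics can be computed directly from the sketches. The sketch below uses plausible definitions (the paper's exact normalization may differ): approximation error as the relative Frobenius error of the sketched product, and projection error as the residual after projecting XY^T onto the top-k singular directions of the sketches.

```python
import numpy as np

def approx_error(X, Y, B_X, B_Y):
    # Relative Frobenius error of the sketched product (assumed
    # normalization by ||X||_F ||Y||_F).
    return (np.linalg.norm(X @ Y.T - B_X @ B_Y.T, 'fro')
            / (np.linalg.norm(X, 'fro') * np.linalg.norm(Y, 'fro')))

def projection_error(X, Y, B_X, B_Y, k):
    # One plausible definition: residual after projecting XY^T onto the
    # top-k left singular directions of the two sketches.
    U = np.linalg.svd(B_X, full_matrices=False)[0][:, :k]
    V = np.linalg.svd(B_Y, full_matrices=False)[0][:, :k]
    P = U @ (U.T @ X) @ (Y.T @ V) @ V.T
    return np.linalg.norm(X @ Y.T - P, 'fro')

# Sanity check: with the exact factors used as "sketches", both vanish.
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 30))
Y = rng.standard_normal((10, 30))
e_approx = approx_error(X, Y, X, Y)
e_proj = projection_error(X, Y, X, Y, k=10)
```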

Study subjects and analysis

papers: 5811

With the same r, a noisy low-rank dataset is generated by adding sparse noise to the above low-rank matrices as follows:

```matlab
X = sprand(1e3, 1e4, 0.01, r) + sprand(1e3, 1e4, 0.01);
Y = sprand(2e3, 1e4, 0.01, r) + sprand(2e3, 1e4, 0.01);
```
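In MATLAB, `sprand` with a fourth argument produces an approximately low-rank sparse matrix, and the second term adds sparse noise. A rough scipy analogue (illustrative only, with sizes scaled down; this does not reproduce `sprand`'s exact construction):

```python
import numpy as np
from scipy import sparse

def noisy_low_rank(m, n, r, density, seed):
    # Low-rank part as a product of thin sparse factors (rank <= r),
    # plus sparse noise, mirroring the two sprand terms above.
    low = (sparse.random(m, r, density=0.1, random_state=seed)
           @ sparse.random(r, n, density=0.1, random_state=seed + 1))
    noise = sparse.random(m, n, density=density, random_state=seed + 2)
    return (low + noise).tocsr()

X = noisy_low_rank(100, 1000, r=10, density=0.01, seed=0)
Y = noisy_low_rank(200, 1000, r=10, density=0.01, seed=3)
```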

2. https://archive.ics.uci.edu/ml/datasets/NIPS+Conference+Papers+1987-2015
3. https://grouplens.org/datasets/movielens/10m/

[Figure: (a) approximation error, (b) projection error, (c) runtime] X and Y contain less than 2% non-zero entries. Moreover, the NIPS conference papers dataset is originally an 11463×5811 word-by-document matrix M, which contains the distribution of words in 5811 papers published between 1987 and 2015. In our experiment, let X^T be the first 2905 columns of M, and let Y^T be the others.

References

- Xi Chen, Han Liu, and Jaime G. Carbonell. Structured sparse canonical correlation analysis. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, pages 199–207, 2015.
- Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
- Michael B. Cohen, Jelani Nelson, and David P. Woodruff. Optimal approximate matrix product in terms of stable rank. arXiv:1507.02268, 2015.
- Inderjit S. Dhillon and Dharmendra S. Modha. Concept decompositions for large sparse text data using clustering. Machine learning, 42:143–175, 2001.
- Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication. SIAM Journal on Computing, 36(1): 132–157, 2006.
- Sylvester Eriksson-Bique, Mary Solbrig, Michael Stefanelli, Sarah Warkentin, Ralph Abbey, and Ilse C. F. Ipsen. Importance sampling for a Monte Carlo matrix multiplication algorithm, with application to information retrieval. SIAM Journal on Scientific Computing, 33(4):1689–1706, 2011.
- Mina Ghashami and Jeff M. Phillips. Relative errors for deterministic low-rank matrix approximations. In Proceedings of the 25th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 707–717, 2014.
- Mina Ghashami, Edo Liberty, and Jeff M. Phillips. Efficient frequent directions algorithm for sparse matrices. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 845–854, 2016a.
- Mina Ghashami, Edo Liberty, Jeff M. Phillips, and David P. Woodruff. Frequent directions: Simple and deterministic matrix sketching. SIAM Journal on Computing, 45(5):1762– 1792, 2016b.
- Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
- Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2):169–192, 2007.
- Harold Hotelling. Relations between two sets of variates. Biometrika, 28:321–377, 1936.
- Ilja Kuzborskij, Leonardo Cella, and Nicolo Cesa-Bianchi. Efficient linear bandits through matrix sketching. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, pages 177–185, 2019.
- Edo Liberty. Simple and deterministic matrix sketching. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 581– 588, 2013.
- Avner Magen and Anastasios Zouzias. Low rank matrix-valued Chernoff bounds and approximate matrix multiplication. In Proceedings of the 22nd Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1422–1436, 2011.
- Youssef Mroueh, Etienne Marcheret, and Vaibhava Goel. Co-occuring directions sketching for approximate matrix multiply. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 567–575, 2017.
- Cameron Musco and Christopher Musco. Randomized block Krylov methods for stronger and faster approximate singular value decomposition. In Advances in Neural Information Processing Systems 28, pages 1396–1404, 2015.
- Imran Naseem, Roberto Togneri, and Mohammed Bennamoun. Linear regression for face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(11): 2106–2112, 2010.
- Valerio Perrone, Paul A. Jenkins, Dario Spano, and Yee Whye Teh. Poisson random fields for dynamic feature models. Journal of Machine Learning Research, 18:1–45, 2017.
- Vladimir Rokhlin, Arthur Szlam, and Mark Tygert. A randomized algorithm for principal component analysis. SIAM Journal on Matrix Analysis and Applications, 31(3):1100– 1124, 2009.
- Tamas Sarlos. Improved approximation algorithms for large matrices via random projections. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, pages 1–10, 2006.
- Rafi Witten and Emmanuel J. Candes. Randomized algorithms for low-rank matrix factorizations: Sharp performance bounds. Algorithmica, 31(3):1–18, 2014.
- David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Machine Learning, 10(1–2):1–157, 2014.
- Qiaomin Ye, Luo Luo, and Zhihua Zhang. Frequent direction algorithms for approximate matrix multiplication with applications in CCA. In Proceedings of the 25th International Joint Conference on Artificial Intelligence, pages 2301–2307, 2016.
- Shuai Zhang, Lina Yao, and Xiwei Xu. AutoSVD++: An efficient hybrid collaborative filtering model via contractive auto-encoders. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 957–960, 2017.
