# Improved Guarantees for k-means++ and k-means++ Parallel

NeurIPS 2020

Abstract

In this paper, we study k-means++ and k-means++ parallel, the two most popular algorithms for the classic k-means clustering problem. We provide novel analyses and show improved approximation and bi-criteria approximation guarantees for k-means++ and k-means++ parallel. Our results give a better theoretical justification for why these algorithms perform extremely well in practice.

Introduction

- The authors study k-means++ and k-means||, the two most popular algorithms for the classic k-means clustering problem (the D² seeding step the two algorithms share is sketched after this list).
- The authors improve the bound of Arthur and Vassilvitskii (2007) on the expected cost of a covered cluster in k-means++.
- Let cost_{k+Δ}(X) be the cost of the clustering with k + Δ centers sampled by the k-means++ algorithm.
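
Both algorithms are built on D² sampling: each new center is drawn with probability proportional to the current cost of a point, i.e., its squared distance to the nearest already-chosen center. Below is a minimal NumPy sketch of this seeding step (the function and variable names are ours, not the paper's); taking num_centers = k + Δ gives the bi-criteria setting analyzed in the paper.

```python
import numpy as np

def kmeans_pp_seeding(X, num_centers, rng=None):
    """D^2 seeding (Arthur and Vassilvitskii, 2007): each new center is
    sampled with probability proportional to cost(x, C), the squared
    distance from x to its nearest already-chosen center."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                 # first center: uniform over X
    d2 = np.sum((X - centers[0]) ** 2, axis=1)     # cost(x, C) for every point
    for _ in range(num_centers - 1):
        idx = rng.choice(n, p=d2 / d2.sum())       # D^2 sampling step
        centers.append(X[idx])
        # Keep d2 equal to cost(x, C) after adding the new center.
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.array(centers)
```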

Highlights

- In this paper, we study k-means++ and k-means||, the two most popular algorithms for the classic k-means clustering problem
- In Section 4, we show that for any t, we have E[H_t(X)] ≤ 5·OPT_k(X), which is an improvement over the bound of 8·OPT_k(X) given by Arthur and Vassilvitskii (2007)
- We show an upper bound on the expected cost of the solution returned by k-means||Pois
- In Section 7.2, we show how to efficiently implement k-means++ER using lazy updates and explain why our algorithm makes R passes over the data set
- In the lazy version of this algorithm, the algorithm makes a pass over the data set and samples a new batch of centers every time a counter maintained by the algorithm is incremented (the exponential race underlying this implementation is sketched below)
- We show that the expected size of the set Z is at most ℓ
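
The k-means++ER bullets above rest on an "exponential race" view of D² sampling: if every point x draws an independent exponential arrival time with rate cost(x, C), the earliest arrival lands on x with probability cost(x, C)/cost(X, C), exactly the D² distribution; this is what makes lazy, batched updates possible. A minimal sketch of just this equivalence, with our own names (the paper's multi-pass implementation with lazy updates is considerably more refined):

```python
import numpy as np

def exponential_race_pick(d2, rng):
    """Pick the next center index via an exponential race: point x arrives
    at time Exp(rate = d2[x]), and the minimum arrival falls on x with
    probability d2[x] / d2.sum(), i.e., exactly D^2 sampling."""
    # Exp(rate) equals a standard exponential divided by the rate; points
    # with zero cost never win, so give them an infinite arrival time.
    rates = np.where(d2 > 0, d2, 1.0)
    arrivals = rng.standard_exponential(d2.shape[0]) / rates
    arrivals[d2 == 0] = np.inf
    return int(np.argmin(arrivals))
```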

Results

- The authors prove bound (2) on the expected cost of the clustering returned by k-means++ after k + Δ rounds.
- The authors use ideas from Arthur and Vassilvitskii (2007) and Dasgupta (2013) to prove the following statement, in which the cost of uncovered clusters is counted only when the number of misses after k rounds of k-means++ is greater than Δ/2.
- The authors establish the first and second bounds from Theorem 5.1 on the expected cost of the clustering after k + Δ rounds of k-means++.
- Let cost_{k+Δ}(X) be the cost of the clustering resulting from sampling k + Δ centers according to the k-means++ algorithm.
- The expected cost E[cost_{T+1}(X)] of the clustering returned by the k-means|| algorithm after T rounds is upper bounded, with one bound for the case ℓ < k and another for ℓ ≥ k (one oversampling round is sketched after this list).
- The authors show an upper bound on the expected cost of the solution returned by k-means||Pois.
- After T rounds of k-means||, the expected clustering cost E[cost_T(X)] is at most 9·OPT_k(X).
- The authors can run k-means++ER until it samples exactly k centers, in which case the distribution of the k sampled centers is identical to that of regular k-means++, and they upper bound the expected number R of rounds, i.e., passes over the data set.
- The algorithm chooses the first center c_1 uniformly at random from X and sets the arrival rate of each process P_t(x) to λ_t(x) = cost(x, {c_1}).
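
For concreteness, here is one oversampling round of k-means|| as described in these bullets: every point independently joins the batch Z of new centers with probability min(1, ℓ·cost(x, C)/cost(X, C)), so the expected batch size is at most ℓ. A minimal sketch with our own naming, where d2[x] holds cost(x, C):

```python
import numpy as np

def kmeans_parallel_round(X, d2, ell, rng):
    """One round of k-means|| oversampling: each point is sampled
    independently with probability min(1, ell * cost(x, C) / cost(X, C))."""
    p = np.minimum(1.0, ell * d2 / d2.sum())
    mask = rng.random(d2.shape[0]) < p
    return X[mask]        # the batch Z of new centers; E[|Z|] <= ell
```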

Conclusion

- When process P_t(x) jumps, the algorithm adds the point x ∈ X to the set of centers C_t and updates the arrival rates of all processes to λ_t(y) = cost(y, C_t) for all y ∈ X (a direct simulation of this process is sketched after this list).
- The right-hand side is the probability that the Poisson process Q_s(x) with rate 1 jumps in an interval of length ℓ·cost(x, C_{t_i})/cost(X, C_{t_i}), which is upper bounded by the expected number of jumps of Q_s(x) in this interval.
- By the analysis above, the number of new centers chosen in each round of k-means++ER is at most the size of the set Z, which is O(ℓ) with high probability.
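
The Poisson-process formulation above can be simulated directly via superposition: the next jump among all processes arrives after an Exp(total rate) wait and falls on point x with probability cost(x, C_t)/cost(X, C_t). Below is a sketch under exactly those rules; the function name and the explicit time_horizon parameter are our own choices, not the paper's interface:

```python
import numpy as np

def kmeans_pois_simulation(X, time_horizon, rng=None):
    """Simulate competing Poisson processes P_t(x) with rates
    lambda_t(x) = cost(x, C_t): whenever some P_t(x) jumps, x becomes
    a center and every rate is refreshed."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    centers = [X[rng.integers(n)]]               # c_1 uniform in X
    d2 = np.sum((X - centers[0]) ** 2, axis=1)   # current rates lambda_t(x)
    t = 0.0
    while d2.sum() > 0:
        t += rng.exponential(1.0 / d2.sum())     # next jump of the superposition
        if t > time_horizon:
            break
        idx = rng.choice(n, p=d2 / d2.sum())     # which process jumped
        centers.append(X[idx])
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.array(centers)
```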

Funding

- Konstantin Makarychev, Aravind Reddy, and Liren Shan were supported in part by NSF grants CCF-1955351 and HDR TRIPODS CCF-1934931
- Aravind Reddy was also supported in part by NSF CCF-1637585

Problem setup: Given a set of points X = {x_1, x_2, …, x_n} ⊆ ℝ^d and an integer k ≥ 1, the k-means clustering problem is to find a set C of k centers in ℝ^d minimizing cost(X, C) := Σ_{x∈X} min_{c∈C} ‖x − c‖². For any integer i ≥ 1, define OPT_i(X) := min_{|C|=i} cost(X, C).
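
As a direct transcription of this definition, here is a small, unoptimized NumPy helper (the name kmeans_cost is ours):

```python
import numpy as np

def kmeans_cost(X, C):
    """cost(X, C) = sum over x in X of min over c in C of ||x - c||^2."""
    # Pairwise squared distances, shape (|X|, |C|); minimize over centers,
    # then sum over points.
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())
```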

Study subjects and analysis

cases: 3

We need to show that f(x, y) ≤ 5·cost(x, C)·‖x − y‖². Consider three cases. Case 1: if cost(x, C) ≤ cost(y, C) ≤ ‖x − y‖², then f(x, y) = 2·cost(x, C)·cost(y, C) ≤ 2·cost(x, C)·‖x − y‖²

References

- S. Ahmadian, A. Norouzi-Fard, O. Svensson, and J. Ward. Better guarantees for k-means and Euclidean k-median by primal-dual algorithms. SIAM Journal on Computing, pages FOCS17-97, 2019.
- D. Aloise, A. Deshpande, P. Hansen, and P. Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245–248, 2009.
- D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.
- P. Awasthi, M. Charikar, R. Krishnaswamy, and A. K. Sinop. The hardness of approximation of euclidean k-means. In 31st International Symposium on Computational Geometry (SoCG 2015). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2015.
- O. Bachem, M. Lucic, and A. Krause. Distributed and provably good seedings for k-means in constant rounds. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 292–300. JMLR.org, 2017.
- B. Bahmani, B. Moseley, A. Vattani, R. Kumar, and S. Vassilvitskii. Scalable k-means++. Proceedings of the VLDB Endowment, 5(7):622–633, 2012.
- L. Becchetti, M. Bury, V. Cohen-Addad, F. Grandoni, and C. Schwiegelshohn. Oblivious dimension reduction for k-means: beyond subspaces and the Johnson–Lindenstrauss lemma. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 1039–1050, 2019.
- B. Boehmke and B. M. Greenwell. Hands-on machine learning with R. CRC Press, 2019.
- C. Boutsidis, A. Zouzias, and P. Drineas. Random projections for k-means clustering. In Advances in Neural Information Processing Systems, pages 298–306, 2010.
- T. Brunsch and H. Röglin. A bad instance for k-means++. Theoretical Computer Science, 505:19–26, 2013.
- D. Choo, C. Grunau, J. Portmann, and V. Rozhoň. k-means++: few more steps yield constant approximation. In Proceedings of the 37th International Conference on Machine Learning, pages 7849–7857. JMLR.org, 2020.
- S. Dasgupta. The hardness of k-means clustering. Department of Computer Science and Engineering, University of California, San Diego, 2008.
- S. Dasgupta. UCSD CSE 291, Lecture Notes: Geometric Algorithms, 2013. URL: https://cseweb.ucsd.edu/~dasgupta/291-geom/kmeans.pdf. Last visited on 2020/06/01.
- D. Dua and C. Graff. UCI Machine Learning Repository, 2017. URL http://archive.ics.uci.edu/ml.
- P. S. Efraimidis and P. G. Spirakis. Weighted random sampling with a reservoir. Information Processing Letters, 97(5):181–185, 2006.
- R. Elber. KDD Cup, 2004. URL http://osmot.cs.cornell.edu/kddcup/.
- W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.
- T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. A local search approximation algorithm for k-means clustering. Computational Geometry, 28(2):89–112, 2004. ISSN 0925-7721. doi: https://doi.org/10.1016/j.comgeo.2004.03.003. URL http://www.sciencedirect.com/science/article/pii/S0925772104000215.
- S. Lattanzi and C. Sohler. A better k-means++ algorithm via local search. In International Conference on Machine Learning, pages 3662–3671, 2019.
- E. Lee, M. Schmidt, and J. Wright. Improved and simplified inapproximability for k-means. Information Processing Letters, 120:40–43, 2017.
- S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
- K. Makarychev, Y. Makarychev, M. Sviridenko, and J. Ward. A bi-criteria approximation algorithm for k-means. Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, 2016.
- K. Makarychev, Y. Makarychev, and I. Razenshteyn. Performance of Johnson–Lindenstrauss transform for k-means and k-medians clustering. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 1027–1038, 2019.
- M. Mitzenmacher and E. Upfal. Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis. Cambridge University Press, 2017.
- R. Ostrovsky, Y. Rabani, L. Schulman, and C. Swamy. The effectiveness of Lloyd-type methods for the k-means problem. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06), 2006.
- V. Rozhoň. Simple and sharp analysis of k-means||. In Proceedings of the 37th International Conference on Machine Learning, pages 7828–7837. JMLR.org, 2020.
- D. Wei. A constant-factor bi-criteria approximation guarantee for k-means++. In Advances in Neural Information Processing Systems, pages 604–612, 2016.
