Improved Guarantees for k-means++ and k-means++ Parallel

NeurIPS 2020 (2020)


Abstract

In this paper, we study k-means++ and k-means++ parallel, the two most popular algorithms for the classic k-means clustering problem. We provide novel analyses and show improved approximation and bi-criteria approximation guarantees for k-means++ and k-means++ parallel. Our results give a better theoretical justification for why these algorithms perform well in practice.

Introduction
  • The authors study k-means++ and k-means|| (k-means++ parallel), the two most popular algorithms for the classic k-means clustering problem.
  • The authors improve the bound by Arthur and Vassilvitskii (2007) on the expected cost of a covered cluster in k-means++.
  • Let cost_{k+∆}(X) be the cost of the clustering with k + ∆ centers sampled by the k-means++ algorithm (a minimal sketch of the k-means++ seeding procedure follows this list).
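For concreteness, here is a minimal sketch of the D²-sampling seeding step that k-means++ uses to pick centers. It is an illustrative NumPy implementation, not the authors' code; the function and parameter names (kmeanspp_seeding, n_centers) are ours. Running it with n_centers = k + ∆ produces the bi-criteria solution whose cost cost_{k+∆}(X) is analyzed above.

    import numpy as np

    def kmeanspp_seeding(X, n_centers, seed=None):
        """D^2-sampling: each new center is drawn with probability
        proportional to its squared distance to the nearest chosen center."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        centers = [X[rng.integers(n)]]                      # first center: uniform over X
        d2 = ((X - centers[0]) ** 2).sum(axis=1)            # cost(x, C) for C = {c_1}
        for _ in range(1, n_centers):
            idx = rng.choice(n, p=d2 / d2.sum())            # cost-proportional (D^2) sampling
            centers.append(X[idx])
            d2 = np.minimum(d2, ((X - X[idx]) ** 2).sum(axis=1))  # update nearest-center costs
        return np.array(centers)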
Highlights
  • In this paper, we study k-means++ and k-means||, the two most popular algorithms for the classic k-means clustering problem
  • In Section 4, we show that for any t, we have E[H_t(X)] ≤ 5·OPT_k(X), which is an improvement over the bound of 8·OPT_k(X) given by Arthur and Vassilvitskii (2007)
  • We show an upper bound for the expected cost of the solution returned by k-means||Pois
  • In Section 7.2, we show how to efficiently implement k-means++ER using lazy updates and explain why our algorithm makes R passes over the data set
  • In the lazy version, the algorithm makes a pass over the data set and samples a new batch of centers each time an internal counter is incremented
  • We show that the expected size of the set Z is at most ℓ (a sketch of one such batch-sampling pass follows this list)
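To make the batch idea concrete, the sketch below shows one oversampling pass in the style of k-means|| (Bahmani et al., 2012): every point is added to the batch independently with probability min(1, ℓ·cost(x, C)/cost(X, C)), so the expected batch size is at most ℓ. This is a generic illustration under our own naming (oversampling_pass), not the paper's k-means++ER implementation, which additionally uses lazy updates.

    import numpy as np

    def oversampling_pass(X, centers, ell, seed=None):
        """One pass over the data: sample a batch Z of new centers, each point
        independently with probability min(1, ell * cost(x, C) / cost(X, C))."""
        rng = np.random.default_rng(seed)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(axis=1)  # cost(x, C)
        probs = np.minimum(1.0, ell * d2 / d2.sum())     # expected batch size <= ell
        mask = rng.random(len(X)) < probs
        return X[mask]                                   # the sampled batch Z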
Results
  • The authors prove bound (2) on the expected cost of the clustering returned by k-means++ after k + ∆ rounds.
  • The authors use ideas from Arthur and Vassilvitskii (2007) and Dasgupta (2013) to prove the following statement, counting the cost of uncovered clusters only when the number of misses after k rounds of k-means++ is greater than ∆/2.
  • The authors establish the first and second bounds from Theorem 5.1 on the expected cost of the clustering after k + ∆ rounds of k-means++.
  • Let cost_{k+∆}(X) be the cost of the clustering resulting from sampling k + ∆ centers according to the k-means++ algorithm.
  • The expected cost E[cost_{T+1}(X)] of the clustering returned by the k-means|| algorithm after T rounds is upper bounded, with separate bounds for the cases ℓ < k and ℓ ≥ k.
  • The authors show an upper bound for the expected cost of the solution returned by k-means||Pois.
  • After T rounds of k-means||, the expected cost of the clustering E[cost_T(X)] is at most 9·OPT_k(X).
  • The authors can run k-means++ER until it samples exactly k centers, in which case the distribution of the k sampled centers is identical to that of regular k-means++, and the expected number of rounds (passes over the data set) R is upper bounded by …
  • The algorithm chooses the first center c_1 uniformly at random from X and sets the arrival rate of each process P_t(x) to λ_t(x) = cost(x, {c_1}) (a small simulation of this exponential-race sampling follows this list).
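The arrival-rate view in the last bullet can be simulated directly: give each point an exponential clock whose rate equals its current cost; the first clock to ring selects a point with probability cost(x, C)/cost(X, C), i.e. exactly D²-sampling. The sketch below only illustrates this equivalence (names are ours, and it does not reproduce the paper's full process P_t(x)).

    import numpy as np

    def next_center_by_race(X, centers, seed=None):
        """Exponential race: the point whose clock with rate cost(x, C) rings first
        is selected with probability cost(x, C) / cost(X, C)."""
        rng = np.random.default_rng(seed)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(axis=1)  # rates = cost(x, C)
        with np.errstate(divide="ignore"):
            arrival = rng.exponential(size=len(X)) / d2   # Exp(rate) = Exp(1) / rate; inf where d2 == 0
        return X[np.argmin(arrival)]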
Conclusion
  • When process P_t(x) jumps, the algorithm adds the point x ∈ X to the set of centers C_t and updates the arrival rates of all processes to λ_t(y) = cost(y, C_t) for all y ∈ X.
  • The right-hand side is the probability that the Poisson process Q_s(x) with rate 1 jumps in an interval of length ℓ·cost(x, C_{t_i})/cost(X, C_{t_i}), which is upper bounded by the expected number of jumps of Q_s(x) in this interval (the standard Poisson facts behind this step are restated after this list).
  • According to the analysis above, the number of new centers chosen at each round of k-means++ER is at most the size of the set Z, which is O(ℓ) with high probability.
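For completeness, the two textbook Poisson facts this step relies on (our restatement, not the paper's wording): for a rate-1 Poisson process Q_s(x),

    Pr[Q_s(x) jumps in an interval of length τ] = 1 − e^(−τ) ≤ τ = E[number of jumps of Q_s(x) in that interval],

and applying this with τ = ℓ·cost(x, C_{t_i})/cost(X, C_{t_i}) gives exactly the comparison in the bullet above.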
Funding
  • Konstantin Makarychev, Aravind Reddy, and Liren Shan were supported in part by NSF grants CCF-1955351 and HDR TRIPODS CCF-1934931
  • Aravind Reddy was also supported in part by NSF CCF-1637585.
  • Given a set of points X = {x_1, x_2, …, x_n} ⊆ R^d and an integer k ≥ 1, the k-means clustering problem is to find a set C of k centers in R^d minimizing cost(X, C) := Σ_{x∈X} min_{c∈C} ||x − c||^2. For any integer i ≥ 1, define OPT_i(X) := min_{|C|=i} cost(X, C) (a direct transcription of this objective follows).
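A direct transcription of this objective as code (illustrative only; kmeans_cost is our name, and OPT_i(X) is the minimum of this quantity over all center sets of size i, which is NP-hard to compute in general):

    import numpy as np

    def kmeans_cost(X, C):
        """cost(X, C) = sum_{x in X} min_{c in C} ||x - c||^2."""
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)   # (n, |C|) squared distances
        return d2.min(axis=1).sum()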
Study subjects and analysis
cases: 3
Then, we need to show that f(x, y) ≤ 5·cost(x, C)·||x − y||^2. Consider three cases. Case 1: If cost(x, C) ≤ cost(y, C) ≤ ||x − y||^2, then f(x, y) = 2·cost(x, C)·cost(y, C) ≤ 2·cost(x, C)·||x − y||^2

References
  • S. Ahmadian, A. Norouzi-Fard, O. Svensson, and J. Ward. Better guarantees for k-means and Euclidean k-median by primal-dual algorithms. SIAM Journal on Computing, pages FOCS17-97, 2019.
  • D. Aloise, A. Deshpande, P. Hansen, and P. Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245–248, 2009.
  • D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.
  • P. Awasthi, M. Charikar, R. Krishnaswamy, and A. K. Sinop. The hardness of approximation of Euclidean k-means. In 31st International Symposium on Computational Geometry (SoCG 2015). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2015.
  • O. Bachem, M. Lucic, and A. Krause. Distributed and provably good seedings for k-means in constant rounds. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 292–300. JMLR.org, 2017.
  • B. Bahmani, B. Moseley, A. Vattani, R. Kumar, and S. Vassilvitskii. Scalable k-means++. Proceedings of the VLDB Endowment, 5(7):622–633, 2012.
  • L. Becchetti, M. Bury, V. Cohen-Addad, F. Grandoni, and C. Schwiegelshohn. Oblivious dimension reduction for k-means: beyond subspaces and the Johnson–Lindenstrauss lemma. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 1039–1050, 2019.
  • B. Boehmke and B. M. Greenwell. Hands-On Machine Learning with R. CRC Press, 2019.
  • C. Boutsidis, A. Zouzias, and P. Drineas. Random projections for k-means clustering. In Advances in Neural Information Processing Systems, pages 298–306, 2010.
  • T. Brunsch and H. Röglin. A bad instance for k-means++. Theoretical Computer Science, 505:19–26, 2013.
  • D. Choo, C. Grunau, J. Portmann, and V. Rozhoň. k-means++: few more steps yield constant approximation. In Proceedings of the 37th International Conference on Machine Learning, pages 7849–7857. JMLR.org, 2020.
  • S. Dasgupta. The hardness of k-means clustering. Department of Computer Science and Engineering, University of California, San Diego, 2008.
  • S. Dasgupta. UCSD CSE 291, Lecture Notes: Geometric Algorithms, 2013. URL: https://cseweb.ucsd.edu/~dasgupta/291-geom/kmeans.pdf. Last visited on 2020/06/01.
  • D. Dua and C. Graff. UCI Machine Learning Repository, 2017. URL: http://archive.ics.uci.edu/ml.
  • P. S. Efraimidis and P. G. Spirakis. Weighted random sampling with a reservoir. Information Processing Letters, 97(5):181–185, 2006.
  • R. Elber. KDD Cup, 2004. URL: http://osmot.cs.cornell.edu/kddcup/.
  • W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.
  • T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. A local search approximation algorithm for k-means clustering. Computational Geometry, 28(2):89–112, 2004. ISSN 0925-7721. doi: 10.1016/j.comgeo.2004.03.003. URL: http://www.sciencedirect.com/science/article/pii/S0925772104000215.
  • S. Lattanzi and C. Sohler. A better k-means++ algorithm via local search. In International Conference on Machine Learning, pages 3662–3671, 2019.
  • E. Lee, M. Schmidt, and J. Wright. Improved and simplified inapproximability for k-means. Information Processing Letters, 120:40–43, 2017.
  • S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
  • K. Makarychev, Y. Makarychev, M. Sviridenko, and J. Ward. A bi-criteria approximation algorithm for k-means. Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, 2016.
  • K. Makarychev, Y. Makarychev, and I. Razenshteyn. Performance of Johnson–Lindenstrauss transform for k-means and k-medians clustering. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 1027–1038, 2019.
  • M. Mitzenmacher and E. Upfal. Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis. Cambridge University Press, 2017.
  • R. Ostrovsky, Y. Rabani, L. Schulman, and C. Swamy. The effectiveness of Lloyd-type methods for the k-means problem. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06), 2006.
  • V. Rozhoň. Simple and sharp analysis of k-means||. In Proceedings of the 37th International Conference on Machine Learning, pages 7828–7837. JMLR.org, 2020.
  • D. Wei. A constant-factor bi-criteria approximation guarantee for k-means++. In Advances in Neural Information Processing Systems, pages 604–612, 2016.
Author
Konstantin Makarychev
Aravind Reddy
Liren Shan