BanditPAM: Almost Linear Time k-Medoids Clustering via Multi-Armed Bandits

Mo Tiwari
James J Mayclin
Ilan Shomorony

NeurIPS 2020.

We propose Bandit-Partitioning Around Medoids (Bandit-PAM), a randomized algorithm inspired by techniques from multi-armed bandits, that significantly improves the computational efficiency of PAM.

Abstract:

Clustering is a ubiquitous task in data science. Compared to the commonly used k-means clustering algorithm, k-medoids clustering algorithms require the cluster centers to be actual data points and support arbitrary distance metrics, allowing for greater interpretability and the clustering of structured objects. Current state-of-the-art k-...

Introduction
  • Many modern data science applications require the clustering of very-large-scale data.
  • The cluster center in k-means clustering is in general not a point in the dataset and may not be interpretable in many applications
  • This is especially problematic when the data is structured, such as parse trees in context-free grammars, sparse data in recommendation systems [23], or images in computer vision where the mean image is visually random noise [23]
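The interpretability gap can be seen directly: the k-means centroid of a set of points is generally not one of them, while the medoid always is. A minimal sketch (naive O(n^2) medoid computation under the l1 distance; the data and the distance choice are illustrative, not from the paper):

```python
import numpy as np

def medoid(X, dist=lambda a, b: np.abs(a - b).sum()):
    """Return the index of the point minimizing total distance to all others.

    Naive O(n^2) computation: this is the cost that PAM-style methods pay
    and that BanditPAM's adaptive sampling avoids.
    """
    n = len(X)
    totals = [sum(dist(X[i], X[j]) for j in range(n)) for i in range(n)]
    return int(np.argmin(totals))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
m = medoid(X)            # the medoid is an actual data point...
center = X.mean(axis=0)  # ...whereas the k-means centroid need not be
```

Note how the outlier at (10, 10) drags the centroid to (2.75, 2.75), a location with no data point nearby, while the medoid stays on a real point.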
Highlights
  • Many modern data science applications require the clustering of very-large-scale data
  • We theoretically prove that Bandit-PAM reduces the complexity on the sample size n from O(n^2) to O(n log n), both for the BUILD step and each SWAP step, under reasonable assumptions that hold in many real-world datasets
  • To design the adaptive sampling strategy, we show that the BUILD step and each SWAP iteration can be formulated as a best-arm identification problem from the multi-armed bandits (MAB) literature [1, 10, 13, 14]
  • In order to prove Theorem 1, we prove a more detailed result for each call that Bandit-PAM makes to Algorithm 1
  • We run experiments on three real-world datasets to validate the expected behavior of Bandit-PAM: the MNIST hand-written digits dataset [22], the 10x Genomics 68k PBMCs scRNA-seq dataset [40], and the Code.org Hour Of Code #4 (HOC4) coding exercise submission dataset, all of which are publicly available
  • Solutions to the programming exercise are represented as abstract syntax trees (ASTs), and we consider the tree edit distance to quantify the similarity between solutions
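The best-arm view of the BUILD step can be sketched as follows: each candidate point is an arm, its unknown parameter is its mean distance to a uniformly random reference point, and arms are eliminated once their confidence interval rules them out. This is a simplified successive-elimination sketch; the batch size, confidence-radius constant, and l1 distance are illustrative choices, not the paper's exact algorithm:

```python
import numpy as np

def first_medoid_bandit(X, delta=0.05, batch=20, seed=0):
    """Pick the first BUILD medoid via best-arm identification (a sketch).

    Each point is an arm; its unknown parameter is the mean l1 distance
    to a uniformly random reference point. Arms whose lower confidence
    bound exceeds the current best arm's upper bound are eliminated.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    alive = np.arange(n)
    est = np.zeros(n)    # running mean distance estimate per arm
    pulls = np.zeros(n)  # number of reference samples per arm
    sigma = 1.0          # assumed sub-Gaussian scale (illustrative)
    while len(alive) > 1 and pulls[alive[0]] < n:
        refs = rng.integers(0, n, size=batch)
        for i in alive:
            d = np.abs(X[i] - X[refs]).sum(axis=1).mean()
            est[i] = (est[i] * pulls[i] + d * batch) / (pulls[i] + batch)
            pulls[i] += batch
        radius = sigma * np.sqrt(2 * np.log(1 / delta) / pulls[alive])
        best = est[alive].argmin()
        keep = est[alive] - radius <= est[alive][best] + radius[best]
        alive = alive[keep]
    # finish by exact computation among the surviving arms
    totals = [np.abs(X[i] - X).sum() for i in alive]
    return int(alive[int(np.argmin(totals))])
```

When many arms have mean distances far from the best arm's, they are eliminated after few samples, which is where the O(n^2)-to-O(n log n) saving comes from.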
Results
  • The authors run experiments on three real-world datasets to validate the expected behavior of Bandit-PAM: the MNIST hand-written digits dataset [22], the 10x Genomics 68k PBMCs scRNA-seq dataset [40], and the Code.org Hour Of Code #4 (HOC4) coding exercise submission dataset, all of which are publicly available.

    Datasets.
  • On scRNA-seq, the authors consider l1 distance, which is recommended [31].
  • The HOC4 dataset from Code.org [8] consists of 3,360 unique solutions to a block-based programming exercise on Code.org.
  • Solutions to the programming exercise are represented as abstract syntax trees (ASTs), and the authors consider the tree edit distance to quantify the similarity between solutions
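As a simplified illustration of comparing ASTs (true tree edit distance is more involved than this), one can serialize each tree in pre-order and compare the label sequences with ordinary edit distance; the toy ASTs below are hypothetical Hour-of-Code-style programs, not from the dataset:

```python
def levenshtein(a, b):
    """Classic edit distance between two sequences, O(len(a) * len(b)) DP."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,          # delete
                         cur[j - 1] + 1,       # insert
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitute
        prev = cur
    return prev[n]

def preorder(tree):
    """Serialize a nested-tuple AST (label, *children) in pre-order."""
    label, *children = tree
    out = [label]
    for c in children:
        out.extend(preorder(c))
    return out

ast1 = ("program", ("repeat", ("move",), ("turn_left",)))
ast2 = ("program", ("repeat", ("move",), ("turn_right",)))
d = levenshtein(preorder(ast1), preorder(ast2))  # = 1: one node relabeled
```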
Conclusion
  • Discussion and Conclusions

    In all experiments, the authors have observed that the number of SWAP iterations is very small, typically fewer than 10, justifying the assumption of an upper limit on the number of PAM SWAP steps prior to running the algorithm in Sec. 4.
  • The authors observe that for all datasets, the randomly sampled distances have an empirical distribution similar to a Gaussian distribution (Appendix Figures 4-5), justifying the sub-Gaussian assumption in Sec. 4.
  • Instructors can refer individual students to the feedback provided for their closest medoid.
  • The authors anticipate that this approach can be applied generally for students of Massive Open Online Courses (MOOCs), thereby enabling more equitable access to education and personalized feedback for students
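A quick empirical check of the kind described: sample random pairwise distances and inspect how heavy the standardized tails are. Synthetic Gaussian data stands in for a real dataset here; only the procedure, not the numbers, matches the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))      # stand-in dataset (an assumption)
i = rng.integers(0, 1000, size=5000)
j = rng.integers(0, 1000, size=5000)
mask = i != j                        # drop self-pairs (distance 0)
d = np.abs(X[i[mask]] - X[j[mask]]).sum(axis=1)  # sampled pairwise l1 distances

# If the empirical distribution is Gaussian-like, extreme standardized
# deviations should be rare, consistent with a sub-Gaussian tail assumption.
z = (d - d.mean()) / d.std()
frac_beyond_3sigma = (np.abs(z) > 3).mean()
```

For a genuinely Gaussian distribution, roughly 0.3% of standardized samples fall beyond 3 sigma; a much larger fraction would cast doubt on the assumption.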
Related work
  • Many other k-medoids algorithms exist, in addition to CLARA, CLARANS, and FastPAM as described above. Park et al [33] proposed a k-means-like algorithm that alternates between reassigning the points to their closest medoid and recomputing the medoid for each cluster until the k-medoids clustering loss can no longer be improved. Other proposals include optimizations for Euclidean space and tabu search heuristics [9]. Recent work has also focused on distributed PAM, where the dataset cannot fit on one machine [37]. All of these algorithms, however, scale quadratically in dataset size or concede the final clustering quality for improvements in runtime.
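The k-means-like alternation of Park et al. [33] can be sketched as below; the l1 distance, random initialization, and other details are illustrative simplifications, not the paper's exact procedure:

```python
import numpy as np

def kmedoids_alternating(X, k, iters=20, seed=0):
    """Alternate between assigning points to their nearest medoid and
    recomputing each cluster's exact medoid, in the spirit of Park et al.
    """
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(iters):
        # assignment step: label each point with its nearest medoid
        D = np.abs(X[:, None, :] - X[medoids][None, :, :]).sum(axis=2)
        labels = D.argmin(axis=1)
        # update step: exact medoid within each cluster (O(n_c^2) each)
        new = medoids.copy()
        for c in range(k):
            idx = np.flatnonzero(labels == c)
            if len(idx) == 0:
                continue  # keep the old medoid for an empty cluster
            Dc = np.abs(X[idx][:, None, :] - X[idx][None, :, :]).sum(axis=2)
            new[c] = idx[Dc.sum(axis=1).argmin()]
        if np.array_equal(new, medoids):
            break
        medoids = new
    labels = np.abs(X[:, None, :] - X[medoids][None, :, :]).sum(axis=2).argmin(axis=1)
    return medoids, labels
```

Each update step still recomputes medoids exactly within clusters, so the method remains quadratic in cluster size, illustrating the scaling limitation noted above.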

    The idea of algorithm acceleration by converting a computational problem into a statistical estimation problem and designing the adaptive sampling procedure via multi-armed bandits has witnessed a few recent successes [7, 18, 24, 15, 3, 39]. In the context of k-medoids clustering, previous work [2, 4] has considered finding the single medoid of a set of points (i.e. the 1-medoid problem). In these works, the 1-medoid problem was also formulated as a best-arm identification problem, with each point being an arm and its average distance to other points being the arm parameter.
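The fixed-budget core of this sampling idea can be sketched as: estimate each arm's mean distance from a few random reference points, then take the argmin. The adaptive, confidence-bound-based sample allocation that [2, 4] actually use is omitted here for brevity:

```python
import numpy as np

def medoid_estimate(X, m=200, seed=0):
    """Estimate each point's mean distance to the rest from m random
    reference points and return the argmin index (fixed-budget sketch)."""
    rng = np.random.default_rng(seed)
    refs = rng.integers(0, len(X), size=m)
    # (n, m) matrix of l1 distances from every point to the sampled refs
    D = np.abs(X[:, None, :] - X[refs][None, :, :]).sum(axis=2)
    return int(D.mean(axis=1).argmin())
```

Only n * m distances are evaluated instead of all n^2, at the cost of returning an approximate medoid with high probability rather than the exact one.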
Reference
  • [1] Jean-Yves Audibert and Sébastien Bubeck. Best arm identification in multi-armed bandits. arXiv preprint, 2010.
  • [2] Vivek Bagaria, Govinda Kamath, Vasilis Ntranos, Martin Zhang, and David Tse. Medoids in almost-linear time via multi-armed bandits. In International Conference on Artificial Intelligence and Statistics, pages 500–509, 2018.
  • [3] Vivek Bagaria, Govinda M Kamath, and David N Tse. Adaptive Monte-Carlo optimization. arXiv preprint arXiv:1805.08321, 2018.
  • [4] Tavor Baharav and David Tse. Ultra fast medoid identification via correlated sequential halving. In Advances in Neural Information Processing Systems, pages 3650–3659, 2019.
  • [5] Paul S Bradley, Olvi L Mangasarian, and W Nick Street. Clustering via concave minimization. In Advances in Neural Information Processing Systems, pages 368–374, 1997.
  • [6] Donald Cameron and Ian Jones. John Snow, the Broad Street pump and modern epidemiology. International Journal of Epidemiology, 12:393–396, 1983.
  • [7] Hyeong Soo Chang, Michael C Fu, Jiaqiao Hu, and Steven I Marcus. An adaptive sampling algorithm for solving Markov decision processes. Operations Research, 53(1):126–139, 2005.
  • [8] Code.org. Research at Code.org. Code.org, 2013.
  • [9] Vladimir Estivill-Castro and Michael E Houle. Robust distance-based clustering with applications to spatial data mining. Algorithmica, 30(2):216–242, 2001.
  • [10] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In International Conference on Computational Learning Theory, pages 255–270.
  • [11] Kent Hymel, Kenneth Small, and Kurt Van Dender. Induced demand and rebound effects in road transport. Transportation Research Part B: Methodological, 44:1220–1241, 2010.
  • [12] Anil K Jain and Richard C Dubes. Algorithms for Clustering Data. Prentice-Hall, Inc., 1988.
  • [13] Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil' UCB: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, pages 423–439, 2014.
  • [14] Kevin Jamieson and Robert Nowak. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In 48th Annual Conference on Information Sciences and Systems (CISS), pages 1–6. IEEE, 2014.
  • [15] Kevin Jamieson and Ameet Talwalkar. Non-stochastic best arm identification and hyperparameter optimization. In Artificial Intelligence and Statistics, pages 240–248, 2016.
  • [16] Leonard Kaufman and Peter J Rousseeuw. Clustering by means of medoids. In Statistical Data Analysis Based on the L1 Norm, Y. Dodge, Ed., pages 405–416, 1987.
  • [17] Leonard Kaufman and Peter J Rousseeuw. Partitioning around medoids (Program PAM). In Finding Groups in Data: An Introduction to Cluster Analysis, pages 68–125, 1990.
  • [18] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pages 282–293.
  • [19] Branislav Kveton, Csaba Szepesvari, and Mohammad Ghavamzadeh. Perturbed-history exploration in stochastic multi-armed bandits. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, 2019.
  • [20] Branislav Kveton, Csaba Szepesvari, Sharan Vaswani, Zheng Wen, Mohammad Ghavamzadeh, and Tor Lattimore. Garbage in, reward out: Bootstrapping exploration in multi-armed bandits. In International Conference on Machine Learning, pages 3601–3610, 2019.
  • [21] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
  • [22] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [23] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. Mining of Massive Datasets. Cambridge University Press, 2020.
  • [24] Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560, 2016.
  • [25] Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
  • [26] Malte Luecken and Fabian Theis. Current best practices in single-cell RNA-seq analysis: a tutorial. Molecular Systems Biology, 15(6):e8746, 2019.
  • [27] James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. Oakland, CA, USA, 1967.
  • [28] Nina Mishra, Robert Schreiber, Isabelle Stanton, and Robert E Tarjan. Clustering social networks. In International Workshop on Algorithms and Models for the Web-Graph, pages 56–67.
  • [29] Gonzalo Navarro. A guided tour to approximate string matching. ACM Computing Surveys (CSUR), 33(1):31–88, 2001.
  • [30] Raymond T. Ng and Jiawei Han. CLARANS: A method for clustering objects for spatial data mining. IEEE Transactions on Knowledge and Data Engineering, 14(5):1003–1016, 2002.
  • [31] Vasilis Ntranos, Govinda M Kamath, Jesse M Zhang, Lior Pachter, and David N Tse. Fast and accurate single-cell RNA-seq analysis by clustering of transcript-compatibility counts. Genome Biology, 17(1):112, 2016.
  • [32] Michael L Overton. A quadratically convergent method for minimizing a sum of Euclidean norms. Mathematical Programming, 27(1):34–63, 1983.
  • [33] Hae-Sang Park and Chi-Hyuck Jun. A simple and fast algorithm for k-medoids clustering. Expert Systems with Applications, 36(2):3336–3341, 2009.
  • [34] Alan P Reynolds, Graeme Richards, Beatriz de la Iglesia, and Victor J Rayward-Smith. Clustering rules: a comparison of partitioning and hierarchical clustering algorithms. Journal of Mathematical Modelling and Algorithms, 5(4):475–504, 2006.
  • [35] Erich Schubert and Peter J Rousseeuw. Faster k-medoids clustering: improving the PAM, CLARA, and CLARANS algorithms. In International Conference on Similarity Search and Applications, pages 171–187.
  • [36] Erich Schubert and Arthur Zimek. ELKI: A large open-source library for data analysis - ELKI release 0.7.5 "Heidelberg". arXiv preprint arXiv:1902.03616, 2019.
  • [37] Hwanjun Song, Jae-Gil Lee, and Wook-Shin Han. PAMAE: Parallel k-medoids clustering with high accuracy and efficiency. In Proc.
  • [38] Chi-Hua Wang, Yang Yu, Botao Hao, and Guang Cheng. Residual bootstrap exploration for bandit algorithms. arXiv preprint, 2020.
  • [39] Martin Zhang, James Zou, and David Tse. Adaptive Monte Carlo multiple testing via multi-armed bandits. In International Conference on Machine Learning, pages 7512–7522, 2019.
  • [40] Grace XY Zheng, Jessica M Terry, Phillip Belgrader, Paul Ryvkin, Zachary W Bent, Ryan Wilson, Solongo B Ziraldo, Tobias D Wheeler, Geoff P McDermott, Junjie Zhu, et al. Massively parallel digital transcriptional profiling of single cells. Nature Communications, 8(1):1–12, 2017.