# BanditPAM: Almost Linear Time k-Medoids Clustering via Multi-Armed Bandits

NeurIPS 2020.

Abstract:

Clustering is a ubiquitous task in data science. Compared to the commonly used k-means clustering algorithm, k-medoids clustering algorithms require the cluster centers to be actual data points and support arbitrary distance metrics, allowing for greater interpretability and the clustering of structured objects. Current state-of-the-art k-...

Introduction

- Many modern data science applications require the clustering of very-large-scale data.
- The cluster center in k-means clustering is in general not a point in the dataset and may not be interpretable in many applications
- This is especially problematic when the data is structured, such as parse trees in context-free grammars, sparse data in recommendation systems [23], or images in computer vision where the mean image is visually random noise [23]

Highlights

- Many modern data science applications require the clustering of very-large-scale data
- We theoretically prove that Bandit-PAM (Partitioning Around Medoids) reduces the complexity in the sample size n from O(n²) to O(n log n), both for the BUILD step and each SWAP step, under reasonable assumptions that hold in many real-world datasets
- To design the adaptive sampling strategy, we show that the BUILD step and each SWAP iteration can be formulated as a best-arm identification problem from the multi-armed bandits (MAB) literature [1, 10, 13, 14]
- In order to prove Theorem 1, we prove a more detailed result for each call that Bandit-PAM makes to Algorithm 1
- We run experiments on three real-world datasets to validate the expected behavior of Bandit-PAM: the MNIST hand-written digits dataset [22], the 10x Genomics 68k PBMCs scRNA-seq dataset [40], and the Code.org Hour Of Code #4 (HOC4) coding exercise submission dataset, all of which are publicly available
- Solutions to the programming exercise are represented as abstract syntax trees (ASTs), and we consider the tree edit distance to quantify the similarity between solutions

Results

- The authors run experiments on three real-world datasets to validate the expected behavior of Bandit-PAM: the MNIST hand-written digits dataset [22], the 10x Genomics 68k PBMCs scRNA-seq dataset [40], and the Code.org Hour Of Code #4 (HOC4) coding exercise submission dataset, all of which are publicly available.

Datasets:
- On the scRNA-seq dataset, the authors use the l1 distance, as recommended [31].
- The HOC4 dataset from Code.org [8] consists of 3,360 unique solutions to a block-based programming exercise.
- Solutions are represented as abstract syntax trees (ASTs), and the authors use the tree edit distance to quantify the similarity between solutions.
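Because k-medoids only ever touches the data through pairwise distances, any metric can be plugged in, including the tree edit distance on ASTs above. As a toy illustration (sequence edit distance is the string analogue of tree edit distance; the function and sample values below are illustrative, not from the paper), here is an exact 1-medoid under an arbitrary metric:

```python
def levenshtein(a, b):
    """Edit distance between two sequences: the string analogue of the
    tree edit distance used for ASTs. Standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution / match
        prev = cur
    return prev[-1]

def medoid(items, dist):
    """Exact 1-medoid under an arbitrary metric: O(n^2) distance calls."""
    return min(items, key=lambda x: sum(dist(x, y) for y in items))

print(medoid(["aa", "ab", "bb"], levenshtein))  # -> ab
```

The exact computation is quadratic in the number of items; the bandit-based sampling described in this paper is precisely what reduces that cost.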

Conclusion

**Discussion and Conclusions**

- In all experiments, the authors observed that the number of SWAP steps is very small, typically fewer than 10, justifying the assumption (Sec. 4) that an upper limit on the number of PAM SWAP steps is known prior to running the algorithm.
- The authors observe that, for all datasets, the randomly sampled distances have an empirical distribution close to Gaussian (Appendix Figures 4-5), justifying the sub-Gaussian assumption in Sec. 4.
- Instructors can refer individual students to the feedback provided for their closest medoid.
- The authors anticipate that this approach can be applied generally for students of Massive Open Online Courses (MOOCs), thereby enabling more equitable access to education and personalized feedback for students
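The distance-distribution check mentioned above can be reproduced in spirit on synthetic data (the dataset, sample sizes, and threshold here are illustrative, not the authors' setup): sample random pairwise l1 distances and verify that the standardized values have light, Gaussian-like tails.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))                  # stand-in for a real dataset
i = rng.integers(0, 1000, 500)
j = rng.integers(0, 1000, 500)
mask = i != j                                    # drop degenerate self-pairs
d = np.abs(X[i[mask]] - X[j[mask]]).sum(axis=1)  # sampled l1 distances
z = (d - d.mean()) / d.std()
# under (sub-)Gaussian behavior, almost no standardized distance exceeds 3
print(float(np.mean(np.abs(z) > 3)))
```

A heavy-tailed distance distribution would show up here as a noticeably larger fraction of samples beyond three standard deviations.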

Related work

- Many other k-medoids algorithms exist in addition to CLARA, CLARANS, and FastPAM, described above. Park et al. [33] proposed a k-means-like algorithm that alternates between reassigning points to their closest medoid and recomputing the medoid of each cluster until the k-medoids clustering loss can no longer be improved. Other proposals include optimizations for Euclidean space and tabu search heuristics [9]. Recent work has also considered distributed PAM, where the dataset cannot fit on one machine [37]. All of these algorithms, however, either scale quadratically in the dataset size or sacrifice final clustering quality for improvements in runtime.
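The alternating scheme of Park et al. [33] can be sketched as follows (a minimal illustration in their spirit, not their exact algorithm; it operates on a precomputed distance matrix, so any metric works):

```python
import numpy as np

def alternating_kmedoids(D, k, iters=100, seed=0):
    """k-means-style k-medoids: alternate (1) assigning each point to its
    nearest medoid and (2) recomputing each cluster's medoid, until the
    medoid set stops changing. D is an n x n pairwise distance matrix."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(iters):
        assign = D[:, medoids].argmin(axis=1)        # step 1: assignment
        new = []
        for c in range(k):                           # step 2: per-cluster medoid
            members = np.where(assign == c)[0]
            if len(members):
                new.append(members[D[np.ix_(members, members)].sum(axis=0).argmin()])
            else:
                new.append(medoids[c])               # keep old medoid if cluster empties
        if set(new) == set(medoids):
            break
        medoids = np.array(new)
    return np.sort(medoids), assign
```

Each iteration still costs O(n²) distance lookups in the worst case, which is the quadratic scaling the related-work discussion refers to.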

The idea of accelerating an algorithm by converting a computational problem into a statistical estimation problem, with an adaptive sampling procedure designed via multi-armed bandits, has seen several recent successes [7, 18, 24, 15, 3, 39]. In the context of k-medoids clustering, previous work [2, 4] has considered finding the single medoid of a set of points (i.e., the 1-medoid problem). In these works, the 1-medoid problem was likewise formulated as a best-arm identification problem, with each point being an arm and its average distance to the other points being the arm parameter.
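As a concrete illustration of this best-arm formulation, here is a minimal successive-elimination sketch for the 1-medoid problem (a simplified stand-in, not the authors' Algorithm 1): each point is an arm, a "pull" is a distance evaluation against a randomly sampled reference point, and arms whose lower confidence bound exceeds the best upper confidence bound are eliminated.

```python
import numpy as np

def medoid_bandit(X, batch=20, delta=0.01, seed=0):
    """Successive elimination for the 1-medoid: each point is an 'arm' whose
    unknown parameter is its mean l1 distance to the rest of the dataset."""
    rng = np.random.default_rng(seed)
    n = len(X)
    active = np.arange(n)
    est = np.zeros(n)    # running estimate of each arm's mean distance
    pulls = np.zeros(n)  # number of reference points sampled per arm
    while len(active) > 1 and pulls[active[0]] < n:
        refs = rng.integers(0, n, size=batch)           # one batch of reference points
        for i in active:
            d = np.abs(X[i] - X[refs]).sum(axis=1)      # l1 distances to the batch
            est[i] = (est[i] * pulls[i] + d.sum()) / (pulls[i] + batch)
            pulls[i] += batch
        # confidence radius under a sub-Gaussian assumption (scale ~ 1 assumed)
        ci = np.sqrt(2 * np.log(1 / delta) / pulls[active])
        best_ucb = (est[active] + ci).min()
        active = active[est[active] - ci <= best_ucb]   # eliminate clearly bad arms
    # finish exactly over the surviving candidates
    exact = [np.abs(X[i] - X).sum(axis=1).mean() for i in active]
    return int(active[int(np.argmin(exact))])
```

When the distance distribution concentrates, most arms are eliminated after a few batches, so far fewer than n distance evaluations per arm are needed; this is the mechanism behind the O(n log n) scaling claimed in the paper.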

References

- Jean-Yves Audibert and Sébastien Bubeck. Best arm identification in multi-armed bandits. In Conference on Learning Theory, 2010.
- Vivek Bagaria, Govinda Kamath, Vasilis Ntranos, Martin Zhang, and David Tse. Medoids in almost-linear time via multi-armed bandits. In International Conference on Artificial Intelligence and Statistics, pages 500–509, 2018.
- Vivek Bagaria, Govinda M Kamath, and David N Tse. Adaptive monte-carlo optimization. arXiv preprint arXiv:1805.08321, 2018.
- Tavor Baharav and David Tse. Ultra fast medoid identification via correlated sequential halving. In Advances in Neural Information Processing Systems, pages 3650–3659, 2019.
- Paul S Bradley, Olvi L Mangasarian, and W Nick Street. Clustering via concave minimization. In Advances in Neural Information Processing Systems, pages 368–374, 1997.
- Donald Cameron and Ian Jones. John Snow, the Broad Street pump and modern epidemiology. International Journal of Epidemiology, 12:393–396, 1983.
- Hyeong Soo Chang, Michael C Fu, Jiaqiao Hu, and Steven I Marcus. An adaptive sampling algorithm for solving markov decision processes. Operations Research, 53(1):126–139, 2005.
- Code.org. Research at code.org. Code.org, 2013.
- Vladimir Estivill-Castro and Michael E Houle. Robust distance-based clustering with applications to spatial data mining. Algorithmica, 30(2):216–242, 2001.
- Eyal Even-Dar, Shie Mannor, and Yishay Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In International Conference on Computational Learning Theory, pages 255–270, 2002.
- Kent Hymel, Kenneth Small, and Kurt Van Dender. Induced demand and rebound effects in road transport. Transportation Research Part B: Methodological, 44:1220–1241, 2010.
- Anil K Jain and Richard C Dubes. Algorithms for clustering data. Prentice-Hall, Inc., 1988.
- Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil’ucb: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, pages 423–439, 2014.
- Kevin Jamieson and Robert Nowak. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In 2014 48th Annual Conference on Information Sciences and Systems (CISS), pages 1–6. IEEE, 2014.
- Kevin Jamieson and Ameet Talwalkar. Non-stochastic best arm identification and hyperparameter optimization. In Artificial Intelligence and Statistics, pages 240–248, 2016.
- Leonard Kaufman and Peter J Rousseeuw. Clustering by means of medoids. statistical data analysis based on the l1 norm. Y. Dodge, Ed, pages 405–416, 1987.
- Leonard Kaufman and Peter J Rousseeuw. Partitioning around medoids (program pam). Finding groups in data: an introduction to cluster analysis, pages 68–125, 1990.
- Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pages 282–293, 2006.
- Branislav Kveton, Csaba Szepesvari, and Mohammad Ghavamzadeh. Perturbed-history exploration in stochastic multi-armed bandits. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, 2019.
- Branislav Kveton, Csaba Szepesvari, Sharan Vaswani, Zheng Wen, Mohammad Ghavamzadeh, and Tor Lattimore. Garbage in, reward out: Bootstrapping exploration in multi-armed bandits. In International Conference on Machine Learning, page 3601–3610, 2019.
- Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
- Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. Mining of massive data sets. Cambridge university press, 2020.
- Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560, 2016.
- Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
- Malte Luecken and Fabian Theis. Current best practices in single-cell RNA-seq analysis: a tutorial. Molecular Systems Biology, 15(6):e8746, 2019.
- James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281–297. Oakland, CA, USA, 1967.
- Nina Mishra, Robert Schreiber, Isabelle Stanton, and Robert E Tarjan. Clustering social networks. In International Workshop on Algorithms and Models for the Web-Graph, pages 56–67, 2007.
- Gonzalo Navarro. A guided tour to approximate string matching. ACM computing surveys (CSUR), 33(1):31–88, 2001.
- Raymond T. Ng and Jiawei Han. CLARANS: A method for clustering objects for spatial data mining. IEEE Transactions on Knowledge and Data Engineering, 14(5):1003–1016, 2002.
- Vasilis Ntranos, Govinda M Kamath, Jesse M Zhang, Lior Pachter, and N Tse David. Fast and accurate single-cell rna-seq analysis by clustering of transcript-compatibility counts. Genome biology, 17(1):112, 2016.
- Michael L Overton. A quadratically convergent method for minimizing a sum of euclidean norms. Mathematical Programming, 27(1):34–63, 1983.
- Hae-Sang Park and Chi-Hyuck Jun. A simple and fast algorithm for k-medoids clustering. Expert systems with applications, 36(2):3336–3341, 2009.
- Alan P Reynolds, Graeme Richards, Beatriz de la Iglesia, and Victor J Rayward-Smith. Clustering rules: a comparison of partitioning and hierarchical clustering algorithms. Journal of Mathematical Modelling and Algorithms, 5(4):475–504, 2006.
- Erich Schubert and Peter J Rousseeuw. Faster k-medoids clustering: Improving the PAM, CLARA, and CLARANS algorithms. In International Conference on Similarity Search and Applications, pages 171–187, 2019.
- Erich Schubert and Arthur Zimek. ELKI: A large open-source library for data analysis (ELKI release 0.7.5 "Heidelberg"). arXiv preprint arXiv:1902.03616, 2019.
- Hwanjun Song, Jae-Gil Lee, and Wook-Shin Han. PAMAE: Parallel k-medoids clustering with high accuracy and efficiency. In Proc.
- Chi-Hua Wang, Yang Yu, Botao Hao, and Guang Cheng. Residual bootstrap exploration for bandit algorithms. arXiv preprint, 2020.
- Martin Zhang, James Zou, and David Tse. Adaptive monte carlo multiple testing via multi-armed bandits. In International Conference on Machine Learning, pages 7512–7522, 2019.
- Grace XY Zheng, Jessica M Terry, Phillip Belgrader, Paul Ryvkin, Zachary W Bent, Ryan Wilson, Solongo B Ziraldo, Tobias D Wheeler, Geoff P McDermott, Junjie Zhu, et al. Massively parallel digital transcriptional profiling of single cells. Nature communications, 8(1):1–12, 2017.
