From Large to Small Datasets: Size Generalization for Clustering Algorithm Selection
CoRR (2024)
Abstract
In clustering algorithm selection, we are given a massive dataset and must
efficiently select which clustering algorithm to use. We study this problem in
a semi-supervised setting, with an unknown ground-truth clustering that we can
only access through expensive oracle queries. Ideally, the clustering
algorithm's output will be structurally close to the ground truth. We approach
this problem by introducing a notion of size generalization for clustering
algorithm accuracy. We identify conditions under which we can (1) subsample the
massive clustering instance, (2) evaluate a set of candidate algorithms on the
smaller instance, and (3) guarantee that the algorithm with the best accuracy
on the small instance will have the best accuracy on the original big instance.
We provide theoretical size generalization guarantees for three classic
clustering algorithms: single-linkage, k-means++, and (a smoothed variant of)
Gonzalez's k-centers heuristic. We validate our theoretical analysis with
empirical results, observing that on real-world clustering instances, we can
use a subsample of as little as 5% of the data to identify which algorithm is
best on the full dataset.
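
The three-step procedure in the abstract (subsample, evaluate candidates on the small instance, keep the most accurate one) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the `oracle` callback, and the `fraction` parameter are hypothetical; accuracy is assumed to be the fraction of points labelled correctly under the best relabelling of clusters; and the two candidates are plain k-means++ seeding and plain farthest-first traversal, standing in for the algorithms the paper analyzes (which include single-linkage and a smoothed k-centers variant).

```python
import itertools
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def kmeanspp_seeding(points, k, rng):
    """k-means++ seeding only (no Lloyd iterations); assumes distinct points."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Sample the next center with probability proportional to D^2.
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return assign(points, centers)


def farthest_first(points, k, rng):
    """Plain farthest-first traversal (unsmoothed k-centers heuristic)."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(sq_dist(p, c) for c in centers)))
    return assign(points, centers)


def accuracy(pred, truth, k):
    """Fraction of matching labels under the best permutation (small k only)."""
    best = 0
    for perm in itertools.permutations(range(k)):
        best = max(best, sum(perm[a] == b for a, b in zip(pred, truth)))
    return best / len(truth)


def select_on_subsample(points, oracle, candidates, k, fraction, seed=0):
    """Pick the candidate with the best accuracy on a uniform subsample.

    `oracle(i)` returns the ground-truth label of point i; it is queried
    only for the sampled indices, never for the full dataset.
    """
    rng = random.Random(seed)
    m = max(k, int(fraction * len(points)))
    idx = rng.sample(range(len(points)), m)
    sub = [points[i] for i in idx]
    truth = [oracle(i) for i in idx]
    scores = {name: accuracy(alg(sub, k, rng), truth, k)
              for name, alg in candidates.items()}
    return max(scores, key=scores.get), scores
```

A call such as `select_on_subsample(points, oracle, {"kmeans++": kmeanspp_seeding, "k-centers": farthest_first}, k=2, fraction=0.05)` issues roughly `0.05 * len(points)` oracle queries. Whether the winner on the subsample is also the winner on the full instance is exactly what the paper's size generalization guarantees address; the sketch itself guarantees nothing.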