Data Stream Clustering: An In-depth Empirical Study.

Xin Wang, Zhengru Wang, Zhenyu Wu,Shuhao Zhang, Xuanhua Shi,Li Lu

Proc. ACM Manag. Data(2023)

引用 0|浏览3
暂无评分
摘要
Data Stream Clustering (DSC) plays an important role in mining continuous and unlabeled data streams in real-world applications. Over the last decades, numerous DSC algorithms have been proposed with promising clustering accuracy and efficiency. Despite the significant differences among existing DSC algorithms, they are commonly built around four key design aspects: summarizing data structure, window model, outlier detection mechanism, and offline refinement strategy. However, there is a lack of empirical studies on these key design aspects in the same codebase using real-world workloads with distinct characteristics. As a result, it is difficult for researchers to improve upon the state-of-the-art. In this paper, we conduct such a study of DSC on its four key design aspects. We implemented state-of-the-art variants of all of these design choices in an open-sourced platform from scratch and evaluated them using both real-world and synthetic workloads. Our analysis identifies the fundamental issues and trade-offs of each design choice in terms of both accuracy and efficiency. We even find that combining flexible design choices led to the development of a new algorithm called Benne, which can be tuned to achieve either better accuracy or better efficiency compared to the state-of-the-art.
更多
查看译文
关键词
data stream clustering,empirical study,in-depth
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要