A model-based approach for text clustering with outlier detection

2016 IEEE 32nd International Conference on Data Engineering (ICDE), 2016

Abstract
Text clustering is a challenging problem because text datasets are both high-dimensional and large in volume. In this paper, we propose a collapsed Gibbs sampling algorithm for the Dirichlet Process Multinomial Mixture model for text clustering (abbreviated as GSDPMM), which does not require the number of clusters to be specified in advance and can cope with the high dimensionality of text. Our extensive experimental study shows that GSDPMM achieves significantly better performance than three other clustering methods and achieves high consistency on both long and short text datasets. We found that GSDPMM has low time and space complexity and scales well to huge text datasets. We also propose some novel and effective methods to detect outliers in the dataset and to obtain the representative words of each cluster.
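As a rough illustration of the idea described in the abstract, the following is a minimal sketch of a textbook collapsed Gibbs sampler for a Dirichlet process mixture of multinomials, not the paper's reference implementation. It assumes a symmetric Dirichlet prior `beta` over words and a concentration parameter `alpha`; the function names `log_predictive` and `gibbs_dpmm`, the default hyperparameter values, and the sequential initialisation pass are all assumptions made for this example. Each document is resampled into an existing cluster with weight proportional to the cluster's size times the Dirichlet-multinomial predictive probability of the document, or into a new cluster with weight proportional to `alpha`, so the number of clusters is never fixed in advance.

```python
import math
import random
from collections import Counter


def log_predictive(doc_counts, doc_len, cluster_word_counts, cluster_total,
                   vocab_size, beta):
    """Log Dirichlet-multinomial predictive of a document given a cluster's counts."""
    logp = 0.0
    for word, count in doc_counts.items():
        base = cluster_word_counts.get(word, 0) + beta
        for j in range(count):
            logp += math.log(base + j)
    for i in range(doc_len):
        logp -= math.log(cluster_total + vocab_size * beta + i)
    return logp


def gibbs_dpmm(docs, alpha=1.0, beta=0.1, iters=20, seed=0):
    """Cluster tokenised documents; returns one cluster label per document.

    alpha, beta and iters are illustrative defaults, not values from the paper.
    """
    rng = random.Random(seed)
    vocab_size = len({w for d in docs for w in d})
    counts = [Counter(d) for d in docs]
    z = [None] * len(docs)   # current cluster label of each document
    m = {}                   # cluster label -> number of documents
    nw = {}                  # cluster label -> Counter of word occurrences
    n = {}                   # cluster label -> total number of words
    next_id = 0
    for _ in range(iters + 1):  # the first pass assigns documents sequentially
        for d, cnt in enumerate(counts):
            length = len(docs[d])
            if z[d] is not None:
                # Remove the document from its current cluster.
                k = z[d]
                m[k] -= 1
                nw[k].subtract(cnt)
                n[k] -= length
                if m[k] == 0:
                    del m[k], nw[k], n[k]
            # Unnormalised log weights for every existing cluster ...
            labels = list(m.keys())
            logw = [math.log(m[k]) + log_predictive(cnt, length, nw[k], n[k],
                                                    vocab_size, beta)
                    for k in labels]
            # ... and for opening a brand-new cluster.
            labels.append(next_id)
            logw.append(math.log(alpha) + log_predictive(cnt, length, {}, 0,
                                                         vocab_size, beta))
            top = max(logw)
            weights = [math.exp(x - top) for x in logw]
            k = rng.choices(labels, weights=weights)[0]
            if k == next_id:
                m[k], nw[k], n[k] = 0, Counter(), 0
                next_id += 1
            z[d] = k
            m[k] += 1
            nw[k].update(cnt)
            n[k] += length
    return z
```

Calling `gibbs_dpmm([["apple", "pear"], ["apple", "fig"], ["car", "bus"]])` returns a list of three integer labels; documents sharing a label are in the same cluster. Working in log space and subtracting the maximum before exponentiating keeps the weights from underflowing on long documents.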
Keywords
model-based approach,text clustering,outlier detection,collapsed Gibbs sampling algorithm,Dirichlet process multinomial mixture model,GSDPMM,space complexity,time complexity