Short text clustering based on Pitman-Yor process mixture model

Jipeng Qiang,Yun Li,Yunhao Yuan,Xindong Wu

Appl. Intell.（2017）

引用 34|浏览58

暂无评分

摘要

For finding the appropriate number of clusters in short text clustering, models based on Dirichlet Multinomial Mixture (DMM) require the maximum possible cluster number before inferring the real number of clusters. However, it is difficult to choose a proper number as we do not know the true number of clusters in short texts beforehand. The cluster distribution in DMM based on Dirichlet process as prior goes down exponentially as the number of clusters increases. Therefore, we propose a novel model based on Pitman-Yor Process to capture the power-law phenomenon of the cluster distribution in the paper. Specifically, each text chooses one of the active clusters or a new cluster with probabilities derived from the Pitman-Yor Process Mixture model (PYPM). Discriminative words and nondiscriminative words are identified automatically to help enhance text clustering. Parameters are estimated efficiently by collapsed Gibbs sampling and experimental results show PYPM is robust and effective comparing with the state-of-the-art models.

查看译文

关键词

LDA,Pitman-Yor process,Short text clustering

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要