Extended Strategies For Document Clustering With Word Co-Occurrences

WEB TECHNOLOGIES AND APPLICATIONS (APWEB 2015)(2015)

Abstract
To tackle the sparse data problem of the bag-of-words model for document clustering, recent strategies enrich a document with the relatedness of every word in the corpus to that document, where the relatedness is estimated by the weighted sum of word co-occurrences. However, because the overlaps between word co-occurrences are not eliminated, the relatedness is overestimated. This paper demonstrates that the weighted sum strategy gives the upper bound of the theoretical degree of relatedness. Two strategies are further proposed to approach the theoretical degree of relatedness. The first strategy is established under the extreme assumption that all the words in a document co-occur with each other. By considering the specificities of words, the second strategy gives several extended versions of the weighted sum strategy. Extensive experiments verify that document clustering incorporating the extended strategies achieves a significant performance improvement over state-of-the-art techniques.
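The overestimation described in the abstract can be illustrated with a small sketch. This is not the paper's formulation: it assumes document-level co-occurrence, unit weights, and a toy corpus, and the names `build_postings`, `sum_relatedness`, and `union_relatedness` are hypothetical. It only shows why summing co-occurrence counts can exceed a count that removes overlaps.

```python
def build_postings(corpus):
    """Map each word to the set of indices of the documents containing it."""
    postings = {}
    for idx, doc in enumerate(corpus):
        for word in set(doc):
            postings.setdefault(word, set()).add(idx)
    return postings


def cooccurring_docs(word, term, postings):
    """Documents in which `word` and `term` both appear (assumed co-occurrence unit)."""
    return postings.get(word, set()) & postings.get(term, set())


def sum_relatedness(word, doc, postings):
    """Sum of co-occurrence counts between `word` and each distinct term of `doc`.

    Unit weights are an assumption; a document shared by several terms
    is counted once per term, so overlaps are not eliminated.
    """
    return sum(len(cooccurring_docs(word, t, postings)) for t in set(doc))


def union_relatedness(word, doc, postings):
    """Same evidence with overlaps removed: each supporting document counts once."""
    covered = set()
    for t in set(doc):
        covered |= cooccurring_docs(word, t, postings)
    return len(covered)


# Toy corpus (invented for illustration only).
corpus = [
    ["apple", "banana", "cherry"],
    ["apple", "banana"],
    ["banana", "cherry"],
    ["date"],
]
postings = build_postings(corpus)
doc = ["apple", "cherry"]

summed = sum_relatedness("banana", doc, postings)
deduplicated = union_relatedness("banana", doc, postings)
print(summed, deduplicated)
```

In this toy case the first corpus document supports both "apple" and "cherry", so the summed score counts it twice. By the union bound the summed score can never fall below the overlap-free count, which is consistent with the abstract's claim that the weighted sum acts as an upper bound.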
Keywords
document enriching, word co-occurrences, specificity