Joint Probability Consistent Relation Analysis for Document Representation.

DASFAA(2016)

Abstract
Measuring the semantic similarity between documents is an important problem because it underlies many applications, such as document summarization, web search, and text analysis. Although many studies have addressed this problem by enriching document vectors based on the relatedness of the words involved, performance is still far from satisfactory because of insufficient data, i.e., sparse and anomalous co-occurrences between words. Such insufficient data can only yield unreliable relatedness between words. In this paper, we propose an effective approach to correct the unreliable relatedness, which keeps the joint probability of the co-occurrence of each word with itself consistently equal to that word's occurrence probability throughout the generation of the relatedness. Hence the unreliable relatedness is effectively corrected by referring to the occurrence frequencies of the words, which is confirmed both theoretically and experimentally. A thorough evaluation on real datasets shows that the approach achieves significant improvement on document clustering compared with state-of-the-art methods.
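The consistency condition stated in the abstract can be illustrated with a small sketch. This is not the paper's algorithm, which the abstract does not spell out. It only shows what it means for the joint probability of a word with itself to equal that word's occurrence probability. The sketch assumes document-level co-occurrence (two words co-occur if they appear in the same document). The function names and toy documents are hypothetical.

```python
from itertools import combinations_with_replacement

def occurrence_probabilities(docs):
    """P(w): fraction of documents that contain word w (assumed definition)."""
    n = len(docs)
    vocab = sorted(set().union(*docs))
    return {w: sum(w in d for d in docs) / n for w in vocab}

def joint_probabilities(docs):
    """P(u, v): fraction of documents that contain both u and v (assumed definition)."""
    n = len(docs)
    vocab = sorted(set().union(*docs))
    joint = {}
    for u, v in combinations_with_replacement(vocab, 2):
        p = sum(u in d and v in d for d in docs) / n
        joint[(u, v)] = joint[(v, u)] = p
    return joint

def is_self_consistent(joint, occ, tol=1e-12):
    """Check that P(w, w) equals P(w) for every word w."""
    return all(abs(joint[(w, w)] - p) <= tol for w, p in occ.items())

# Toy corpus: each document is a set of words (hypothetical data).
docs = [{"web", "search"}, {"web", "text"}, {"text", "analysis"}, {"search"}]
occ = occurrence_probabilities(docs)
joint = joint_probabilities(docs)

# Raw counts satisfy the condition by construction.
consistent = is_self_consistent(joint, occ)

# A naive rescaling of the joint table breaks the condition, which is the
# kind of drift a consistency-preserving method would have to prevent.
rescaled = {pair: 0.9 * p for pair, p in joint.items()}
rescaled_consistent = is_self_consistent(rescaled, occ)
```

With raw document-level counts the condition holds trivially. It matters once the relatedness values are smoothed or propagated, since such steps can move the diagonal entries away from the observed occurrence probabilities.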
Keywords
Document representation, Word relatedness, Joint probability consistency