Wordnet-based metrics do not seem to help document clustering
msra
Abstract
In most document clustering systems documents are represented as normalized bags of words and clustering is done maximizing cosine similarity between documents in the same cluster. While this representation was found to be very effective at many different types of clustering, it has some intuitive drawbacks. One such drawback is that documents containing words with similar meanings might be considered very different if they use different words to say the same thing. This happens because in a traditional bag of words, all words are assumed to be orthogonal. In this paper we examine many possible ways of using WordNet to mitigate this problem, and find that WordNet does not help clustering if used only as a means of finding word similarity.
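The orthogonality drawback described above can be illustrated with a minimal sketch. It assumes whitespace tokenization and raw term counts normalized to unit length; the helper names `bag_of_words` and `cosine` and the three toy sentences are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Return a unit-length bag-of-words vector as a {word: weight} dict."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}  # empty document: no words, no vector
    return {word: c / norm for word, c in counts.items()}


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (their dot product)."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


# Toy documents (illustrative only).
doc_a = bag_of_words("the car is fast")
doc_b = bag_of_words("an automobile moves quickly")  # same meaning, no shared words
doc_c = bag_of_words("the car is red")               # different meaning, shared words

sim_ab = cosine(doc_a, doc_b)
sim_ac = cosine(doc_a, doc_c)
print(sim_ab, sim_ac)
```

Because each distinct word is its own dimension, `doc_a` and `doc_b` share no dimensions and their similarity is zero even though they say roughly the same thing, while `doc_a` and `doc_c` score high on word overlap alone. This is the gap that a word-similarity resource such as WordNet is meant to close.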