Using the past to score the present: extending term weighting models through revision history analysis.

CIKM '10: International Conference on Information and Knowledge Management, Toronto, ON, Canada, October 2010

Abstract
The generative process underlies many information retrieval models, notably statistical language models. Yet these models only examine one (current) version of the document, effectively ignoring the actual document generation process. We posit that a considerable amount of information is encoded in the document authoring process, and this information is complementary to the word occurrence statistics upon which most modern retrieval models are based. We propose a new term weighting model, Revision History Analysis (RHA), which uses the revision history of a document (e.g., the edit history of a page in Wikipedia) to redefine term frequency, a key indicator of document topic/relevance for many retrieval models and text processing tasks. We then apply RHA to document ranking by extending two state-of-the-art text retrieval models, namely, BM25 and the generative statistical language model (LM). To the best of our knowledge, our paper is the first attempt to directly incorporate document authoring history into retrieval models. Empirical results show that RHA provides consistent improvements for state-of-the-art retrieval models, using standard retrieval tasks and benchmarks.
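
The idea of redefining term frequency from a revision history and feeding it into BM25 can be illustrated with a minimal sketch. The abstract does not give the RHA formula, so everything specific here is an assumption made for illustration: the geometric `decay` weighting, the whitespace tokenizer, and the names `revision_tf` and `bm25_term_score` are hypothetical and are not the authors' method. The BM25 term score uses the common textbook form with typical defaults `k1=1.2`, `b=0.75`.

```python
import math
from collections import Counter


def revision_tf(revisions, term, decay=0.5):
    """Revision-aware term frequency (illustrative assumption, not the paper's formula).

    `revisions` lists the document's versions as strings, oldest first.
    Each revision's raw count of `term` is weighted by decay ** age, where
    the current (last) revision has age 0, and the weighted counts are averaged.
    """
    n = len(revisions)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    # Naive lowercase whitespace tokenization, assumed for simplicity.
    counts = [Counter(rev.lower().split())[term] for rev in revisions]
    return sum(w * c for w, c in zip(weights, counts)) / sum(weights)


def bm25_term_score(tf, doc_len, avg_doc_len, df, n_docs, k1=1.2, b=0.75):
    """Single-term BM25 score; `tf` may be a raw count or a revision-aware value."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + length_norm)


if __name__ == "__main__":
    revisions = [
        "apple banana",              # oldest revision
        "apple apple banana",
        "apple apple apple cherry",  # current revision
    ]
    for term in ("apple", "banana"):
        tf = revision_tf(revisions, term)
        score = bm25_term_score(tf, doc_len=4, avg_doc_len=4, df=10, n_docs=1000)
        print(term, round(tf, 3), round(score, 3))
```

In this toy history, "banana" no longer appears in the current revision, so a model that looks only at the current version assigns it a term frequency of zero, while the revision-aware value stays positive because earlier revisions contained it. How strongly older revisions should count is exactly the kind of choice the paper's model would define; the `decay` parameter here is only a stand-in.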