Top-k Term-Proximity in Succinct Space

J. Ian Munro, Gonzalo Navarro, Jesper Sindahl Nielsen, Rahul Shah, Sharma V. Thankachan

Algorithmica (2016)

Abstract

Let 𝒟 = {𝖳_1, 𝖳_2, …, 𝖳_D} be a collection of D string documents of n characters in total, drawn from an alphabet set Σ = [σ]. The top-k document retrieval problem is to preprocess 𝒟 into a data structure that, given a query (P[1…p], k), can return the k documents of 𝒟 most relevant to the pattern P. The relevance is captured using a predefined ranking function, which depends on the set of occurrences of P in 𝖳_d. For example, it can be the term frequency (i.e., the number of occurrences of P in 𝖳_d), the term proximity (i.e., the distance between the closest pair of occurrences of P in 𝖳_d), or a pattern-independent importance score of 𝖳_d such as PageRank. Linear-space and optimal-query-time solutions already exist for the general top-k document retrieval problem. Compressed and compact space solutions are also known, but only for a few ranking functions such as term frequency and importance. However, space-efficient data structures for term-proximity-based retrieval have been elusive. In this paper we present the first sub-linear space data structure for this relevance function, which uses only o(n) bits on top of any compressed suffix array of 𝒟 and solves queries in O((p+k) polylog n) time. We also show that scores that consist of a weighted combination of term proximity, term frequency, and document importance can be handled using twice the space required to represent the text collection.
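To make the ranking functions concrete, the following is a minimal brute-force sketch of term frequency, term proximity, and top-k retrieval by proximity. It is not the succinct data structure of the paper: it scans every document in time linear in n and uses no compressed suffix array. The function names (`occurrences`, `term_frequency`, `term_proximity`, `top_k_by_proximity`) are hypothetical, and the sketch assumes that a smaller distance means a more relevant document and that documents with fewer than two occurrences of P have no defined proximity.

```python
import heapq
from typing import List, Optional, Tuple


def occurrences(text: str, pattern: str) -> List[int]:
    """Return the start positions of all (possibly overlapping) occurrences."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def term_frequency(text: str, pattern: str) -> int:
    """Number of occurrences of the pattern in the document."""
    return len(occurrences(text, pattern))


def term_proximity(text: str, pattern: str) -> Optional[int]:
    """Distance between the closest pair of occurrences.

    Returns None when the pattern occurs fewer than two times
    (an assumption of this sketch, not a definition from the paper).
    """
    pos = occurrences(text, pattern)
    if len(pos) < 2:
        return None
    # Positions are sorted, so the closest pair is always adjacent.
    return min(b - a for a, b in zip(pos, pos[1:]))


def top_k_by_proximity(
    docs: List[str], pattern: str, k: int
) -> List[Tuple[int, int]]:
    """Return up to k (proximity, doc_id) pairs, smallest proximity first."""
    scored = []
    for doc_id, text in enumerate(docs):
        prox = term_proximity(text, pattern)
        if prox is not None:
            scored.append((prox, doc_id))
    return heapq.nsmallest(k, scored)


docs = ["abcab", "ab--------ab", "ab", "abab"]
print(top_k_by_proximity(docs, "ab", 2))  # [(2, 3), (3, 0)]
```

In this toy collection, document 3 ("abab") ranks first with proximity 2, document 0 follows with proximity 3, and document 2 is excluded because it contains only one occurrence of the pattern.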

Keywords

Document indexing, Top-k document retrieval, Ranked document retrieval, Succinct data structures, Compressed data structures, Compact data structures, Proximity search
