Efficient phrase querying with flat position index
CIKM, pp. 2001-2004 (2011)
A large proportion of search engine queries contain phrases, namely sequences of adjacent words. In this paper, we propose to use the flat position index (a.k.a. schema-independent index) for phrase query evaluation. In the flat position index, the entire document collection is viewed as a huge sequence of tokens. Each token is represented by ...
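To make the idea concrete, here is a minimal Python sketch of a flat position index, assuming tokens are numbered by one global position counter and a separate boundary array records where each document starts. The function names (`build_flat_index`, `phrase_query`) and the set-based adjacency check are illustrative assumptions, not the paper's implementation, which relies on compressed posting lists and skipping.

```python
from bisect import bisect_right
from collections import defaultdict


def build_flat_index(docs):
    """Index a list of token lists as one global token sequence.

    Returns (postings, boundaries):
      postings[term] -> sorted list of global positions of that term
      boundaries[i]  -> global position at which document i starts
    """
    postings = defaultdict(list)
    boundaries = []
    pos = 0
    for tokens in docs:
        boundaries.append(pos)
        for tok in tokens:
            postings[tok].append(pos)
            pos += 1
    return postings, boundaries


def doc_of(boundaries, position):
    """Map a global position to the index of the document containing it."""
    return bisect_right(boundaries, position) - 1


def phrase_query(postings, boundaries, phrase):
    """Return (doc_index, offset_in_doc) for every occurrence of the phrase.

    Assumption: a match must lie entirely inside one document, so matches
    that straddle a document boundary are discarded.
    """
    if not phrase:
        return []
    candidates = postings.get(phrase[0], [])
    for offset, term in enumerate(phrase[1:], start=1):
        present = set(postings.get(term, []))
        # Keep a start position only if the next term sits at start + offset.
        candidates = [p for p in candidates if p + offset in present]
    hits = []
    for start in candidates:
        d = doc_of(boundaries, start)
        if doc_of(boundaries, start + len(phrase) - 1) == d:
            hits.append((d, start - boundaries[d]))
    return hits


docs = [
    ["new", "york", "city"],
    ["york", "new"],
    ["york", "times", "new", "york"],
]
postings, boundaries = build_flat_index(docs)
hits = phrase_query(postings, boundaries, ["new", "york"])
```

In this toy collection the phrase "new york" matches in the first and third documents. The adjacent pair formed by the last token of the second document and the first token of the third is rejected by the boundary check.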
- With the explosive growth of Web data, seeking information efficiently and effectively has become a very important problem for both the research community and industry.
- Users can submit explicit phrase queries to search engines typically by enclosing them in quotation marks.
- Teevan et al. analyzed the Twitter search log and found that about 15.22% of the queries are celebrity names, which are possible phrase queries.
- This indicates that phrase querying is also very important for social network websites.
- To find a more efficient approach to phrase search, we propose in this paper to use the flat position index for phrase query evaluation
- We empirically find that DAAT is more efficient for phrase querying with both the flat position index and the traditional word-level inverted index, so all our experiments are based on DAAT
- 1) Our cache sensitive look-up table (CSLT) performs much better than all the others when the length of the posting list is smaller than 5 × 10^6, mainly because the expected number of cache misses for CSLT is smaller than for the others; 2) for very long posting lists, linear search performs best since we need to perform ...
- We find that flat position index is very efficient for phrase evaluation
- One possible way is to explicitly store DocID and term frequency (TF) information in the flat position index; since the flat position index handles proximity information very efficiently, another promising way is to transform non-phrase queries into equivalent or approximate queries with proximity constraints
- 3.1 Experiment Setup
Dataset: The authors use the TREC GOV2 collection, which consists of 25.2 million web pages crawled from the .gov Internet domain.
- Query processing: There are two main approaches to processing queries with the standard word-level inverted index: DAAT (document-at-a-time) and TAAT (term-at-a-time).
- These two approaches can be employed in flat position index, too.
- The authors use a 25M document boundary array, which is sampled from GOV2 test collection.
- Based on the analysis above, the authors use the linear search algorithm when the length of boundary array is bigger than
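The trade-off between search strategies on the boundary array can be sketched as follows. This is a minimal illustration, assuming `boundaries[i]` holds the global start position of document `i` and that the phrase match positions arrive sorted. Binary search stands in here for the per-position look-up (the paper's CSLT is not reproduced), and `linear_threshold` is a hypothetical parameter, since the source does not give the cut-off value.

```python
from bisect import bisect_right


def docs_by_binary_search(boundaries, positions):
    """Look up each position independently: O(log n) per position."""
    return [bisect_right(boundaries, p) - 1 for p in positions]


def docs_by_linear_scan(boundaries, positions):
    """Merge-style scan; positions must be sorted in ascending order.

    The document cursor only moves forward, so the total cost is
    O(len(boundaries) + len(positions)) with sequential memory access.
    """
    result = []
    doc = 0
    last = len(boundaries) - 1
    for p in positions:
        # Advance while the next document starts at or before p.
        while doc < last and boundaries[doc + 1] <= p:
            doc += 1
        result.append(doc)
    return result


def map_positions_to_docs(boundaries, positions, linear_threshold):
    """Pick a strategy by the number of positions to resolve.

    linear_threshold is a hypothetical tuning knob: many look-ups favor
    one sequential pass, few look-ups favor independent searches.
    """
    if len(positions) > linear_threshold:
        return docs_by_linear_scan(boundaries, positions)
    return docs_by_binary_search(boundaries, positions)


boundaries = [0, 3, 5]            # three documents starting at 0, 3 and 5
positions = [0, 2, 3, 4, 7, 8]    # sorted global match positions
by_scan = map_positions_to_docs(boundaries, positions, linear_threshold=4)
by_search = map_positions_to_docs(boundaries, positions, linear_threshold=100)
```

Both strategies return the same document indices; only the access pattern differs, which is why the better choice depends on how many positions have to be resolved.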
- DISCUSSION AND CONCLUSIONS
In this paper, the authors propose to use flat position index for efficient phrase querying.
- The authors find that flat position index is very efficient for phrase evaluation.
- The authors plan to do more exploration on how to apply flat position index to general query evaluation.
- Flat position index can be used as an auxiliary structure to support efficient proximity related queries
- Table 1: Statistics of two query sets
- Table 2: Skip distance for different posting list lengths
- Table 3: Index size with/without nextword structure
- Table 4: Performance of phrase querying in TREC query set
- Table 5: Performance of phrase querying in MSN query set
- This work has been partially supported by HGJ 2010 Grant 2011ZX01042-001-001 and by NSFC under Grant Nos. 61073082 and 60933004
- V. N. Anh and A. Moffat. Inverted index compression using word-aligned binary codes. Inf. Retr., 2005.
- V. N. Anh and A. Moffat. Structured index organizations for high-throughput text querying. In SPIRE, 2006.
- D. Bahle, H. E. Williams, and J. Zobel. Efficient phrase querying with an auxiliary index. In SIGIR, 2002.
- S. Büttcher and C. L. A. Clarke. Index compression is good, especially for random access. In CIKM, 2007.
- C. L. Clarke, G. V. Cormack, and F. J. Burkowski. An Algebra for Structured Text Search and A Framework for its Implementation. The Computer Journal, 1995.
- J. Dean. Invited talk: Challenges in building large-scale information retrieval systems. In WSDM, 2009.
- C. Silverstein, H. Marais, M. Henzinger, and M. Moricz. Analysis of a very large web search engine query log. SIGIR Forum, 1999.
- T. Strohman and W. B. Croft. Efficient document retrieval in main memory. In SIGIR, 2007.
- J. Teevan, D. Ramage, and M. R. Morris. #twittersearch: a comparison of microblog search and web search. In WSDM, 2011.
- H. E. Williams, J. Zobel, and D. Bahle. Fast phrase querying with combined indexes. ACM Trans. Inf. Syst., 2004.