Efficient phrase querying with flat position index
CIKM, pp.2001-2004, (2011)
Abstract
A large proportion of search engine queries contain phrases, namely sequences of adjacent words. In this paper, we propose to use the flat position index (a.k.a. schema-independent index) for phrase query evaluation. In the flat position index, the entire document collection is viewed as a huge sequence of tokens. Each token is represented by ...
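The idea of treating the whole collection as one token sequence can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`build_flat_index`, `doc_of`, `phrase_query`), the in-memory dictionaries, and the set-based adjacency check are all assumptions made for illustration. The sketch also assumes that a separate array of document start positions is kept, so that a match spanning two documents can be rejected.

```python
from bisect import bisect_right
from collections import defaultdict


def build_flat_index(docs):
    """Build a toy flat position index from tokenized documents.

    Returns (postings, boundaries):
      postings[term] -> sorted list of global token positions
      boundaries[i]  -> global position at which document i starts
    """
    postings = defaultdict(list)
    boundaries = []
    pos = 0
    for tokens in docs:
        boundaries.append(pos)
        for tok in tokens:
            postings[tok].append(pos)
            pos += 1
    return dict(postings), boundaries


def doc_of(boundaries, pos):
    """Map a global position to the id of the document containing it."""
    return bisect_right(boundaries, pos) - 1


def phrase_query(postings, boundaries, phrase):
    """Return sorted ids of documents containing `phrase` as adjacent tokens."""
    if not phrase or any(t not in postings for t in phrase):
        return []
    # Position sets of the 2nd, 3rd, ... phrase terms (hypothetical shortcut;
    # a real system would merge compressed posting lists instead).
    later = [set(postings[t]) for t in phrase[1:]]
    hits = set()
    for start in postings[phrase[0]]:
        if all(start + i + 1 in s for i, s in enumerate(later)):
            d = doc_of(boundaries, start)
            # Reject matches that straddle a document boundary.
            if doc_of(boundaries, start + len(phrase) - 1) == d:
                hits.add(d)
    return sorted(hits)


docs = [["new", "york", "city"], ["york", "new"], ["city", "new", "york"]]
postings, boundaries = build_flat_index(docs)
print(phrase_query(postings, boundaries, ["new", "york"]))
```

In this toy collection, "new" at the end of the second document is immediately followed by "city" at the start of the third, which is why the boundary check is needed.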
Introduction
- With the explosive growth of Web data, seeking information efficiently and effectively has become a very important problem for both the research community and industry.
- Users can submit explicit phrase queries to search engines typically by enclosing them in quotation marks.
- Teevan et al. [9] analyzed a Twitter search log and found that about 15.22% of the queries are celebrity names, which are possible phrase queries.
- This indicates that phrase querying is very important for social network websites, too.
Highlights
- To find a more efficient method for phrase search, we propose to use the flat position index for phrase query evaluation
- We empirically find that DAAT is more efficient for phrase querying on both the flat position index and the traditional word-level inverted index, so all our experiments are based on DAAT
- 1) Our cache sensitive look-up table (CSLT) performs much better than all the others when the length of the posting list is smaller than 5 × 10^6; the major reason is that the number of expected cache misses for CSLT is smaller than for the others. 2) For very long posting lists, linear search performs best since we need to perform ...
- We find that flat position index is very efficient for phrase evaluation
- One possible way is to explicitly store DocID and term frequency (TF) information in flat position index; since flat position index is very efficient to deal with proximity information, another promising way is to transform non-phrase queries into equivalent or approximate queries with proximity constraints
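The highlights mention a look-up table for mapping positions to documents, but this summary does not describe how CSLT is laid out. The Python sketch below is therefore not CSLT; it only illustrates the general idea of a look-up table over a document boundary array, using fixed-size position buckets. The class name `BucketedBoundaryLookup` and the `bucket_bits` parameter are hypothetical.

```python
from bisect import bisect_right


class BucketedBoundaryLookup:
    """Toy look-up table over a sorted array of document start positions.

    table[b] stores the id of the document that contains the first
    position of bucket b, so a lookup jumps close to the answer and
    then scans forward a few entries. This is a generic illustration,
    not the CSLT structure from the paper.
    """

    def __init__(self, boundaries, bucket_bits=4):
        self.boundaries = boundaries
        self.shift = bucket_bits
        n_buckets = (boundaries[-1] >> bucket_bits) + 1
        self.table = []
        doc = 0
        for b in range(n_buckets):
            start = b << bucket_bits
            while doc + 1 < len(boundaries) and boundaries[doc + 1] <= start:
                doc += 1
            self.table.append(doc)

    def doc_of(self, pos):
        """Return the id of the document containing global position pos."""
        # Positions past the last bucket fall back to the last table entry.
        b = min(pos >> self.shift, len(self.table) - 1)
        doc = self.table[b]
        while doc + 1 < len(self.boundaries) and self.boundaries[doc + 1] <= pos:
            doc += 1
        return doc


boundaries = [0, 3, 5, 40, 41, 100]
lookup = BucketedBoundaryLookup(boundaries)
# Cross-check against a plain binary search over the same array.
print(all(lookup.doc_of(p) == bisect_right(boundaries, p) - 1 for p in range(130)))
```

Smaller buckets shorten the forward scan at the cost of a larger table; the trade-off between table size and cache behaviour is the kind of question the paper's cache-miss analysis addresses.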
Results
- 3.1 Experiment Setup
- Dataset: The authors use the TREC GOV2 collection, which consists of 25.2 million web pages crawled from the .gov Internet domain.
- Query processing: There are two main approaches to processing queries on the standard word-level inverted index [2]: DAAT (document-at-a-time) and TAAT (term-at-a-time).
- These two approaches can be employed in flat position index, too.
- The authors use a 25M document boundary array, which is sampled from the GOV2 test collection.
- Based on the analysis above, the authors use the linear search algorithm when the length of boundary array is bigger than
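The choice between search strategies over the boundary array can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the cut-off value is truncated in this summary, so `linear_threshold` is a hypothetical parameter, and applying it to the number of positions to be mapped is also an assumption. The function names are invented for the example.

```python
from bisect import bisect_right


def docs_by_binary_search(boundaries, positions):
    """Map each position to a document id with one binary search per position."""
    return [bisect_right(boundaries, p) - 1 for p in positions]


def docs_by_linear_scan(boundaries, positions):
    """Map sorted positions to document ids in one forward pass.

    Both arrays are walked once, which pays off when there are many
    positions relative to the number of boundaries.
    """
    out = []
    doc = 0
    last = len(boundaries) - 1
    for p in positions:
        while doc < last and boundaries[doc + 1] <= p:
            doc += 1
        out.append(doc)
    return out


def positions_to_docs(boundaries, positions, linear_threshold):
    """Pick a strategy; linear_threshold is a hypothetical tuning knob."""
    if len(positions) >= linear_threshold:
        return docs_by_linear_scan(boundaries, positions)
    return docs_by_binary_search(boundaries, positions)


boundaries = [0, 3, 5, 40]
positions = [0, 2, 3, 4, 5, 39, 40, 99]
print(positions_to_docs(boundaries, positions, linear_threshold=4))
```

Both strategies return the same document ids; only the access pattern differs, which is why the better choice depends on list length.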
Conclusion
- DISCUSSION AND CONCLUSIONS
- In this paper, the authors propose to use flat position index for efficient phrase querying.
- The authors find that flat position index is very efficient for phrase evaluation.
- The authors plan to do more exploration on how to apply flat position index to general query evaluation.
- Flat position index can be used as an auxiliary structure to support efficient proximity related queries
Tables
- Table 1: Statistics of two query sets
- Table 2: Skip distance for different posting list lengths
- Table 3: Index size with/without nextword structure
- Table 4: Performance of phrase querying in TREC query set
- Table 5: Performance of phrase querying in MSN query set
Funding
- This work has been partially supported by HGJ 2010 Grant 2011ZX01042-001-001 and by NSFC Grants No. 61073082 and No. 60933004
Reference
- V. N. Anh and A. Moffat. Inverted index compression using word-aligned binary codes. Inf. Retr., 2005.
- V. N. Anh and A. Moffat. Structured index organizations for high-throughput text querying. In SPIRE, 2006.
- D. Bahle, H. E. Williams, and J. Zobel. Efficient phrase querying with an auxiliary index. In SIGIR, 2002.
- S. Büttcher and C. L. A. Clarke. Index compression is good, especially for random access. In CIKM, pages 761–770, New York, NY, USA, 2007. ACM.
- C. L. Clarke, G. V. Cormack, and F. J. Burkowski. An Algebra for Structured Text Search and A Framework for its Implementation. The Computer Journal, 1995.
- J. Dean. Invited talk: Challenges in building large-scale information retrieval systems. In WSDM, 2009.
- C. Silverstein, H. Marais, M. Henzinger, and M. Moricz. Analysis of a very large web search engine query log. SIGIR Forum, 1999.
- T. Strohman and W. B. Croft. Efficient document retrieval in main memory. In SIGIR, 2007.
- J. Teevan, D. Ramage, and M. R. Morris. #twittersearch: a comparison of microblog search and web search. In WSDM, 2011.
- H. E. Williams, J. Zobel, and D. Bahle. Fast phrase querying with combined indexes. ACM Trans. Inf. Syst., 2004.