Efficient phrase querying with flat position index

CIKM, pp.2001-2004, (2011)


Abstract

A large proportion of search engine queries contain phrases, namely sequences of adjacent words. In this paper, we propose to use the flat position index (a.k.a. schema-independent index) for phrase query evaluation. In the flat position index, the entire document collection is viewed as one huge sequence of tokens. Each token is represented by ...
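To make the idea of viewing the collection as one token sequence concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function names `build_flat_index` and `phrase_positions`, the whitespace tokenizer, and the toy documents are all assumptions made for illustration. Each term maps to a sorted list of global positions, and a phrase is found by shifting and intersecting those lists.

```python
from collections import defaultdict


def build_flat_index(docs):
    """Concatenate all documents into one token stream and index global positions.

    Returns (postings, boundaries): postings maps each term to a sorted list of
    global positions; boundaries[i] is the global position where document i starts.
    """
    postings = defaultdict(list)
    boundaries = []
    pos = 0
    for text in docs:
        boundaries.append(pos)
        for token in text.lower().split():  # assumed tokenizer: whitespace split
            postings[token].append(pos)
            pos += 1
    return dict(postings), boundaries


def phrase_positions(postings, phrase):
    """Return the global start positions where the words of `phrase` are adjacent."""
    terms = phrase.lower().split()
    if not terms or any(t not in postings for t in terms):
        return []
    # Shift each term's positions back by its offset in the phrase, then intersect.
    candidates = set(postings[terms[0]])
    for offset, term in enumerate(terms[1:], start=1):
        candidates &= {p - offset for p in postings[term]}
    return sorted(candidates)


# Hypothetical toy collection
docs = ["new york city hall", "york new city", "visit new york"]
postings, boundaries = build_flat_index(docs)
hits = phrase_positions(postings, "new york")
print(hits, boundaries)
```

Because positions are global, this sketch would also report a match that straddles two documents (the last token of one and the first of the next); a fuller version would presumably reject such matches using the document start offsets.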

Introduction
  • With the explosive growth of Web data, seeking information efficiently and effectively has become a very important problem for both the research community and industry.
  • Users can submit explicit phrase queries to search engines typically by enclosing them in quotation marks.
  • [9] analyzed a Twitter search log and found that about 15.22% of the queries are celebrity names, which are possible phrase queries.
  • This indicates that phrase querying is very important for social network websites, too.
Highlights
  • To find a more efficient way to evaluate phrase searches, we propose to use the flat position index for phrase query evaluation
  • We empirically find that DAAT (document-at-a-time) is more efficient for phrase querying with both the flat position index and the traditional word-level inverted index, so all our experiments are based on DAAT
  • 1) Our cache sensitive look-up table (CSLT) performs much better than all the others when the length of the posting list is smaller than 5 × 10^6; the major reason is that the expected number of cache misses for CSLT is smaller than for the others. 2) For very long posting lists, linear search performs best since we need to perform ...
  • We find that the flat position index is very efficient for phrase evaluation
  • One possible way is to explicitly store DocID and term frequency (TF) information in the flat position index; since the flat position index handles proximity information very efficiently, another promising way is to transform non-phrase queries into equivalent or approximate queries with proximity constraints
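The difference between DAAT (document-at-a-time) and TAAT (term-at-a-time) evaluation on a word-level inverted index can be sketched as follows. This is a simplified illustration, not the evaluated system: the index layout (term to document ID to in-document positions), the function names, and the toy data are assumptions, and set intersection stands in for the cursor-based merging a production engine would likely use.

```python
def phrase_taat(index, terms):
    """Term-at-a-time: consume one term's whole posting list before the next.

    An accumulator of candidate (doc_id, start) pairs spans all documents.
    """
    if not terms or any(t not in index for t in terms):
        return {}
    acc = {(d, p) for d, plist in index[terms[0]].items() for p in plist}
    for offset, term in enumerate(terms[1:], start=1):
        acc &= {(d, p - offset) for d, plist in index[term].items() for p in plist}
    result = {}
    for d, p in sorted(acc):
        result.setdefault(d, []).append(p)
    return result


def phrase_daat(index, terms):
    """Document-at-a-time: fully evaluate one document before moving on."""
    if not terms or any(t not in index for t in terms):
        return {}
    # Only documents that contain every term can match the phrase.
    common = set(index[terms[0]])
    for t in terms[1:]:
        common &= set(index[t])
    result = {}
    for d in sorted(common):
        starts = set(index[terms[0]][d])
        for offset, term in enumerate(terms[1:], start=1):
            starts &= {p - offset for p in index[term][d]}
        if starts:
            result[d] = sorted(starts)
    return result


# Hypothetical toy index: term -> {doc_id: sorted in-document positions}
index = {
    "new": {0: [0], 1: [1], 2: [1]},
    "york": {0: [1], 1: [0], 2: [2]},
    "city": {0: [2], 1: [2]},
}
print(phrase_daat(index, ["new", "york"]))
print(phrase_taat(index, ["new", "york"]))
```

Both functions return the same matches; the difference is that the TAAT accumulator holds candidates for the whole collection at once, while the DAAT loop only keeps state for the current document.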
Results
  • 3.1 Experiment Setup

    Dataset: The authors use the TREC GOV2 collection, which consists of 25.2 million web pages crawled from the .gov Internet domain.
  • Query processing: There are two main approaches to processing queries with the standard word-level inverted index [2]: DAAT (document-at-a-time) and TAAT (term-at-a-time).
  • These two approaches can be employed with the flat position index, too.
  • The authors use a 25M document boundary array, which is sampled from the GOV2 test collection.
  • Based on the analysis above, the authors use the linear search algorithm when the length of boundary array is bigger than
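A document boundary array lets a global token position be mapped back to the document that contains it. The sketch below shows two baseline look-up strategies under stated assumptions: `boundaries[i]` is taken to be the global start position of document `i`, and the function names and sample values are hypothetical. It shows a plain binary search and a linear merge scan only; the CSLT structure from the paper is not reproduced here because its layout is not described in this summary.

```python
from bisect import bisect_right


def doc_of_binary(boundaries, position):
    """Binary search: index of the last document whose start is <= position."""
    return bisect_right(boundaries, position) - 1


def docs_of_linear(boundaries, positions):
    """Linear merge: map a sorted list of positions to document IDs in one pass."""
    doc_ids = []
    doc = 0
    for p in positions:
        # Advance while the next document starts at or before p.
        while doc + 1 < len(boundaries) and boundaries[doc + 1] <= p:
            doc += 1
        doc_ids.append(doc)
    return doc_ids


boundaries = [0, 4, 7]    # hypothetical start offsets of three documents
positions = [0, 5, 8, 9]  # hypothetical sorted phrase match positions
binary_ids = [doc_of_binary(boundaries, p) for p in positions]
linear_ids = docs_of_linear(boundaries, positions)
print(binary_ids, linear_ids)
```

The linear variant touches the boundary array sequentially and never backtracks, which is one plausible reason a scan can compete with search-based look-ups when many positions must be resolved.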
Conclusion
  • In this paper, the authors propose to use the flat position index for efficient phrase querying.
  • The authors find that flat position index is very efficient for phrase evaluation.
  • The authors plan to do more exploration on how to apply flat position index to general query evaluation.
  • The flat position index can be used as an auxiliary structure to support efficient proximity-related queries.
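As an illustration of a proximity-related query over global positions, the following sketch finds all occurrences of two terms that lie within `k` tokens of each other. The function name `within_window`, the window semantics (absolute distance at most `k`), and the sample position lists are assumptions for this example, not details taken from the paper.

```python
def within_window(positions_a, positions_b, k):
    """Return (a, b) pairs of global positions with |a - b| <= k.

    Both inputs are sorted lists of global token positions from a flat index.
    """
    pairs = []
    start = 0
    for a in positions_a:
        # Skip positions of b that lie too far to the left of a.
        while start < len(positions_b) and positions_b[start] < a - k:
            start += 1
        j = start
        while j < len(positions_b) and positions_b[j] <= a + k:
            pairs.append((a, positions_b[j]))
            j += 1
    return pairs


# Hypothetical global positions of two terms in a flat position index
search_positions = [3, 20, 41]
engine_positions = [5, 30, 40, 44]
near = within_window(search_positions, engine_positions, 3)
print(near)
```

As with phrase matching on global positions, a pair found this way could span two adjacent documents, so a complete evaluator would presumably also check that both positions fall inside the same document.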
Tables
  • Table 1: Statistics of two query sets
  • Table 2: Skip distance for different posting list lengths
  • Table 3: Index size with/without nextword structure
  • Table 4: Performance of phrase querying in TREC query set
  • Table 5: Performance of phrase querying in MSN query set
Funding
  • This work has been partially supported by HGJ 2010 Grant 2011ZX01042-001-001 and by NSFC Grants No. 61073082 and 60933004.
References
  • [1] V. N. Anh and A. Moffat. Inverted index compression using word-aligned binary codes. Inf. Retr., 2005.
  • [2] V. N. Anh and A. Moffat. Structured index organizations for high-throughput text querying. In SPIRE, 2006.
  • [3] D. Bahle, H. E. Williams, and J. Zobel. Efficient phrase querying with an auxiliary index. In SIGIR, 2002.
  • [4] S. Büttcher and C. L. A. Clarke. Index compression is good, especially for random access. In CIKM, pages 761–770, New York, NY, USA, 2007. ACM.
  • [5] C. L. Clarke, G. V. Cormack, and F. J. Burkowski. An Algebra for Structured Text Search and A Framework for its Implementation. The Computer Journal, 1995.
  • [6] J. Dean. Invited talk: Challenges in building large-scale information retrieval systems. In WSDM, 2009.
  • [7] C. Silverstein, H. Marais, M. Henzinger, and M. Moricz. Analysis of a very large web search engine query log. SIGIR Forum, 1999.
  • [8] T. Strohman and W. B. Croft. Efficient document retrieval in main memory. In SIGIR, 2007.
  • [9] J. Teevan, D. Ramage, and M. R. Morris. #twittersearch: a comparison of microblog search and web search. In WSDM, 2011.
  • [10] H. E. Williams, J. Zobel, and D. Bahle. Fast phrase querying with combined indexes. ACM Trans. Inf. Syst., 2004.