Supporting Interoperability Between Open-Source Search Engines with the Common Index File Format

Chris Kamphuis, Antonio Mallia, Michał Siedlaczek, Arjen de Vries

SIGIR '20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, China, July 2020, pp. 2149–2152.

Abstract:

There exists a natural tension between encouraging a diverse ecosystem of open-source search engines and supporting fair, replicable comparisons across those systems. To balance these two goals, we examine two approaches to providing interoperability between the inverted indexes of several systems. The first takes advantage of internal abstractions around index structures, building wrappers that allow one system to directly read the indexes of another. The second involves sharing index structures across systems via a data exchange specification that we have developed, called the Common Index File Format (CIFF). We demonstrate the first approach with the Java systems Anserini and Terrier, and the second approach with Anserini, JASSv2, OldDog, PISA, and Terrier. Overall, we recommend CIFF as a low-effort approach to supporting independent innovation while enabling the fair evaluations that are critical for driving the field forward.

Introduction
  • Academic information retrieval researchers often share their innovations in open-source search engines, a tradition that dates back to the SMART system in the mid 1980s [2].
  • Many mundane details such as the stemmer, stopwords list, and other difficult-to-document implementation choices matter a great deal, often having a greater impact than more substantive differences such as ranking models.
  • These issues affect efficiency-focused studies—for example, the presence or absence of stopwords alters skipping behavior during postings traversal.
  • If the authors are able to devise a mechanism for different search engines to share index structures, this would represent substantial progress towards achieving the aforementioned goals.
Highlights
  • Academic information retrieval researchers often share their innovations in open-source search engines, a tradition that dates back to the SMART system in the mid-1980s [2].
  • If we are able to devise a mechanism for different search engines to share index structures, this would represent substantial progress towards achieving our aforementioned goals.
  • Our efforts brought together researchers who have built a number of open-source search engines.
  • As an example of the wrapper approach, we describe how interoperability between Terrier and Anserini, both Java-based systems, is achieved by wrapping the Lucene indexes generated by Anserini in Terrier APIs, such that Terrier can directly traverse Lucene postings for query evaluation.
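
To make the wrapper approach concrete, below is a minimal Java sketch (not Terrier's or Anserini's actual integration code) of how standard Lucene APIs can traverse the postings of a single term in an existing Lucene index, which is conceptually what the Terrier wrapper must do when it reads indexes built by Anserini. The field name "contents" and the index path argument are assumptions for illustration.

```java
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.store.FSDirectory;

public class LucenePostingsWalk {
  public static void main(String[] args) throws Exception {
    // Open an existing Lucene index, e.g., one built by Anserini.
    try (IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(args[0])))) {
      Term term = new Term("contents", "retrieval");  // field name is an assumption
      for (LeafReaderContext leaf : reader.leaves()) {
        // Per-segment postings for the term, with term frequencies.
        PostingsEnum postings = leaf.reader().postings(term, PostingsEnum.FREQS);
        if (postings == null) continue;  // term absent from this segment
        while (postings.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
          int docid = leaf.docBase + postings.docID();  // segment-local to global docid
          System.out.println(docid + "\t" + postings.freq());
        }
      }
    }
  }
}
```
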
Methods
  • The authors' efforts brought together researchers who have built a number of open-source search engines: Anserini [15] is an IR toolkit built on the popular open-source Lucene search library; JASSv2 [12], written in C++, uses an impact-ordered index and processes postings Score-at-a-Time (a simplified sketch of Score-at-a-Time traversal follows this list).
  • The authors used two test collections. Robust04: TREC Disks 4 & 5 (excluding Congressional Record), with topics and relevance judgments from the ad hoc tasks at TREC-6 through TREC-8 and the Robust Tracks at TREC 2003 and 2004. ClueWeb12-B13: a web crawl from Carnegie Mellon University, with topics and relevance judgments from the TREC 2013 and 2014 Web Tracks.
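
Score-at-a-Time processing, as used by JASSv2, differs from the Document-at-a-Time traversal most systems use: postings are grouped into segments sharing a precomputed (quantized) impact score, segments are visited in decreasing impact order, and evaluation may stop early once a postings budget is exhausted. The following is a minimal, self-contained Java sketch of that idea; the class and method names are hypothetical and do not correspond to JASSv2's actual C++ implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SaatSketch {
  // One impact-ordered segment: all documents in which some query term
  // has the same quantized impact score.
  record Segment(int impact, int[] docids) {}

  // Visit segments in decreasing impact order, accumulating impacts per docid,
  // and stop early after a fixed postings budget (early termination).
  static Map<Integer, Integer> score(List<Segment> segments, int postingsBudget) {
    segments.sort(Comparator.comparingInt(Segment::impact).reversed());
    Map<Integer, Integer> accumulators = new HashMap<>();
    int processed = 0;
    for (Segment s : segments) {
      for (int docid : s.docids()) {
        accumulators.merge(docid, s.impact(), Integer::sum);
        if (++processed >= postingsBudget) return accumulators;
      }
    }
    return accumulators;
  }

  public static void main(String[] args) {
    // Segments contributed by the query terms; highest impact is visited first.
    List<Segment> segments = new ArrayList<>(List.of(
        new Segment(9, new int[]{3, 17}),
        new Segment(7, new int[]{3, 5, 42}),
        new Segment(2, new int[]{17, 99})));
    System.out.println(score(segments, 1000));
  }
}
```
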
Results
  • Results for Robust04 are shown in Table 1, where it is possible to compare different query expansion methods using essentially the same index.
  • The authors note that differences in BM25 effectiveness are very small, while the various query expansion methods differ by at most 2% in AP (a worked BM25 example appears after this list).
  • Since Terrier and Anserini are both implemented in Java, API-level integration was not too onerous.
  • Selected Robust04 results from Table 1 (System, AP, P@30): Anserini (BM25+Axiomatic QE), 0.2896, 0.3333; Terrier-Lucene (BM25+Bo1 QE), 0.2890, 0.3356; the table also includes an Anserini (BM25) baseline.
  • The Terrier-Lucene wrapper is available at https://github.com/cmacdonald/terrier-lucene.
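
Since Table 2 compares BM25 variants and the results above show only small BM25 effectiveness differences, a worked sketch of a Robertson-style BM25 term score may help clarify where variants can diverge, typically in the IDF component and in the document-length normalization (see [6]). The parameter values in the example (k1 = 0.9, b = 0.4) are common defaults chosen for illustration, not values prescribed by this paper.

```java
public class Bm25Sketch {
  // Robertson-style BM25 contribution of one query term to a document's score.
  // Variants differ mainly in the IDF formula (e.g., Lucene uses
  // log(1 + (N - df + 0.5) / (df + 0.5)) to avoid negative values) and in how
  // length normalization is applied.
  static double bm25Term(double tf, double docLen, double avgDocLen,
                         long df, long numDocs, double k1, double b) {
    double idf = Math.log((numDocs - df + 0.5) / (df + 0.5));
    double lengthNorm = k1 * (1.0 - b + b * docLen / avgDocLen);
    return idf * (tf * (k1 + 1.0)) / (tf + lengthNorm);
  }

  public static void main(String[] args) {
    // Example: tf = 3 in a 150-term document, average length 120,
    // df = 1,000 out of 500,000 documents, k1 = 0.9, b = 0.4.
    System.out.println(bm25Term(3, 150, 120, 1_000, 500_000, 0.9, 0.4));
  }
}
```
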
Conclusion
  • The authors envision CIFF to be an ongoing, open, and community-driven effort that allows researchers to independently pursue their own lines of inquiry while supporting fair and meaningful evaluations.
  • Additional contributions are most welcome! As these efforts gain traction, the authors envision future research papers adopting “standard” CIFF exports in their experiments; this would have the dual benefit of standardizing empirical methodology and more clearly highlighting the impact of proposed innovations (an illustrative export sketch appears after this list).
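
To give a concrete sense of what an index export contains, the sketch below writes one record per term: the term itself, its document frequency, and delta-encoded (docid, term frequency) pairs. This is only an illustration of the general idea; the actual CIFF specification serializes its records with Protocol Buffers and carries additional metadata (such as a header and per-document records), so the flat binary layout and class names here are hypothetical.

```java
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.List;
import java.util.Map;

public class SimpleIndexExport {
  record Posting(int docid, int tf) {}

  // Write a toy term-to-postings map to a flat binary file. NOT the CIFF format;
  // just a minimal stand-in showing the kind of information an export carries.
  static void export(Map<String, List<Posting>> index, String path) throws IOException {
    try (DataOutputStream out = new DataOutputStream(new FileOutputStream(path))) {
      out.writeInt(index.size());                   // number of terms
      for (Map.Entry<String, List<Posting>> e : index.entrySet()) {
        out.writeUTF(e.getKey());                   // term
        out.writeInt(e.getValue().size());          // document frequency
        int previousDocid = 0;
        for (Posting p : e.getValue()) {            // postings sorted by docid
          out.writeInt(p.docid() - previousDocid);  // docid gap (delta encoding)
          out.writeInt(p.tf());                     // term frequency
          previousDocid = p.docid();
        }
      }
    }
  }

  public static void main(String[] args) throws IOException {
    Map<String, List<Posting>> index = Map.of(
        "retrieval", List.of(new Posting(1, 2), new Posting(7, 1)),
        "index", List.of(new Posting(3, 4)));
    export(index, "toy-export.bin");
  }
}
```
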
Tables
  • Table 1: Comparison of Anserini, Terrier, and the Terrier wrapper for Anserini’s Lucene indexes (Terrier-Lucene) on Robust04.
  • Table 2: Comparison of BM25 variants.
Funding
  • This research was supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada, Compute Ontario and Compute Canada, the Australian Research Council (ARC) Discovery Grant DP170102231, the US National Science Foundation (IIS-1718680), and the research program Commit2Data (project number 628.011.001) financed by the Dutch Research Council (NWO).
References
  • [1] Leonid Boytsov, David Novak, Yury Malkov, and Eric Nyberg. 2016. Off the Beaten Path: Let’s Replace Term-Based Retrieval with k-NN Search. In CIKM. 1099–1108.
  • [2] Chris Buckley. 1985. Implementation of the SMART Information Retrieval System. Department of Computer Science TR 85-686. Cornell University.
  • [3] Ryan Clancy, Nicola Ferro, Claudia Hauff, Jimmy Lin, Tetsuya Sakai, and Ze Zhong Wu. 2019. Overview of the 2019 Open-Source IR Replicability Challenge (OSIRRC 2019). In CEUR Workshop Proceedings Vol-2409. 1–7.
  • [4] Matt Crane, J. Shane Culpepper, Jimmy Lin, Joel Mackenzie, and Andrew Trotman. 2017. A Comparison of Document-at-a-Time and Score-at-a-Time Query Evaluation. In WSDM. 201–210.
  • [5] Chris Kamphuis and Arjen de Vries. 2019. The OldDog Docker Image for OSIRRC at SIGIR 2019. In CEUR Workshop Proceedings Vol-2409. 47–49.
  • [6] Chris Kamphuis, Arjen de Vries, Leonid Boytsov, and Jimmy Lin. 2020. Which BM25 Do You Mean? A Large-Scale Reproducibility Study of Scoring Variants. In ECIR.
  • [7] Jimmy Lin, Matt Crane, Andrew Trotman, Jamie Callan, Ishan Chattopadhyaya, John Foley, Grant Ingersoll, Craig Macdonald, and Sebastiano Vigna. 2016. Toward Reproducible Baselines: The Open-Source IR Reproducibility Challenge. In ECIR. 408–420.
  • [8] Jimmy Lin and Peilin Yang. 2019. The Impact of Score Ties on Repeatability in Document Ranking. In SIGIR. 1125–1128.
  • [9] Craig Macdonald, Richard McCreadie, Rodrygo L.T. Santos, and Iadh Ounis. 2012. From Puppy to Maturity: Experiences in Developing Terrier. In OSIR Workshop at SIGIR. 60–63.
  • [10] Antonio Mallia, Michał Siedlaczek, Joel Mackenzie, and Torsten Suel. 2019. PISA: Performant Indexes and Search for Academia. In CEUR Workshop Proceedings Vol-2409. 50–56.
  • [11] Hannes Mühleisen, Thaer Samar, Jimmy Lin, and Arjen de Vries. 2014. Old Dogs Are Great at New Tricks: Column Stores for IR Prototyping. In SIGIR. 863–866.
  • [12] Andrew Trotman and Matt Crane. 2019. Micro- and Macro-optimizations of SaaT Search. Software: Practice and Experience 49, 5 (2019), 942–950.
  • [13] Andrew Trotman, Xiang-Fei Jia, and Matt Crane. 2012. Towards an Efficient and Effective Search Engine. In SIGIR 2012 Workshop on Open Source Information Retrieval. 40–47.
  • [14] Andrew Trotman, Antti Puurula, and Blake Burgess. 2014. Improvements to BM25 and Language Models Examined. In ADCS. 58–66.
  • [15] Peilin Yang, Hui Fang, and Jimmy Lin. 2018. Anserini: Reproducible Ranking Baselines Using Lucene. Journal of Data and Information Quality 10, 4 (2018), Article 16.
  • [16] Ziying Yang, Alistair Moffat, and Andrew Turpin. 2016. How Precise Does Document Scoring Need to Be? In AIRS. 279–291.