
CLIRMatrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval

Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), pages 4160-4170


Abstract

We present CLIRMatrix, a massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval extracted automatically from Wikipedia. CLIRMatrix comprises (1) BI-139, a bilingual dataset of queries in one language matched with relevant documents in another language for 139×138=19,182 language pairs, and (2) MULTI-8, a multilingual dataset of queries and documents jointly aligned in 8 different languages. In total, we mined 49 million unique queries and 34 billion (query, document, label) triplets, making it the largest and most comprehensive CLIR dataset to date.

Introduction
  • Cross-Lingual Information Retrieval (CLIR) is a retrieval task in which search queries and candidate documents are written in different languages.
  • The research community has been actively looking at end-to-end solutions that tackle the CLIR task without the need to build MT systems.
  • This line of work builds upon recent advances in Neural Information Retrieval in the monolingual setting; cf. Mitra and Craswell (2018).
  • One can exploit cross-lingual word embeddings to train a CLIR model on disjoint monolingual corpora (Litschko et al., 2018).
Highlights
  • Cross-Lingual Information Retrieval (CLIR) is a retrieval task in which search queries and candidate documents are written in different languages
  • Translation-based approaches are commonly used to tackle the CLIR task (Zhou et al., 2012; Oard, 1998; McCarley, 1999): the query translation approach translates the query into the language of the documents, whereas the document translation approach translates the documents into the language of the query
  • As we can see in Table 3, the multilingual model (MM) performs better than the respective bilingual models (BM) in most language directions
  • We present CLIRMatrix, the largest and the most comprehensive collection of bilingual and multilingual CLIR datasets to date
  • Our mix-language retrieval experiments on MULTI-8 show that a single multilingual model can significantly outperform the combination of multiple bilingual models
Methods
  • Let $q^X$ be a query in language $X$, and $d^Y$ be a document in language $Y$. A CLIR dataset is a collection of labeled triplets $\{(q_i^X, d_{ij}^Y, r_{ij})\}_{i=1,\dots,I;\, j=1,\dots,J}$, meaning that each query $q_i^X$ searches over the full set of documents $\{d_{ij}^Y\}_{j=1,\dots,J}$.
  • In the re-ranking setup, each query $q_i^X$ searches over a subset of documents obtained by an initial full-collection retrieval engine: $\{d_{ij}^Y\}_{j=1,\dots,K_i}$, where $K_i \ll J$.
  • Machine learning approaches to IR focus on the re-ranking setup with $K_i$ set to 10-1000 (Liu, 2009; Chapelle and Chang, 2011); a data-structure sketch follows this list.
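To make the setup concrete, here is a minimal Python sketch of the triplet layout and the re-ranking candidate pool described above. The `Triplet` type and field names are illustrative, not the released data schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Triplet:
    query_id: str   # index i of the query q_i^X in language X
    doc_id: str     # index j of the candidate document d_ij^Y in language Y
    relevance: int  # graded judgment r_ij (0 = irrelevant, 6 = most relevant)

def rerank_pool(triplets: List[Triplet], query_id: str, k: int = 100) -> List[Triplet]:
    """Collect the K_i candidates one query re-ranks over (K_i << J)."""
    pool = [t for t in triplets if t.query_id == query_id]
    return pool[:k]  # the initial retrieval engine supplies at most k candidates
```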
Results
  • The authors present results on BI-139 for the 138 target languages with English queries. For each language direction, they trained a baseline CLIR model on the base train set and kept the checkpoint with the best NDCG@10 performance on the base validation set.
  • As the authors can see in Table 3, the MM model performs better than the respective BM models in most language directions.
  • This suggests that multilingual training is a promising research direction even for single-language retrieval
Conclusion
  • Conclusion and future work: The authors present CLIRMatrix, the largest and the most comprehensive collection of bilingual and multilingual CLIR datasets to date.
  • The BI-139 dataset supports CLIR in 139×138 language pairs, whereas the MULTI-8 dataset enables mix-language retrieval in 8 languages.
  • The large number of supported language directions allows the research community to explore and build new models for many more languages, especially low-resource ones.
  • The authors' mix-language retrieval experiments on MULTI-8 show that a single multilingual model can significantly outperform the combination of multiple bilingual models.
  • The authors think it will be interesting to look at: 1. zero-shot CLIR models for low-resource languages; 2. comparison of end-to-end neural rankers with traditional translation+IR pipelines in terms of scalability, cost, and retrieval accuracy; 3. advanced neural architectures and training algorithms that can exploit the large training data; and 4. building universal models for multilingual IR
Tables
  • Table1: Results of 138 language directions from BI-139 base with English queries. In each cell, the top shows a candidate’s language code and the bottom shows the NDCG@10 score for that language direction
  • Table2: Different ways of using MULTI-8. A refers to the concatenation of all languages, which is used in mix-language retrieval. S and T refer to the queries/documents in the source and target language under consideration for the bilingual case (i.e., single-language retrieval, similar to the BI-139 setup). For either, it is possible to train bilingual models (BM) on pairwise data or a multilingual model (MM) on all language data
  • Table3: MULTI-8 single-language retrieval results of bilingual models (BM). The rows are the source query language, and the columns are the target document language. The up arrows next to NDCG@10 scores indicate instances where the multilingual model (MM) outperforms the bilingual models
  • Table4: MULTI-8 mix-language retrieval results. The % column shows the percent improvement of MM over the BM z-norm baseline
  • Table5: Comparison of CLIR datasets by number of languages (#Lang), whether it is manually constructed or supports multilingual retrieval, and data statistics. Large #query and #triplets are needed for neural training
Related work
  • Information retrieval (IR) has made a tremendous amount of progress, shifting focus from traditional bag-of-words retrieval functions such as tf-idf (Salton and McGill, 1986) and BM25 (Robertson et al., 2009) to neural IR models (Guo et al., 2016; Hui et al., 2018; McDonald et al., 2018), which have shown promising results on multiple monolingual IR datasets. Recent advances in pretrained language models such as BERT (Devlin et al., 2019) have also led to significant improvements in IR tasks. For example, MacAvaney et al. (2019) achieve state-of-the-art performance on benchmark datasets by incorporating BERT's context vectors into existing baseline neural IR models (McDonald et al., 2018). Training on synthetic data is also common practice; e.g., Dehghani et al. (2017) show that supervised neural ranking models can greatly benefit from pre-training on BM25 labels.

    Cross-lingual Information Retrieval (CLIR) is a sub-field of IR that is becoming increasingly important as new documents in different languages are being generated every day. The field has progressed from translation-based methods (Zhou et al., 2012; Oard, 1998; McCarley, 1999; Yarmohammadi et al., 2019) to recent neural CLIR models (Vulić and Moens, 2015; Litschko et al., 2018; Zhang et al., 2019) that rely on cross-lingual word embeddings. In contrast to the wide availability of monolingual IR datasets (Voorhees, 2005; Craswell et al., 2020), cross-lingual and multilingual IR datasets remain comparatively scarce.
Funding
  • On MULTI-8, we show that a single multilingual model can significantly outperform an ensemble of bilingual models
  • As seen in Table 4, the multilingual model performs significantly better than the ensembled/merged bilingual models
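One plausible reading of the "BM z-norm" merging referenced in Table 4 is sketched below: each bilingual model's raw scores are z-score normalized so its ranked list becomes comparable to the others before merging. The function and variable names are hypothetical, not the authors' released code.

```python
import statistics
from typing import Dict, List, Tuple

def z_normalize(scores: List[float]) -> List[float]:
    """Z-score normalize one model's raw scores so that score lists
    from different bilingual models live on a comparable scale."""
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores) or 1.0  # guard against zero variance
    return [(s - mean) / std for s in scores]

def merge_bilingual_runs(runs: Dict[str, List[Tuple[str, float]]]):
    """runs maps a target language to one bilingual model's list of
    (doc_id, raw_score) pairs; returns a single ranked list across all
    target languages after per-run z-normalization."""
    merged = []
    for lang, pairs in runs.items():
        normed = z_normalize([score for _, score in pairs])
        merged += [(doc, z) for (doc, _), z in zip(pairs, normed)]
    return sorted(merged, key=lambda item: item[1], reverse=True)
```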
Study subjects and analysis
language pairs: 19182
We present CLIRMatrix, a massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval extracted automatically from Wikipedia. CLIRMatrix comprises (1) BI-139, a bilingual dataset of queries in one language matched with relevant documents in another language for 139×138=19,182 language pairs, and (2) MULTI-8, a multilingual dataset of queries and documents jointly aligned in 8 different languages. In total, we mined 49 million unique queries and 34 billion (query, document, label) triplets, making it the largest and most comprehensive CLIR dataset to date

language pairs: 19182
In total, we were able to mine 49 million unique queries in 139 languages and 34 billion (query, document, label) triplets, creating a CLIR collection across a matrix of 139 × 138 = 19,182 language pairs. From this raw collection, we introduce two datasets:

language pairs: 19182
From this raw collection, we introduce two datasets. BI-139 is a massively large bilingual CLIR dataset that covers 139 × 138 = 19,182 language pairs; to encourage reproducibility, we present standard train, validation, and test subsets for every language direction

documents: 4
We first run monolingual IR to find English documents that answer the query. In this figure, 4 documents are returned, and we attempt to link them to the corresponding Chinese versions using Wikidata information. When a link is available, we set the relevance label $r_{ij}$ for the linked Chinese document using the English-based IR system's prediction; all other documents are deemed not relevant
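The propagation step can be sketched as follows, assuming we already have the English relevance labels and a Wikidata-derived mapping from English to Chinese document ids; the function and argument names are illustrative.

```python
from typing import Dict

def propagate_labels(
    english_labels: Dict[str, int],
    wikidata_links: Dict[str, str],
) -> Dict[str, int]:
    """english_labels maps English doc ids to relevance labels r_ij from the
    monolingual IR system; wikidata_links maps English doc ids to their
    Chinese counterparts. Unlinked or unreturned documents get no label,
    i.e. they are treated as not relevant."""
    chinese_labels = {}
    for en_doc, label in english_labels.items():
        zh_doc = wikidata_links.get(en_doc)
        if zh_doc is not None:
            chinese_labels[zh_doc] = label  # assumes topical similarity across the link
    return chinese_labels
```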

relevant documents: 100
We index the documents into an Elasticsearch search engine, which serves as our monolingual IR system. Using the extracted titles as search queries, we retrieve the top 100 relevant documents and their corresponding BM25 scores from Elasticsearch for every query. We then convert the BM25 scores into discrete relevance judgment labels using Jenks natural-breaks optimization
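A hedged sketch of the score-to-label conversion follows. The class boundaries would come from a Jenks natural-breaks routine (e.g., the third-party jenkspy package, whose exact API varies by version), so the sketch takes the breaks as an input to stay library-agnostic.

```python
import numpy as np

def bm25_to_labels(scores: np.ndarray, breaks: list) -> np.ndarray:
    """Map continuous BM25 scores to discrete relevance labels given class
    boundaries 'breaks' from a natural-breaks routine such as Jenks
    optimization. With ascending breaks, np.digitize assigns each score
    an integer label from 0 up to len(breaks); documents never returned
    by the search engine simply stay at label 0."""
    return np.digitize(scores, bins=breaks)
```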

documents: 10000
Inspired by Schwenk et al. (2019), we extracted document ids, titles, and bodies from Wikipedia's search indices instead, which contain raw text data without meta-information. We discarded dumps with fewer than ten thousand documents, which are usually the Wikipedia dumps of certain dialects and less commonly used languages. We are left with Wikipedia dumps in 139 languages, containing a good mix of high-, mid-, and low-resource languages

tokens of Wikipedia articles: 100
Truncating the documents is necessary for several reasons. First, shorter documents are friendlier to neural models bounded by GPU memory. Second, the first few hundred tokens of a Wikipedia article usually carry the main points of the full text and are thus more likely to be topically similar across languages. Last but not least, BM25 tends to over-penalize long documents, which can lead to suboptimal IR performance (Lv and Zhai, 2011)

documents: 100
For every query, we configured Elasticsearch to search both document titles and document bodies, with twice the weight given to document titles. We limit Elasticsearch to return only the top 100 documents for each query and assume documents not returned by the search engine are irrelevant. We parallelized retrieval by running multiple Elasticsearch instances on numerous servers, dedicating one instance to each language
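A minimal sketch of such a query, assuming the official elasticsearch Python client (8.x-style keyword API) and illustrative index/field names; the "^2" boost syntax is standard Elasticsearch DSL for weighting title matches twice as much as body matches.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical local instance

def retrieve_top_100(query_text: str, index: str = "wiki_en"):
    """Search titles and bodies, weighting title matches twice as much,
    and keep only the top 100 hits; everything else is assumed irrelevant."""
    resp = es.search(
        index=index,
        size=100,  # documents beyond the top 100 are treated as irrelevant
        query={
            "multi_match": {
                "query": query_text,
                "fields": ["title^2", "body"],  # "^2" doubles the title weight
            }
        },
    )
    return [(hit["_id"], hit["_score"]) for hit in resp["hits"]["hits"]]
```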

candidate documents: 100
We ensured that queries in the train and validation/test sets of one language direction do not overlap with the queries in the test sets of other language directions. For every query, we ensure there are precisely K = 100 candidate documents by filling the shortfall with random irrelevant documents. MULTI-8 is a multilingual CLIR dataset covering 8 languages from various regions of the world (Arabic, German, English, Spanish, French, Japanese, Russian, and Chinese)

documents: 100
First, we restricted queries to those with a relevant document ($r_{ij} = 6$) in all 8 languages. Then, for each query $q_i^X$, we use the monolingual IR systems to collect 100 documents in the same language. Similar to BI-139 base, if Elasticsearch returns fewer than 100 documents with labels $r_{ij} \geq 1$, we fill up the shortfall with random irrelevant documents with label $r_{ij} = 0$ (sketched below). Finally, we merge these document lists such that for any query in language X, we have 7 × 100 documents in the other 7 languages. As in the base version of BI-139, the train sets contain 10,000 queries, while the validation, test1, and test2 sets contain 1,000 queries each; note that the query sets are different
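The padding step described above can be sketched as follows; the function and argument names are illustrative, and the sketch assumes the document pool is large enough to cover the shortfall.

```python
import random
from typing import List, Tuple

def pad_to_k(labeled_docs: List[Tuple[str, int]], doc_pool: List[str], k: int = 100):
    """labeled_docs holds (doc_id, r_ij) pairs with r_ij >= 1 returned by
    Elasticsearch; doc_pool is the rest of the collection in that language.
    Random irrelevant documents (label 0) fill the shortfall up to k."""
    have = {doc for doc, _ in labeled_docs}
    shortfall = k - len(labeled_docs)
    fillers = random.sample([d for d in doc_pool if d not in have], shortfall)
    return labeled_docs + [(doc, 0) for doc in fillers]
```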

training pairs: 1000
We then optimize the parameters with a pairwise hinge loss and the Adam optimizer. We trained all models for 20 epochs and sampled around 1,000 training pairs for each epoch. At inference time, we rerank documents based on the output scores from the BERT ranker model
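A minimal PyTorch sketch of one pairwise hinge-loss update, assuming `model` stands in for a BERT-based ranker that scores a (query, document) pair; this is an illustration of the training objective, not the authors' released training code.

```python
import torch

margin_loss = torch.nn.MarginRankingLoss(margin=1.0)

def hinge_step(model, query, pos_doc, neg_doc, optimizer):
    """One pairwise update: the ranker should score the more relevant
    document above the less relevant one by at least the margin."""
    s_pos = model(query, pos_doc)    # score for the higher-labelled document
    s_neg = model(query, neg_doc)    # score for the lower-labelled document
    target = torch.ones_like(s_pos)  # +1 means s_pos should outrank s_neg
    loss = margin_loss(s_pos, s_neg, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```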

returned documents: 10
Evaluation metric: We report all results in NDCG (normalized discounted cumulative gain), an IR metric that measures the usefulness of documents based on their ranks in the search results (Järvelin and Kekäläinen, 2002). Following common practice in the IR community, we calculate NDCG@10, which only evaluates the top 10 returned documents. For a given query, let $\rho_i$ be the relevance judgment label of the i-th document in the predicted document ranking and $\phi_i$ the label of the i-th document in the optimal ranking; NDCG@10 is then the DCG over the $\rho_i$ divided by the DCG over the $\phi_i$.
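For concreteness, a small Python implementation of NDCG@10 using one common linear-gain formulation of DCG (some variants use a 2^label - 1 gain instead):

```python
import math

def dcg(labels):
    """Discounted cumulative gain over a ranked list of relevance labels:
    each label is discounted by log2 of its (1-indexed) rank plus one."""
    return sum(label / math.log2(rank + 2) for rank, label in enumerate(labels))

def ndcg_at_10(predicted_labels, all_labels):
    """predicted_labels: labels rho_i in the model's ranking order;
    all_labels: every candidate's label, sorted to build the ideal
    ranking phi_i. Only the top 10 positions are evaluated."""
    ideal = sorted(all_labels, reverse=True)
    idcg = dcg(ideal[:10])
    return dcg(predicted_labels[:10]) / idcg if idcg > 0 else 0.0
```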

pairs: 56
MULTI-8 enables evaluation in two scenarios (see Table 2). Single-language retrieval is similar to BI-139 in terms of evaluation, i.e., during test we only have queries in the source language ($q^X = S_{\text{test}}$) and documents in one target language ($d^Y = T_{\text{test}}$). We divide the MULTI-8 test set into 8 × 7 = 56 pairs. For training, we compare a bilingual model ($\text{BM}_{S \to T}$) trained on every language pair against a multilingual model (MM) trained on data concatenated from all 56 language directions

English documents: 1226741
Extracting CLIR datasets from Wikipedia has been explored in previous work. Schamoni et al (2014) build a German–English bilingual CLIR dataset from Wikipedia, which contains 245,294 German queries and 1,226,741 English documents. They convert the first sentences from German Wikipedia documents into queries and follow Wikipedia’s interlanguage links to find relevant documents in English

language pairs: 138
Table caption: Comparison of CLIR datasets by number of languages (#Lang), whether each is manually constructed or supports multilingual retrieval, and data statistics; large #query and #triplets are needed for neural training.
Figure caption: Illustration of our CLIRMatrix collection. The BI-139 portion supports research in bilingual retrieval and covers a matrix of 139 × 138 language pairs; the MULTI-8 portion supports research in multilingual modeling and mixed-language retrieval, where queries and documents are jointly aligned over 8 languages.
Figure caption: Intuition of CLIR relevance label synthesis. For the English query "Barack Obama", a monolingual IR engine (Elasticsearch) first labels documents in English; then Wikidata links are exploited to propagate the labels to the corresponding Chinese documents, which are assumed to be topically similar.

References
  • James Allan and Hema Raghavan. 2002. Using part-of-speech patterns to reduce query ambiguity. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
  • Nicholas J. Belkin, Diane Kelly, G. Kim, J.-Y. Kim, H.-J. Lee, Gheorghe Muresan, M.-C. Tang, X.-J. Yuan, and Colleen Cool. 2003. Query length in interactive information retrieval. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 205-212.
  • Olivier Chapelle and Yi Chang. 2011. Yahoo! Learning to Rank Challenge overview. In Proceedings of the Learning to Rank Challenge, pages 1-24.
  • CLEF 2000-2003. The CLEF Test Suite for the CLEF 2000-2003 Campaigns - Evaluation Package. https://catalog.elra.info/en-us/repository/browse/ELRA-E0008/.
  • Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M. Voorhees. 2020. Overview of the TREC 2019 deep learning track. arXiv preprint arXiv:2003.07820.
  • Mostafa Dehghani, Hamed Zamani, Aliaksei Severyn, Jaap Kamps, and W. Bruce Croft. 2017. Neural ranking models with weak supervision. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 65-74.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Nicola Ferro and Gianmaria Silvello. 2015. CLEF 2000-2014: Lessons learnt from ad hoc retrieval. In Proceedings of the 6th Italian Information Retrieval Workshop, Cagliari, Italy, May 25-26, 2015, volume 1404 of CEUR Workshop Proceedings. CEUR-WS.org.
  • Jiafeng Guo, Yixing Fan, Qingyao Ai, and W. Bruce Croft. 2016. A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, pages 55-64.
  • Kai Hui, Andrew Yates, Klaus Berberich, and Gerard de Melo. 2018. Co-PACRR: A context-aware neural IR model for ad-hoc retrieval. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 279-287.
  • Bernard J. Jansen, Danielle L. Booth, and Amanda Spink. 2008. Determining the informational, navigational, and transactional intent of web queries. Information Processing & Management, 44(3):1251-1266.
  • Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS), 20(4):422-446.
  • Zhuolin Jiang, Amro El-Jaroudi, William Hartmann, Damianos Karakos, and Lingjun Zhao. 2020. Cross-lingual information retrieval with BERT.
  • Robert Litschko, Goran Glavaš, Simone Paolo Ponzetto, and Ivan Vulić. 2018. Unsupervised cross-lingual information retrieval using monolingual data only. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 1253-1256.
  • Tie-Yan Liu. 2009. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225-331.
  • Yuanhua Lv and ChengXiang Zhai. 2011. When documents are very long, BM25 fails! In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1103-1104.
  • Sean MacAvaney, Andrew Yates, Arman Cohan, and Nazli Goharian. 2019. CEDR: Contextualized embeddings for document ranking. In SIGIR.
  • MATERIAL. 2017. Machine Translation for English Retrieval of Information in Any Language (MATERIAL). https://www.iarpa.gov/index.php/research-programs/material.
  • J. Scott McCarley. 1999. Should we translate the documents or the queries in cross-language information retrieval? In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 208-214. Association for Computational Linguistics.
  • Ryan McDonald, George Brokos, and Ion Androutsopoulos. 2018. Deep relevance ranking using enhanced document-query interactions. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1849-1860.
  • Robert McMaster and Susanna McMaster. 2002. A history of twentieth-century American academic cartography. Cartography and Geographic Information Science, 29(3):305-321.
  • Bhaskar Mitra and Nick Craswell. 2018. An introduction to neural information retrieval. Foundations and Trends in Information Retrieval, 13(1):1-126.
  • Douglas W. Oard. 1998. A comparative study of query and document translation for cross-language information retrieval. In Conference of the Association for Machine Translation in the Americas, pages 472-483. Springer.
  • Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333-389.
  • Gerard Salton and Michael J. McGill. 1986. Introduction to Modern Information Retrieval.
  • Shota Sasaki, Shuo Sun, Shigehiko Schamoni, Kevin Duh, and Kentaro Inui. 2018. Cross-lingual learning-to-rank with shared representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 458-463.
  • Jacques Savoy. 2003. Report on CLEF-2003 multilingual tracks. In Workshop of the Cross-Language Evaluation Forum for European Languages, pages 64-73. Springer.
  • Jacques Savoy and Martin Braschler. 2019. Lessons Learnt from Experiments on the Ad Hoc Multilingual Test Collections at CLEF, pages 177-200. Springer International Publishing, Cham.
  • Shigehiko Schamoni, Felix Hieber, Artem Sokolov, and Stefan Riezler. 2014. Learning translational and knowledge-based similarities from relevance rankings for cross-language retrieval. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 488-494.
  • Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong, and Francisco Guzmán. 2019. WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia. arXiv preprint arXiv:1907.05791.
  • Ming-Feng Tsai, Yu-Ting Wang, and Hsin-Hsi Chen. 2008. A study of learning a merge model for multilingual information retrieval. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 195-202.
  • Ellen M. Voorhees. 2005. The TREC robust retrieval track. In ACM SIGIR Forum, volume 39, pages 11-20. ACM, New York, NY, USA.
  • Ivan Vulić and Marie-Francine Moens. 2015. Monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 363-372.
  • Mahsa Yarmohammadi, Xutai Ma, Sorami Hisamoto, Muhammad Rahman, Yiming Wang, Hainan Xu, Daniel Povey, Philipp Koehn, and Kevin Duh. 2019. Robust document representations for cross-lingual information retrieval in low-resource settings. In Proceedings of Machine Translation Summit XVII Volume 1: Research Track, pages 12-20, Dublin, Ireland. European Association for Machine Translation.
  • Ilya Zavorin, Aric Bills, Cassian Corey, Michelle Morrison, Audrey Tong, and Richard Tong. 2020. Corpora for cross-language information retrieval in six less-resourced languages. In Proceedings of the Workshop on Cross-Language Search and Summarization of Text and Speech (CLSSTS 2020), pages 7-13, Marseille, France. European Language Resources Association.
  • Rabih Zbib, Lingjun Zhao, Damianos Karakos, William Hartmann, Jay DeYoung, Zhongqiang Huang, Zhuolin Jiang, Noah Rivkin, Le Zhang, Richard Schwartz, and John Makhoul. 2019. Neural-network lexical translation for cross-lingual IR from text and speech. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '19, pages 645-654, New York, NY, USA. Association for Computing Machinery.
  • Rui Zhang, Caitlin Westerfield, Sungrok Shim, Garrett Bingham, Alexander Richard Fabbri, William Hu, Neha Verma, and Dragomir Radev. 2019. Improving low-resource cross-lingual document retrieval by reranking with deep bilingual representations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3173-3179.
  • Dong Zhou, Mark Truran, Tim Brailsford, Vincent Wade, and Helen Ashman. 2012. Translation techniques in cross-language information retrieval. ACM Computing Surveys (CSUR), 45(1):1-44.