Summary: This paper proposes MORES (Modularized Reranking System), a modular Transformer ranking framework that decouples ranking into Document Representation, Query Representation, and Interaction modules.

Modularized Transformer-based Ranking Framework

EMNLP 2020, pp. 4180–4190.

Abstract

Recent innovations in Transformer-based ranking models have advanced the state-of-the-art in information retrieval. However, these Transformers are computationally expensive, and their opaque hidden states make it hard to understand the ranking process. In this work, we modularize the Transformer ranker into separate modules for text representation and interaction.

Introduction
  • Neural rankers based on Transformer architectures (Vaswani et al., 2017) fine-tuned from BERT (Devlin et al., 2019) achieve current state-of-the-art (SOTA) ranking effectiveness (Nogueira and Cho, 2019; Craswell et al., 2019).
  • The entire ranker runs like a black box and hidden states have no explicit meanings.
  • This represents a clear distinction from earlier neural ranking models that keep separate text representation and distance functions.
  • Transformer rankers are slow (Nogueira et al., 2019), and the black-box design makes it hard to interpret their behavior.
Highlights
  • Neural rankers based on Transformer architectures (Vaswani et al., 2017) fine-tuned from BERT (Devlin et al., 2019) achieve current state-of-the-art (SOTA) ranking effectiveness (Nogueira and Cho, 2019; Craswell et al., 2019)
  • The results show that Modularized Reranking System (MORES) can achieve ranking accuracy competitive with state-of-the-art ranking models, and suggest that the entangled and computationally expensive full-attention Transformer can be replaced by MORES’s lightweight, modularized design
  • We investigate Interaction Block (IB) initialization and compare MORES 2× IB initialized by our proposed initialization method against random initialization
  • State-of-the-art neural rankers based on the Transformer architecture consider all token pairs in a concatenated query and document sequence
  • This paper proposes MORES, a modular Transformer ranking framework that decouples ranking into Document Representation, Query Representation, and Interaction
Methods
  • A typical Transformer ranker takes in the concatenation of a query qry and a document doc as input.
  • The Transformer generates a new contextualized embedding for each token based on its attention to all tokens in the concatenated text.
  • This formulation poses two challenges.
  • In terms of speed, the attention consumes time quadratic to the input length.
  • As query and document attention is entangled from the first layer, it is challenging to interpret the model
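The decoupling that MORES proposes can be illustrated numerically. The sketch below is purely illustrative and assumes toy shapes; `encode` is a stand-in for a full Transformer representation module, not the paper's released code. The key property it demonstrates is that the document side runs without seeing the query, so it can be cached, and the online interaction only computes query-to-document attention (q·d work) instead of full attention over the concatenation ((q+d)² work).

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

def encode(tokens, W):
    # Stand-in for a Transformer representation module: one linear map.
    return tokens @ W

# Offline, once per document: the Document Representation Module runs
# without seeing any query, so its output can be precomputed and cached.
W_doc = rng.normal(size=(dim, dim))
W_qry = rng.normal(size=(dim, dim))
doc_tokens = rng.normal(size=(128, dim))   # d = 128 document token embeddings
doc_repr = encode(doc_tokens, W_doc)       # cached, query-independent

# Online, per query: encode the short query, then let the interaction
# step attend from query tokens to the cached document representation.
qry_tokens = rng.normal(size=(16, dim))    # q = 16 query token embeddings
qry_repr = encode(qry_tokens, W_qry)

attn = qry_repr @ doc_repr.T                           # q x d attention scores
attn = np.exp(attn - attn.max(axis=1, keepdims=True))  # softmax over doc tokens
attn /= attn.sum(axis=1, keepdims=True)
score = float((attn @ doc_repr).mean())                # toy relevance score
```

In the real model the interaction also stacks multiple attention blocks (the paper's Interaction Blocks), but the asymmetry shown here, cached document side versus cheap online query side, is the core of the design.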
Conclusion
  • State-of-the-art neural rankers based on the Transformer architecture consider all token pairs in a concatenated query and document sequence.
  • Though effective, they are slow and challenging to interpret.
  • This paper proposes MORES, a modular Transformer ranking framework that decouples ranking into Document Representation, Query Representation, and Interaction.
  • MORES is effective while being efficient and interpretable
Tables
  • Table1: Time complexity of MORES and a typical Transformer ranker, e.g., a standard BERT ranker. We write q for query length, d for document length, n for Transformer’s hidden layer dimension, and Ndoc for number of candidate documents to be ranked for each query. For interaction, Reuse-S1 corresponds to document representation reuse strategy, and Reuse-S2 projected document representation reuse strategy
  • Table2: Effectiveness of MORES models and baseline rankers on the MS MARCO Passage Corpus. ∗ and † indicate non-inferiority (Section 4.1) with p < 0.05 to the BERT ranker using a 5% or 2% margin, respectively
  • Table3: Ranking Accuracy of MORES when using / not using attention weights copied from BERT to initialize Interaction Module. The models were tested on the MS MARCO dataset with the Dev Queries
  • Table4: Average time in seconds to evaluate one query with 1,000 candidate documents, and the space used to store pre-computed representations for each document. Len: input document length
  • Table5: Domain adaptation on ClueWeb09-B. adapt-interaction and adapt-representation use MORES 2× IB. ∗ and † indicate non-inferiority (Section 4.1) with p < 0.05 to the BERT ranker using a 5% or 2% margin, respectively
  • Table6: Domain adaptation on Robust04. adapt-interaction and adapt-representation use MORES 2× IB. ∗ and † indicate non-inferiority (Section 4.1) with p < 0.05 to the BERT ranker using a 5% or 2% margin, respectively
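Table 1's complexity claim can be checked with back-of-envelope arithmetic in Table 1's own symbols (q = query length, d = document length, Ndoc = candidates per query), ignoring the hidden dimension n and the document encoding, which MORES precomputes offline. The concrete lengths below match the paper's speed experiments (q = 16, d = 512, 1000 candidates):

```python
q, d, Ndoc = 16, 512, 1000

full_attention = Ndoc * (q + d) ** 2   # BERT ranker: (q+d)^2 attention per pair
decoupled = Ndoc * (q * d + q * q)     # query-to-doc attention + query self-attention
ratio = full_attention / decoupled
print(ratio)  # 33.0 for these lengths
```

So for long documents and short queries the online attention cost drops by over an order of magnitude, which is the intuition behind the speedups reported in Table 4.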
Related work
  • Neural ranking models for IR proposed in previous studies can be generally classified into two groups (Guo et al, 2016): representation-based models, and interaction-based models.

    Representation-based models learn latent vectors (embeddings) of queries and documents and use a simple scoring function (e.g., cosine) to measure the relevance between them. Such methods date back to LSI (Deerwester et al., 1990) and classical siamese networks (Bromley et al., 1993). More recent research considered using modern deep learning techniques to learn the representations. Examples include DSSM (Huang et al., 2013), C-DSSM (Shen et al., 2014), etc. Representation-based models are efficient during evaluation because the document representations are independent of the query, and therefore can be pre-computed. However, compressing a document into a single low-dimensional vector loses specific term matching signals (Guo et al., 2016). As a result, previous representation-based ranking models mostly fail to outperform interaction-based ones.
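The representation-based recipe described above is simple enough to sketch end to end. This is an illustrative toy, not any particular model from the literature; the vectors and document IDs are made up:

```python
import numpy as np

def cosine(u, v):
    # The cheap scoring function representation-based models rely on.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Offline: each document is collapsed into a single vector -- the step
# that discards exact term-matching signals (Guo et al., 2016).
doc_vecs = {
    "d1": np.array([1.0, 0.0, 1.0]),
    "d2": np.array([0.0, 1.0, 1.0]),
}

# Online: embed the query once, then rank every cached document vector.
qry_vec = np.array([1.0, 0.0, 0.5])
ranked = sorted(doc_vecs, key=lambda d: cosine(qry_vec, doc_vecs[d]), reverse=True)
print(ranked)  # ['d1', 'd2']
```

MORES keeps this query-independence on the document side but replaces the single-vector bottleneck with token-level representations and an attention-based Interaction Module.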
Funding
  • This work was supported in part by National Science Foundation (NSF) grant IIS-1815528
Study subjects and analysis
Transformers: 3
3.3 MORES Training and Initialization. MORES needs to learn three Transformers: two Representation Modules and one Interaction Module. The three Transformer modules are coupled during training and decoupled when used. To train MORES, we connect the three Transformers and enforce module coupling with end-to-end training using the pointwise loss function (Dai and Callan, 2019). When training is finished, we store the three Transformer modules separately and apply each module at the desired offline/online time
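The pointwise loss mentioned above treats each (query, document) pair as an independent binary classification. A minimal sketch, assuming sigmoid + binary cross-entropy as in Dai and Callan (2019); the scores and labels are illustrative:

```python
import math

def pointwise_loss(score, label):
    # Binary cross-entropy on one (query, document) pair. During training
    # this single scalar backpropagates through the Interaction Module and
    # both Representation Modules, which is what keeps the three coupled.
    p = 1.0 / (1.0 + math.exp(-score))   # sigmoid of the ranker's score
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

relevant_loss = pointwise_loss(score=2.0, label=1)    # small: confident and correct
irrelevant_loss = pointwise_loss(score=2.0, label=0)  # large: confident and wrong
```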

documents: 1000
Following Craswell et al. (2019), we used MRR, NDCG@10, and MAP@1000 as evaluation metrics. All methods were evaluated in a reranking task, re-ranking the top 1000 documents of the MS MARCO official BM25 retrieval results. We test MORES's effectiveness with a varied number of Interaction Blocks (IB) to study the effects of varying the complexity of query-document interaction
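Of the metrics above, MRR is the simplest to state precisely: the reciprocal of the rank of the first relevant document, averaged over queries. A self-contained sketch with made-up relevance labels:

```python
def mean_reciprocal_rank(rankings):
    # Each ranking is a list of 0/1 relevance labels in rank order. The
    # reciprocal rank is 1/position of the first relevant document, or 0
    # if none of the reranked candidates is relevant.
    total = 0.0
    for labels in rankings:
        for i, rel in enumerate(labels, start=1):
            if rel:
                total += 1.0 / i
                break
    return total / len(rankings)

# Query 1: first relevant hit at rank 2; query 2: at rank 1.
print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))  # 0.75
```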

candidate documents: 1000
Additional IB layers incur more computation but do not improve effectiveness, and are hence not considered. We record the average time for ranking one query with 1000 candidate documents on an 8-core CPU and a single GPU. We measured ranking speed with documents of length 128 and 512 with a fixed query length of 16


Reference
  • Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Sackinger, and Roopak Shah. 1993. Signature verification using a Siamese time delay neural network. In Advances in Neural Information Processing Systems, pages 737–744.
  • Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.
  • Nick Craswell, Bhaskar Mitra, Emine Yilmaz, and Daniel Campos. 2019. Overview of the TREC 2019 deep learning track. In TREC (to appear).
  • Zhuyun Dai and Jamie Callan. 2019. Deeper text understanding for IR with contextual neural language modeling. In The 42nd International ACM SIGIR Conference on Research & Development in Information Retrieval.
  • Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.
  • Jiafeng Guo, Yixing Fan, Qingyao Ai, and W. Bruce Croft. 2016. A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, pages 55–64.
  • Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  • Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry P. Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In 22nd ACM International Conference on Information and Knowledge Management, pages 2333–2338.
  • Gaya K. Jayasinghe, William Webber, Mark Sanderson, Lasitha S. Dharmasena, and J. Shane Culpepper. 2015. Statistical comparisons of non-deterministic IR systems using two dimensional variance. Information Processing & Management, 51(5):677–694.
  • Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.
  • Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085.
  • Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019. Document expansion by query prediction. arXiv preprint arXiv:1904.08375.
  • Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035.
  • Yifan Qiao, Chenyan Xiong, Zheng-Hao Liu, and Zhiyuan Liu. 2019. Understanding the behaviors of BERT in ranking. arXiv preprint arXiv:1904.07531.
  • Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
  • Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. 2014. Learning semantic representations using convolutional neural networks for web search. In Proceedings of the 23rd International World Wide Web Conference, pages 373–374.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.
  • Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. arXiv preprint arXiv:1905.09418.
  • Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
  • Zhijing Wu, Jiaxin Mao, Yiqun Liu, Min Zhang, and Shaoping Ma. 2019. Investigating passage-level relevance and its role in document-level relevance judgment. In Proceedings of the 42nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
  • Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. 2017. End-to-end neural ad-hoc ranking with kernel pooling. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 55–64.
  • Training Details. On the MS MARCO passage ranking dataset, we trained MORES over a 2M subset of MS MARCO's training set. We use stochastic gradient descent to train the model with a batch size of 128. We use the AdamW optimizer with a learning rate of 3e-5, a warm-up of 1000 steps, and a linear learning rate scheduler for all MORES variants. Our baseline BERT model is trained with a similar setup to match the performance reported in (Nogueira and Cho, 2019). We did no hyper-parameter search; all training setup is inherited from the GLUE example in the HuggingFace Transformers code base (Wolf et al., 2019). Following (Dai and Callan, 2019), we run a domain adaptation experiment on ClueWeb09-B: we take the model trained on MS MARCO and continue training over ClueWeb09-B's training data in a 5-fold cross-validation setup. We use a batch size of 32 and a learning rate of 5e-6, selected from batch sizes of 16 and 32 and learning rates of 5e-6, 1e-5, and 2e-5 by validation pointwise accuracy.
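The warm-up-then-linear-decay schedule used above is easy to state as a function. This sketch is a generic implementation of that schedule, not the paper's code; the paper specifies the base learning rate (3e-5) and warm-up (1000 steps), while `total_steps` here is an assumed illustrative value:

```python
def lr_at_step(step, base_lr=3e-5, warmup=1000, total_steps=100_000):
    # Linear warm-up from 0 to base_lr over `warmup` steps, then linear
    # decay back to 0 at `total_steps` (total_steps is an assumption).
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup))
```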
  • The first is available at https://microsoft.github.io/msmarco/ and the latter two at http://boston.lti.cs.cmu.edu/