Back to Basics: A Simple Recipe for Improving Out-of-Domain Retrieval in Dense Encoders.
CoRR (2023)
Abstract
Prevailing research practice today often relies on training dense retrievers
on existing large datasets such as MSMARCO and then experimenting with ways to
improve zero-shot generalization capabilities to unseen domains. While prior
work has tackled this challenge through resource-intensive steps such as data
augmentation, architectural modifications, increasing model size, or even
further base model pretraining, comparatively little investigation has examined
whether the training procedures themselves can be improved to yield better
generalization capabilities in the resulting models. In this work, we recommend
a simple recipe for training dense encoders: Train on MSMARCO with
parameter-efficient methods, such as LoRA, and opt for using in-batch negatives
unless given well-constructed hard negatives. We validate these recommendations
using the BEIR benchmark and find that the results persist across the choice of
dense encoder and base model size and are complementary to other resource-intensive
strategies for out-of-domain generalization such as architectural modifications
or additional pretraining. We hope that this thorough and impartial study
of various training techniques, which complements other resource-intensive
methods, offers practical insights for developing a dense retrieval model that
effectively generalizes, even when trained on a single dataset.
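The in-batch negatives objective mentioned in the recipe can be sketched as follows: each query in a batch treats its paired document as the positive and every other document in the same batch as a negative, and training minimizes a softmax cross-entropy over the batch similarity matrix. This is a minimal NumPy sketch of that loss, not the paper's implementation; the function name, temperature value, and cosine-similarity scoring are illustrative assumptions.

```python
import numpy as np

def in_batch_negative_loss(queries, docs, temperature=0.05):
    """Contrastive loss with in-batch negatives (illustrative sketch).

    queries: (B, d) query embeddings; docs: (B, d) document embeddings,
    where docs[i] is the positive for queries[i] and all other rows of
    docs act as negatives for that query.
    """
    # Cosine similarity: L2-normalize, then take all pairwise dot products.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = q @ d.T / temperature                     # (B, B) similarity matrix

    # Log-softmax over each row (numerically stabilized); the diagonal
    # entries are the positives, so the loss is their mean negative log-prob.
    scores = scores - scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

In a real training loop the embeddings would come from the dense encoder (here, fine-tuned with a parameter-efficient method such as LoRA) and the loss would be backpropagated; the sketch above only shows the loss computation itself.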