Building Efficient and Effective OpenQA Systems for Low-Resource Languages
CoRR (2024)
Abstract
Question answering (QA) is the task of answering questions posed in natural
language with free-form natural language answers extracted from a given
passage. In the OpenQA variant, only a question text is given, and the system
must retrieve relevant passages from an unstructured knowledge source and use
them to provide answers, as mainstream QA systems on the Web do. QA systems are
currently limited mostly to English because large-scale labeled QA datasets are
lacking in other languages. In this
paper, we show that effective, low-cost OpenQA systems can be developed for
low-resource languages. The key ingredients are (1) weak supervision using
machine-translated labeled datasets and (2) a relevant unstructured knowledge
source in the target language. Furthermore, we show that only a few hundred
gold assessment examples are needed to reliably evaluate these systems. We
apply our method to Turkish as a challenging case study, since English and
Turkish are typologically very distinct. We present SQuAD-TR, a machine
translation of SQuAD2.0, and we build our OpenQA system by adapting ColBERT-QA
for Turkish. We obtain a performance improvement of 9-34% in the EM score and
13-33% in the F1 score compared to the BM25-based and DPR-based baseline QA
reader models by using two versions of Wikipedia dumps spanning two years. Our
results show that SQuAD-TR makes OpenQA feasible for Turkish, which we hope
encourages researchers to build OpenQA systems in other low-resource languages.
We make all the code, models, and the dataset publicly available.
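
As a rough illustration of the retrieve-then-read setup described above, the following is a minimal sketch, not the authors' system. It assumes a toy in-memory corpus, uses plain token overlap as a stand-in for a real retriever such as ColBERT or BM25, and uses a sentence-picking heuristic in place of a trained reader model. All function names (`tokenize`, `retrieve`, `read`, `answer`) and the sample passages are hypothetical.

```python
import re
from collections import Counter

# Hypothetical toy corpus standing in for an unstructured knowledge source.
CORPUS = [
    "Ankara is the capital of Turkey. It is in Anatolia.",
    "Istanbul is the largest city of Turkey.",
    "ColBERT is a neural retrieval model.",
]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def retrieve(question, passages, k=2):
    """Return the k passages sharing the most tokens with the question.

    Token overlap is an assumption made for brevity; it is not ColBERT or BM25.
    """
    q_counts = Counter(tokenize(question))

    def score(passage):
        # Multiset intersection counts shared tokens, respecting multiplicity.
        return sum((q_counts & Counter(tokenize(passage))).values())

    return sorted(passages, key=score, reverse=True)[:k]


def read(question, passages):
    """Toy reader: pick the sentence with the most distinct question tokens.

    A real reader would extract an answer span with a trained model.
    """
    q_tokens = set(tokenize(question))
    best, best_score = None, 0
    for passage in passages:
        for sentence in re.split(r"(?<=[.!?])\s+", passage):
            overlap = len(q_tokens & set(tokenize(sentence)))
            if overlap > best_score:
                best, best_score = sentence, overlap
    return best


def answer(question, passages, k=2):
    """OpenQA in two stages: retrieve candidate passages, then read them."""
    return read(question, retrieve(question, passages, k))
```

Calling `answer("What is the capital of Turkey?", CORPUS)` first narrows the corpus to the two best-matching passages and then selects a sentence from them. The point of the two-stage design is that only the question is given, so the reader never sees the whole knowledge source.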