Can a Multichoice Dataset be Repurposed for Extractive Question Answering?
arxiv(2024)
摘要
The rapid evolution of Natural Language Processing (NLP) has favored major
languages such as English, leaving a significant gap for many others due to
limited resources. This is especially evident in the context of data
annotation, a task whose importance cannot be underestimated, but which is
time-consuming and costly. Thus, any dataset for resource-poor languages is
precious, in particular when it is task-specific. Here, we explore the
feasibility of repurposing existing datasets for a new NLP task: we repurposed
the Belebele dataset (Bandarkar et al., 2023), which was designed for
multiple-choice question answering (MCQA), to enable extractive QA (EQA) in the
style of machine reading comprehension. We present annotation guidelines and a
parallel EQA dataset for English and Modern Standard Arabic (MSA). We also
present QA evaluation results for several monolingual and cross-lingual QA
pairs including English, MSA, and five Arabic dialects. Our aim is to enable
others to adapt our approach for the 120+ other language variants in Belebele,
many of which are deemed under-resourced. We also conduct a thorough analysis
and share our insights from the process, which we hope will contribute to a
deeper understanding of the challenges and the opportunities associated with
task reformulation in NLP research.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要