TANDO+: Corpus and Baselines for Document-level Machine Translation in Basque-Spanish and Basque-French

crossref(2024)

Abstract: Context-aware Neural Machine Translation can potentially enhance automated translation quality through effective modelling of context beyond the sentence level. However, suitable corpora for contextual modelling are still scarce, which presents a significant challenge for the training and evaluation of context-aware systems. To address this challenge, we describe TANDO+, a document-level corpus for the under-resourced language pairs Basque-French and Basque-Spanish. We provide a detailed description of this corpus, which is to be shared with the scientific community. The corpus comprises parallel data from diverse domains (literature, subtitles, and news) and incorporates context-level information. It also provides manually crafted contrastive test sets for Basque-Spanish, designed for comprehensive assessment of gender and register contextual phenomena. In addition, we train and evaluate sentence-level baseline models and several state-of-the-art contextual variants. Our results and analyses indicate that the corpus is well-suited to train and evaluate context-aware machine translation systems for the two selected under-resourced language pairs.