Efficient data selection for machine translation.

Arindam Mandal, Dimitra Vergyri, Wen Wang, Jing Zheng, Andreas Stolcke, Gökhan Tür, Dilek Hakkani-Tür, Necip Fazil Ayan

SLT (2008)

Abstract

The performance of statistical machine translation (SMT) systems relies on the availability of a large parallel corpus, which is used to estimate translation probabilities. However, generating such a corpus is a long and expensive process. In this paper, we introduce two methods for efficiently selecting training data to be translated by humans. Our methods are motivated by active learning and aim to choose new data that adds maximal information to the currently available data pool. The first method uses a measure of disagreement between multiple SMT systems, whereas the second uses a perplexity criterion. We performed experiments on Chinese-English data across multiple domains and test sets. Our results show that selecting only one-fifth of the additional training data achieves similar or better translation performance compared with using all available data.
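
The two selection criteria named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact measures, so everything here is an assumption made for illustration: an add-one-smoothed unigram model stands in for the language model behind the perplexity criterion, mean pairwise Jaccard distance between system outputs stands in for the disagreement measure, and the function names (`train_unigram`, `perplexity`, `disagreement`, `select_top`) are hypothetical. The default `fraction=0.2` mirrors the one-fifth selection rate reported in the abstract.

```python
import math
from collections import Counter
from itertools import combinations


def train_unigram(sentences):
    """Count tokens in the current data pool; returns (counts, total, vocab_size)."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values()), len(counts)


def perplexity(sentence, model):
    """Perplexity of a sentence under an add-one-smoothed unigram model.

    Assumption: a unigram model is a stand-in; the paper's language model
    is not described in the abstract.
    """
    counts, total, vocab = model
    tokens = sentence.split()
    if not tokens:
        return 0.0
    # vocab + 1 reserves one extra slot of probability mass for unseen tokens.
    log_prob = sum(math.log((counts[t] + 1) / (total + vocab + 1)) for t in tokens)
    return math.exp(-log_prob / len(tokens))


def disagreement(hypotheses):
    """Mean pairwise Jaccard distance between the token sets of system outputs.

    Assumption: Jaccard distance is a stand-in for the paper's disagreement
    measure between multiple SMT systems.
    """
    pairs = list(combinations(hypotheses, 2))
    if not pairs:
        return 0.0

    def dist(a, b):
        sa, sb = set(a.split()), set(b.split())
        union = sa | sb
        return 1.0 - len(sa & sb) / len(union) if union else 0.0

    return sum(dist(a, b) for a, b in pairs) / len(pairs)


def select_top(candidates, scores, fraction=0.2):
    """Keep the highest-scoring fraction of candidates (at least one)."""
    k = max(1, int(len(candidates) * fraction))
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    return [c for _, c in ranked[:k]]
```

Under either criterion, a higher score marks a candidate sentence as less well covered by the current pool, so the top-scoring sentences would be the ones sent for human translation.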

Keywords

language translation, learning (artificial intelligence), natural language processing, probability, statistical analysis, Chinese-English data, active learning, data pool, data selection, parallel corpus, perplexity criterion, statistical machine translation systems, training data, translation performance, translation probability, machine translation
