Improving Evaluation Of Machine Translation Quality Estimation

Yvette Graham

PROCEEDINGS OF THE 53RD ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 7TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING, VOL 1（2015）

引用 54|浏览515

暂无评分

摘要

Quality estimation evaluation commonly takes the form of measurement of the error that exists between predictions and gold standard labels for a particular test set of translations. Issues can arise during comparison of quality estimation prediction score distributions and gold label distributions, however. In this paper, we provide an analysis of methods of comparison and identify areas of concern with respect to widely used measures, such as the ability to gain by prediction of aggregate statistics specific to gold label distributions or by optimally conservative variance in prediction score distributions. As an alternative, we propose the use of the unit-free Pearson correlation, in addition to providing an appropriate method of significance testing improvements over a baseline. Components of WMT-13 and WMT-14 quality estimation shared tasks are replicated to reveal substantially increased conclusivity in system rankings, including identification of outright winners of tasks.

查看译文

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要